[python crawler] 15. Scrapy framework practice (crawling popular positions)

Preface

In the previous level, we learned about the structure and working principles of the Scrapy framework through the metaphor of a "Scrapy crawler company".
In the Scrapy crawler company, the engine is the big boss, leading the four departments: the scheduler, the downloader, the crawler and the data pipeline.

These four departments all take orders from the engine and treat the engine's needs as their top priority.

We also became familiar with the use of Scrapy through a practical project of crawling Douban's Top 250 books.
In this level, I will take you to practice a larger project - using Scrapy to crawl recruitment information from recruitment websites.

You can use it to experience what it feels like to be the CEO of a Scrapy crawler company, directing the entire Scrapy operation with code.

So which recruitment website do we crawl? Among the many recruitment websites, I chose Zhiyouji. This site indexes the latest positions from hundreds of recruitment websites across the country.

Now, please use your browser to open the URL link of Zhiyouji (be sure to open it):

https://www.jobui.com/rank/company/view/beijing/

I am using the Beijing page. You can change the region according to your needs.

We first make a preliminary observation of this website so that we can clarify the crawling goals of the project.

Clarify the goal

After opening the website, you will find that this is the regional enterprise ranking page of the Zhiyouji website, which contains the lists of this month's popular companies.

Click [Beijing Douyin Information Service Co., Ltd.] to jump to the company's details page, then click [Recruitment] to see all the positions this company is recruiting for.

After this preliminary observation, we can set the crawling goal as: first crawl the companies in the four lists of the enterprise rankings, then crawl the recruitment information of these companies.

There are 10 companies in each list, so 40 companies across the four lists. In other words, we must first crawl these 40 companies from the enterprise rankings, then jump to each company's recruitment information page and crawl the company name, position name, work location and recruitment requirements.

Analysis process

After clarifying the goal, we start the analysis. First of all, we need to find out where the company information in the enterprise rankings is hidden.

Company information in the enterprise rankings

Please right-click the page and choose "Inspect" to open the developer tools, click Network, and refresh the page. Click the 0th request, beijing/, look at the Response, and check whether the company information from the lists is in it.

After searching, I found that all the company information from the four lists is there: the company ranking information is hidden in the HTML.

Now please click Elements, activate the element selector, and move the mouse to [Beijing Douyin Information Service Co., Ltd.]. The <a> element containing the company information will then be located.

Clicking the link with href="/company/10375749/" jumps to that company's details page. For example, the URL of ByteDance's details page is:

https://www.jobui.com/company/17376344/

We can guess that /company/ + number / is the company ID. With this observation, we can work out the URL pattern of the company details pages on the list.


Then we only need to extract the value of the href attribute of the <a> element to construct the URL of each company's details page.

We construct the URL of the company details page so that we can later fetch the recruitment information on it.

Now, let's analyze the structure of the HTML and see how to extract the value of the href attribute of the <a> element.


If you look closely at the structure of the HTML, you will find that each company's information is hidden in a <div class="c-company-list"> element. This div tag contains two divs: <div class="company-logo-box"> and <div class="company-content-box">.

The <a> tag we want is inside <div class="company-logo-box">.

We want to get the value of the href attribute of all these <a> elements. Of course, we can't directly use find_all() to grab the <a> tags. The reason is simple: this page has far too many <a> tags, and we would capture a lot of information we don't want.

A safer approach is to grab the outermost <div> tag first, then grab the <a> element inside that <div>, and finally extract the value of the <a> element's href attribute. Just like peeling an onion, start from the outermost layer.

At this point in the analysis, we know the URL pattern of the company details pages and how to extract the value of the href attribute of the <a> element.
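If you want to verify this idea before writing the spider, here is a minimal standalone sketch (outside Scrapy) using requests and BeautifulSoup; the User-Agent header is just an example browser string I added, not something required by the site:

import requests
import bs4

# Fetch the enterprise rankings page (example User-Agent, added as an assumption)
res = requests.get('https://www.jobui.com/rank/company/view/beijing/',
                   headers={'User-Agent': 'Mozilla/5.0'})
bs = bs4.BeautifulSoup(res.text, 'html.parser')
for company in bs.find_all('div', class_='c-company-list'):
    a_tag = company.find('a')        # the first <a> inside the outer <div>
    company_id = a_tag['href']       # e.g. '/company/17376344/'
    print('https://www.jobui.com{id}jobs'.format(id=company_id))   # that company's recruitment page URL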

Next, what we need to analyze is the details page of each company.

Recruitment information on the company details page

We open the details page of [Beijing ByteDance Technology Co., Ltd.] and click [Recruitment]. The URL changes, with an extra jobs segment appended.


If you click through several companies' detail pages and check their recruitment information, you will see that the URLs of company recruitment pages also follow a regular pattern.

Next, we need to find where the company’s recruitment information exists.

Still on ByteDance's recruitment information page, right-click and choose "Inspect", click Network, and refresh the page. Click the 0th request, jobs/, view the Response, and search for this company's recruitment information.


We found the recruitment information we wanted in Response. This shows that the company's recruitment information is still hidden in HTML.

Next, you should know what to analyze.

The analysis routine is the same: we know the data is hidden in the HTML, so we analyze the HTML structure and figure out how to extract the data we want.

Then click Elements as usual, activate the element selector, and move the mouse to the company name.

The company name is hidden in the text of the <a> element under the <div class="company-banner-title"> tag. Logically, we could use the class attribute to locate this <div> tag and take out its text to get the company name.

However, after several experiments, I found that the Zhiyouji website changes the class name of this tag every once in a while (the tag name you see now may not be the same later).

To make sure we can always get the company name, we use the id attribute (id="companyH1") to locate this tag instead. This way, no matter how the tag's class name changes, we can still catch it.

Next, move the mouse to the position name to see how to extract the recruitment position information.

You will find that each position's information is hidden under a <div> tag: the position name is in the text of the <a> element, while the work location and position requirements are in the <div class="job-desc"> element, with the work location in the first <span> tag and the position requirements in the second <span> tag.

After this analysis, the recruitment information we want, including company name, job title, work location and job requirements, has been clearly positioned.
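Here is a similar standalone sketch for one company's recruitment page (ByteDance's ID from earlier), just to verify these selectors; the class name c-job-list and the <h3> inside the <a> tag are the ones the spider code below relies on:

import requests
import bs4

# Fetch ByteDance's recruitment page (example User-Agent, added as an assumption)
res = requests.get('https://www.jobui.com/company/17376344/jobs/',
                   headers={'User-Agent': 'Mozilla/5.0'})
bs = bs4.BeautifulSoup(res.text, 'html.parser')
print(bs.find(id="companyH1").text)              # company name, located by its id
for job in bs.find_all('div', class_="c-job-list"):
    print(job.find('a').find('h3').text)         # position name
    spans = job.find_all('span')
    print(spans[0].text, spans[1].text)          # work location, recruitment requirements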

At this point, we have analyzed the entire crawling process, and the next step is to implement the code.

Code

Let’s follow the normal usage of Scrapy step by step. First, we have to create a Scrapy project.

Create project

Remember how to create it? Open the terminal on your computer (Windows: Win+R, type cmd; Mac: command+space, search for "Terminal"), change to the directory where you want to save the project, and enter the command to create a Scrapy project: scrapy startproject jobui (jobui is the English name of the Zhiyouji website, and here we use it as the name of the Scrapy project).

After creating the project, if you open it in the code editor on your computer, you will see the following structure:

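For reference, a project created with scrapy startproject jobui normally has roughly this layout (main.py is not generated by Scrapy; it is the file we will add ourselves later to run the crawler):

jobui
├── scrapy.cfg            # project deployment configuration
├── main.py               # added by ourselves later, used to run the crawler
└── jobui
    ├── __init__.py
    ├── items.py          # define the items here
    ├── middlewares.py
    ├── pipelines.py      # define the item pipelines here
    ├── settings.py       # project settings
    └── spiders           # the crawler files go in this folder
        └── __init__.py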

Define items

In the analysis just now, we determined that the data to crawl are the company name, position name, work location and recruitment requirements.

So, now please write the code that defines the item.

Below is the code I wrote to define the item.

import scrapy

class JobuiItem(scrapy.Item):
# Define a JobuiItem class that inherits from scrapy.Item
    company = scrapy.Field()
    # Define a field for the company name
    position = scrapy.Field()
    # Define a field for the position name
    address = scrapy.Field()
    # Define a field for the work location
    detail = scrapy.Field()
    # Define a field for the recruitment requirements

Create and write crawler files

After defining the item, the next thing to do is create a crawler file in the spiders folder and name it jobui_jobs.

Now, we can start writing code in this crawler file.

First import the required modules:

import scrapy  
import bs4
from ..items import JobuiItem

Next, we write the core code of the crawler. Let me first walk you through the logic of the code, so that understanding and writing it later goes more smoothly.

From the earlier analysis, we know that we first need to capture the IDs of the 40 companies in the enterprise rankings. For example, ByteDance's ID is /company/17376344/.


Then use the captured company ID to construct the URL of each company's recruitment information. For example, ByteDance’s recruitment information website is https://www.jobui.com/company/17376344/jobs/

We need to wrap the URL of each company's recruitment information into a request object. You may not understand why it has to be wrapped into a request object, so let me explain.

If we were not using Scrapy but the requests library, then once we had a URL we would normally call requests.get() and pass in the URL to get the page source.

In Scrapy, fetching the page source is assigned by the engine to the downloader, so we don't need to handle it ourselves. The reason we construct a new request object is to tell the engine what parameters our new request needs.

This way, the engine gets the correct request object and hands it over to the downloader for processing.

Now that we construct new request objects, we also have to define a matching new method for handling their responses, so that we can extract the recruitment data we want.

Okay, we have clarified the logic of the core code.

Let's continue writing the core code.

# Import modules
import scrapy
import bs4
from ..items import JobuiItem

class JobuiSpider(scrapy.Spider):
# Define a crawler class, JobuiSpider
    name = 'jobui'
    # The crawler's name is jobui
    allowed_domains = ['www.jobui.com']
    # The domain the crawler is allowed to crawl: the Zhiyouji website
    start_urls = ['https://www.jobui.com/rank/company/view/beijing/']
    # The start URL: the Zhiyouji enterprise rankings page (Beijing)

    def parse(self, response):
    # parse is the default method for handling the response
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        # Parse the response (the page source of the enterprise rankings) with BeautifulSoup
        company_list = bs.find('div',id="companyList").find_all('div',class_='c-company-list')
        # Use find_all to extract all the <div class="c-company-list"> tags
        for company in company_list:
        # Loop over company_list
            data = company.find('a')
            # Find the first <a> tag inside, i.e. the <a> under <div class="company-logo-box">
            company_id = data['href']
            # Extract the value of the <a> element's href attribute, i.e. the company ID
            url = 'https://www.jobui.com{id}jobs'
            real_url = url.format(id=company_id)
            # Construct the URL of the company's recruitment information page

Lines 6-13 of the code define the crawler class JobuiSpider, the crawler's name jobui, the domain the crawler is allowed to crawl, and the start URL.

You should be able to understand the rest of the code: we use the default parse method to process the response (the page source of the enterprise rankings), parse it with BeautifulSoup, and use find_all to extract the data (the company IDs).

The company ID is the value of the href attribute of the <a> element. To extract it, we must first grab all the outermost <div class="c-company-list"> tags and then grab the <a> element inside each of them.

So a for loop is used here to extract the value of each <a> element's href attribute and construct the URL of that company's recruitment information page.

At this point in the code, we have completed the first two things in the core code logic: extracting the company ID of the enterprise rankings and constructing the URL of the company's recruitment information.

The next step is to construct the new request objects and define a new method to handle their responses.

Continue improving the core code (please focus on line 30 and the code that follows).

# Import modules
import scrapy
import bs4
from ..items import JobuiItem

class JobuiSpider(scrapy.Spider):
# Define a crawler class, JobuiSpider
    name = 'jobui'
    # The crawler's name is jobui
    allowed_domains = ['www.jobui.com']
    # The domain the crawler is allowed to crawl: the Zhiyouji website
    start_urls = ['https://www.jobui.com/rank/company/view/beijing/']
    # The start URL: the Zhiyouji enterprise rankings page

    def parse(self, response):
    # parse is the default method for handling the response
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        # Parse the response (the page source of the enterprise rankings) with BeautifulSoup
        company_list = bs.find('div',id="companyList").find_all('div',class_='c-company-list')
        # Use find_all to extract all the <div class="c-company-list"> tags
        for company in company_list:
        # Loop over company_list
            data = company.find('a')
            # Find the first <a> tag inside, i.e. the <a> under <div class="company-logo-box">
            company_id = data['href']
            # Extract the value of the <a> element's href attribute, i.e. the company ID
            url = 'https://www.jobui.com{id}jobs'
            real_url = url.format(id=company_id)
            # Construct the URL of the company's recruitment information page
            yield scrapy.Request(real_url, callback=self.parse_job)
            # Use yield to pass the constructed request object to the engine.
            # scrapy.Request constructs the request object; the callback parameter tells the engine to call the parse_job method.


    def parse_job(self, response):
    # Define a new method, parse_job, for handling the response (you can name it yourself)
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        # Parse the response (the page source of the company's recruitment information) with BeautifulSoup
        company = bs.find(id="companyH1").text
        # Use find to extract the company name
        datas = bs.find_all('div',class_="c-job-list")
        # Use find_all to extract the <div class="c-job-list"> tags, which contain the recruitment data
        for data in datas:
        # Loop over datas
            item = JobuiItem()
            # Instantiate the JobuiItem class
            item['company'] = company
            # Put the company name into the company field of JobuiItem
            item['position'] = data.find('a').find('h3').text
            # Extract the position name and put it into the position field of JobuiItem
            item['address'] = data.find_all('span')[0].text
            # Extract the work location and put it into the address field of JobuiItem
            item['detail'] = data.find_all('span')[1].text
            # Extract the recruitment requirements and put them into the detail field of JobuiItem
            yield item
            # Use yield to pass the item to the engine

You probably don’t understand the meaning of line 30: yield scrapy.Request(real_url, callback=self.parse_job). Let me explain it to you.

scrapy.Request is the class used to construct request objects. real_url, the URL of each company's recruitment information page, is the parameter we pass into the request object.

callback means "call back". self.parse_job is the new parse_job method we define below. By passing callback=self.parse_job into the request object, we tell the engine that the next stop for this response is the parse_job() method.

The yield statement passes the constructed request object to the engine.

Line 34 of the code is the new parse_job method we defined. This method parses and extracts data from the company recruitment information pages.

To help you follow the rest of the code, here is where each piece of recruitment data is located (the work location and recruitment requirements are taken from the text of the <span> tags):

Company name: the text of the tag with id="companyH1", i.e. bs.find(id="companyH1").text
Position name: the text of the <h3> inside the <a> tag of each <div class="c-job-list">, i.e. data.find('a').find('h3').text
Work location: the text of the first <span> tag, i.e. data.find_all('span')[0].text
Recruitment requirements: the text of the second <span> tag, i.e. data.find_all('span')[1].text

We extract the company name, position name, work location and recruitment requirements, and put these data into the JobuiItem we defined.

Finally, use the yield statement to pass the item to the engine, and the entire core code is written! ヽ(゚∀゚)メ(゚∀゚)ノ

Store the data

At this point, our entire project still lacks the step of storing data. In Level 7, we learned to use the csv module to store data in csv files and the openpyxl module to store data in Excel files.

In fact, in Scrapy, there are corresponding methods for storing data into csv files and Excel files. Let’s talk about the csv file first.

The method of storing into a csv file is relatively simple. Just add the following code to the settings.py file.

FEED_URI='./storage/data/%(name)s.csv'
FEED_FORMAT='csv'
FEED_EXPORT_ENCODING='ansi'

FEED_URI is the path to the exported file. './storage/data/%(name)s.csv' means to put the stored files into the data subfolder of the storage folder at the same level as the main.py file.

FEED_FORMAT is the export data format. You can get the csv format by writing csv.

FEED_EXPORT_ENCODING is the encoding of the exported file. ansi is an encoding used on Windows; on a Mac, change it to utf-8.
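One caveat: in newer versions of Scrapy (2.1 and later), these FEED_* settings were merged into a single FEEDS setting. If the three lines above have no effect in your version, an equivalent configuration looks roughly like this:

FEEDS = {
    './storage/data/%(name)s.csv': {
        'format': 'csv',
        'encoding': 'utf-8',
    },
}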

The method of storing into an Excel file is a little more complicated. We need to enable ITEM_PIPELINES in settings.py first. The setting method is as follows:

# The ITEM_PIPELINES setting that needs to be modified:

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#     'jobui.pipelines.JobuiPipeline': 300,
# }

Just uncomment ITEM_PIPELINES (delete the #).

# After uncommenting ITEM_PIPELINES:

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
     'jobui.pipelines.JobuiPipeline': 300,
}

Then we can edit the pipelines.py file. To store the data as an Excel file, we still use the openpyxl module. The code is as follows; please read the comments:

import openpyxl

class JobuiPipeline(object):
# Define a JobuiPipeline class, responsible for processing items
    def __init__(self):
    # The initialisation method, run automatically when the class is instantiated
        self.wb = openpyxl.Workbook()
        # Create a workbook
        self.ws = self.wb.active
        # Get the active worksheet
        self.ws.append(['公司', '职位', '地址', '招聘信息'])
        # Use append to add a header row to the sheet

    def process_item(self, item, spider):
    # process_item is the default method for handling items, just as parse is the default method for handling responses
        line = [item['company'], item['position'], item['address'], item['detail']]
        # Put the company name, position name, work location and recruitment requirements into a list and assign it to line
        self.ws.append(line)
        # Use append to add this row of data to the sheet
        return item
        # Return the item to the engine; if there are more item pipelines for it to pass through, the engine will schedule them

    def close_spider(self, spider):
    # close_spider runs when the crawler finishes
        self.wb.save('./jobui.xlsx')
        # Save the file
        self.wb.close()
        # Close the workbook

Modify settings

Finally, we have to modify Scrapy's default settings in the settings.py file: add a request header, and change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False.

# The default settings that need to be modified:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jobui (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
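After the modification, that part of settings.py looks roughly like this (the User-Agent below is just an example browser string; you can copy the one shown in your own browser's Network panel):

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False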

There is another default setting that we need to modify. The code is as follows:

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 0

We need to uncomment the DOWNLOAD_DELAY = 0 line (remove the #). DOWNLOAD_DELAY means download delay, and this setting controls the crawler's speed. Because this project should not crawl too fast, we change the download delay to 0.5 seconds.

The modified code is as follows:

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.5

After modifying the settings, we can now run the code.


A friendly reminder: because the website's anti-crawling measures have been upgraded, if the response status of your requests is not 200, refer to the previous article to solve the problem.

Code practice

I have walked you through the core code; now it is your turn to write the core code of this project yourself.

Tip 1: In addition to writing the core code, you also need to define item, modify settings, and run Scrapy.

Tip 2: To run Scrapy, import the cmdline module in the main.py file and use its execute method to run the terminal command scrapy crawl + crawler name (see the sketch after these tips).

Tip 3: If you use main.py as the entry point, run the project by running the main.py file (and only that file).

Tip 4: Whether to save as csv or Excel is up to you. To store as csv, you only need to change the settings file; to store as Excel, you also need to modify the pipelines.py file.
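For Tips 2 and 3, a minimal main.py could look like this (placed in the project root, next to scrapy.cfg; 'jobui' is the crawler name defined in the spider file):

from scrapy import cmdline
# Equivalent to typing "scrapy crawl jobui" in the terminal
cmdline.execute(['scrapy', 'crawl', 'jobui'])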

Let’s start practicing the code~

How's it going? Is it done?

Regardless of whether the code runs successfully or not, I applaud you! ヾ(o´∀`o)ノ

Summary

At this point, we have completed a complete project using Scrapy. It involves the entire process of a crawler: obtaining data, parsing data, extracting data and storing data.

However, it must be pointed out that, due to space limitations, a lot of content has not been covered: how to use Scrapy's built-in selectors, how to send requests with parameters, how to set cookies, how to send emails, how to integrate with Selenium, how to build a powerful distributed crawler... and so on.

Covering all of that would require adding several more levels, and I don't think that is necessary.

First, you have already learned to parse data with BeautifulSoup, send requests with parameters, add cookies, send emails, and use coroutines; you understand how they work.

Secondly, you already understand the composition and working principle of the Scrapy framework.

As long as you combine the two and practice, you can easily master it.

Right now you are just not familiar with the syntax, and syntax problems are the easiest to solve: read the official documentation. With the knowledge above, the documentation will be easy to follow.

In the next level, I have prepared for you a general review of past knowledge, a summary of anti-crawler strategies, guidelines for future crawler learning, and a letter.

See you in the next level!


Origin blog.csdn.net/qq_41308872/article/details/132667253