In the past, when we wrote crawlers, the most widely used framework was Scrapy. Today we will develop a crawler with the newly released framework feapder and see what the experience is like.
Target website: aHR0cHM6Ly93d3cubGFnb3UuY29tLw==
Requirements: collect the job list and job details; the details must be refreshed every 7 days.
For demonstration purposes, the walkthrough below only searches for crawler-related positions.
1. Research
1.1 List page
First, we need to find out whether the page is dynamically rendered and whether the API has anti-scraping protection.
To check for dynamic rendering, right-click and view the page source, then search the source for text that is visible on the rendered page. For example, searching for Zhiyi Technology, the first company in the result list, returns 0 matches, so we can tentatively conclude the page is dynamically rendered.
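This check is easy to automate. A minimal sketch, where the helper name and the sample HTML are mine rather than anything feapder provides:

```python
def looks_dynamically_rendered(raw_html: str, visible_text: str) -> bool:
    """Heuristic: if text that is visible in the rendered page is absent
    from the raw HTML source, the content is probably injected by JavaScript."""
    return visible_text not in raw_html

# A server-rendered page contains its listings directly in the source:
print(looks_dynamically_rendered("<li>Zhiyi Technology</li>", "Zhiyi Technology"))  # False
# A client-rendered shell has only a mount point and scripts:
print(looks_dynamically_rendered('<div id="app"></div><script src="app.js"></script>',
                                 "Zhiyi Technology"))  # True
```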
Alternatively, you can download and inspect the page source with the feapder shell.
Calling the response.open() command produces a temp.html file in the working directory whose content is the source returned by the current request. Opening it, the page is stuck on a loading screen, and the source is a piece of JavaScript that performs a security check. We can therefore infer that the site has anti-scraping protection, and take it as an early warning that the difficulty may escalate.
feapder also supports issuing a request from a copied curl command, as follows:
Press F12 (or right-click and choose Inspect) to open the debug window, refresh the page, click the request for the current page, and choose Copy as cURL. Back in the command-line window, enter feapder shell - and paste what you just copied.
It turns out the request fails even with the headers and cookies attached; some parameters may only be usable once.
Research conclusion: the list page has anti-scraping protection and is dynamically rendered.
P.S. A seasoned developer would keep digging: what is the list API, and how can the anti-scraping be defeated? But since I am playing the novice here, I won't worry about it.
1.2 Details page
The investigation is similar to the list page. The conclusion: there is anti-scraping protection, but the page is not dynamically rendered.
2. Create the project
Open the command line tool and enter:
> feapder create -p lagou-spider
lagou-spider project created successfully
The generated project is as follows:
I use PyCharm, so I first right-click the project name and add it to the workspace (Mark Directory as -> Sources Root).
3. Write a listing page crawler
3.1 Create a crawler
> cd lagou-spider/spiders
> feapder create -s list_spider
ListSpider created successfully
The generated code is as follows:
import feapder


class ListSpider(feapder.AirSpider):
    def start_requests(self):
        yield feapder.Request("https://www.baidu.com")

    def parse(self, request, response):
        print(response)


if __name__ == "__main__":
    ListSpider().start()
This example requests Baidu and can be run directly.
3.2 Write the crawler
Issue tasks:
    def start_requests(self):
        yield feapder.Request(
            "https://www.lagou.com/jobs/list_%E7%88%AC%E8%99%AB?labelWords=&fromSearch=true&suginput=",
            render=True,
        )
Note that the request carries the render parameter, which indicates whether to render the page with a browser. Since this list page is dynamically rendered and protected by anti-scraping measures that left me rather confused, I used rendering mode to save myself the headache.
Write the parsing function
Observing the page structure, we write the following parser:
    def parse(self, request, response):
        job_list = response.xpath("//li[contains(@class, 'con_list_item')]")
        for job in job_list:
            job_name = job.xpath("./@data-positionname").extract_first()
            company = job.xpath("./@data-company").extract_first()
            salary = job.xpath("./@data-salary").extract_first()
            job_url = job.xpath(".//a[@class='position_link']/@href").extract_first()
            print(job_name, company, salary, job_url)
We parse out the job title, company, salary, and the job detail URL. The normal logic would be to issue the detail URL as a new task to fetch the details:
    def parse(self, request, response):
        job_list = response.xpath("//li[contains(@class, 'con_list_item')]")
        for job in job_list:
            job_name = job.xpath("./@data-positionname").extract_first()
            company = job.xpath("./@data-company").extract_first()
            salary = job.xpath("./@data-salary").extract_first()
            job_url = job.xpath(".//a[@class='position_link']/@href").extract_first()
            print(job_name, company, salary, job_url)
            yield feapder.Request(
                job_url, callback=self.parse_detail, cookies=response.cookies.get_dict()
            )  # carry the cookies returned by the list page; the callback points to the detail parser

    def parse_detail(self, request, response):
        print(response.text)
        # TODO: parse the details
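The selectors above can be exercised offline against a handmade fragment. The sketch below uses the standard library's limited XPath subset (feapder's response.xpath is lxml-based and richer, e.g. it supports contains()), and every attribute value in the fragment is invented for illustration:

```python
import xml.etree.ElementTree as ET

# A simplified fragment mirroring the structure the parser targets.
FRAGMENT = """
<ul>
  <li class="con_list_item" data-positionname="Crawler Engineer"
      data-company="Zhiyi Technology" data-salary="15k-30k">
    <a class="position_link" href="https://www.lagou.com/jobs/123.html">detail</a>
  </li>
</ul>
"""

root = ET.fromstring(FRAGMENT)
# Exact-match attribute predicate (stdlib XPath has no contains()).
for job in root.findall(".//li[@class='con_list_item']"):
    job_name = job.get("data-positionname")
    company = job.get("data-company")
    salary = job.get("data-salary")
    job_url = job.find(".//a[@class='position_link']").get("href")
    print(job_name, company, salary, job_url)
```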
But the requirement states that the details must be updated every 7 days, while nothing says the list needs updating. So, as an optimization, we write a separate crawler for the details; this crawler is only responsible for the list data and for producing detail tasks.
3.3 Data storage
Create table
Job list table lagou_job_list:
CREATE TABLE `lagou_job_list` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT COMMENT 'auto-increment id',
  `job_name` varchar(255) DEFAULT NULL COMMENT 'job title',
  `company` varchar(255) DEFAULT NULL COMMENT 'company',
  `salary` varchar(255) DEFAULT NULL COMMENT 'salary',
  `job_url` varchar(255) DEFAULT NULL COMMENT 'job URL',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
Detail task table lagou_job_detail_task:
CREATE TABLE `lagou_job_detail_task` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `url` varchar(255) DEFAULT NULL,
  `state` int(11) DEFAULT '0' COMMENT 'task state (0 pending, 1 done, 2 in progress, -1 failed)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
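The state column drives the whole task lifecycle: 0 pending, 2 in progress, then 1 done or -1 failed. Here is a sqlite3 stand-in for the transitions that BatchSpider performs against MySQL (the UPDATE statements are my illustration of the idea, not the framework's actual SQL):

```python
import sqlite3

# In-memory stand-in for the MySQL task table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lagou_job_detail_task ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT, state INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO lagou_job_detail_task (url) VALUES ('https://example.com/job/1')")

# Claim pending tasks (what the master does when issuing tasks): 0 -> 2.
conn.execute("UPDATE lagou_job_detail_task SET state = 2 WHERE state = 0")
# Mark a task done once its item is stored (update_task_batch's effect): 2 -> 1.
conn.execute("UPDATE lagou_job_detail_task SET state = 1 WHERE id = 1")

state = conn.execute("SELECT state FROM lagou_job_detail_task WHERE id = 1").fetchone()[0]
print(state)  # 1
```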
Data storage method
There are many ways to write data to the database: import pymysql directly and assemble SQL statements, or use the MysqlDB helper that ships with the framework. But feapder offers an even more convenient approach: automatic storage.
AirSpider does not support automatic storage; it is deliberately lightweight, and to keep it that way the author has not added the feature for now. The distributed Spider does support it, though, so we simply change the base class to Spider:
class ListSpider(feapder.AirSpider):
Change to
class ListSpider(feapder.Spider):
Generate the items
Each item maps one-to-one to a table and drives the storage mechanism; items can be generated with the feapder command.
First, configure the database connection information in the settings file.
Then generate the items:
> cd items
> feapder create -i lagou_job_list
> feapder create -i lagou_job_detail_task
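To make the "item maps to a table" idea concrete, here is a rough stand-in that turns item attributes into an INSERT statement. The real generated item subclasses feapder's Item and is richer than this; the class below only illustrates the one-item-per-row mapping:

```python
class LagouJobListItem:
    """Toy item: one instance carries one row for the lagou_job_list table."""
    table_name = "lagou_job_list"

    def __init__(self):
        self.job_name = None
        self.company = None
        self.salary = None
        self.job_url = None

    def to_sql(self):
        # Only fields that were actually set become columns of the INSERT.
        fields = {k: v for k, v in self.__dict__.items() if v is not None}
        cols = ", ".join(fields)
        placeholders = ", ".join(["%s"] * len(fields))
        return (f"INSERT INTO {self.table_name} ({cols}) VALUES ({placeholders})",
                list(fields.values()))

item = LagouJobListItem()
item.job_name = "Crawler Engineer"
item.company = "Zhiyi Technology"
sql, values = item.to_sql()
print(sql)     # INSERT INTO lagou_job_list (job_name, company) VALUES (%s, %s)
print(values)  # ['Crawler Engineer', 'Zhiyi Technology']
```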
Data storage
    def parse(self, request, response):
        job_list = response.xpath("//li[contains(@class, 'con_list_item')]")
        for job in job_list:
            job_name = job.xpath("./@data-positionname").extract_first()
            company = job.xpath("./@data-company").extract_first()
            salary = job.xpath("./@data-salary").extract_first()
            job_url = job.xpath(".//a[@class='position_link']/@href").extract_first()
            # list data
            list_item = lagou_job_list_item.LagouJobListItem()
            list_item.job_name = job_name
            list_item.company = company
            list_item.salary = salary
            list_item.job_url = job_url
            yield list_item  # yield it directly; the framework batches the inserts
            # detail task
            detail_task_item = lagou_job_detail_task_item.LagouJobDetailTaskItem()
            detail_task_item.url = job_url
            yield detail_task_item  # yield it directly; the framework batches the inserts
The data is returned to the framework as yielded items, and the framework writes it to the database automatically in batches.
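The batching behind yield item can be pictured with a toy buffer: items accumulate and are flushed in groups rather than inserted one at a time (the real pipeline has more flush triggers; this sketch, with a made-up batch size, shows only the grouping):

```python
def batch(items, batch_size=3):
    """Group an item stream into lists of at most batch_size, as a
    batched-insert pipeline would before writing to the database."""
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= batch_size:
            yield buffer  # one database round-trip per batch
            buffer = []
    if buffer:  # flush the remainder
        yield buffer

batches = list(batch(range(7), batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```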
3.4 Overall code
import feapder
from items import *


class ListSpider(feapder.Spider):
    def start_requests(self):
        yield feapder.Request(
            "https://www.lagou.com/jobs/list_%E7%88%AC%E8%99%AB?labelWords=&fromSearch=true&suginput=",
            render=True,
        )

    def parse(self, request, response):
        job_list = response.xpath("//li[contains(@class, 'con_list_item')]")
        for job in job_list:
            job_name = job.xpath("./@data-positionname").extract_first()
            company = job.xpath("./@data-company").extract_first()
            salary = job.xpath("./@data-salary").extract_first()
            job_url = job.xpath(".//a[@class='position_link']/@href").extract_first()
            # list data
            list_item = lagou_job_list_item.LagouJobListItem()
            list_item.job_name = job_name
            list_item.company = company
            list_item.salary = salary
            list_item.job_url = job_url
            yield list_item  # yield it directly; the framework batches the inserts
            # detail task
            detail_task_item = lagou_job_detail_task_item.LagouJobDetailTaskItem()
            detail_task_item.url = job_url
            yield detail_task_item  # yield it directly; the framework batches the inserts


if __name__ == "__main__":
    spider = ListSpider(redis_key="feapder:lagou_list")
    spider.start()
redis_key is the root key under which the task queue is stored in Redis.
Run the spider directly, and observe that the data is written to the database automatically.
4. Write details crawler
Unlike the list crawler, the detail data must be refreshed every 7 days.
To present time-series data, we collect every 7 days, and each record must carry batch information so the data can be partitioned along the 7-day dimension.
Before encountering feapder, we would have to pull tasks from the detail task table in batches ourselves, maintain the task states and the batch information mentioned above, and monitor collection progress to guarantee data freshness. Writing such a crawler by hand is very tedious.
So how does feapder handle it? To save space, here is the complete code:
import feapder
from items import *


class DetailSpider(feapder.BatchSpider):
    def start_requests(self, task):
        task_id, url = task
        yield feapder.Request(url, task_id=task_id, render=True)

    def parse(self, request, response):
        job_name = response.xpath('//div[@class="job-name"]/@title').extract_first().strip()
        detail = response.xpath('string(//div[@class="job-detail"])').extract_first().strip()
        item = lagou_job_detail_item.LagouJobDetailItem()
        item.title = job_name
        item.detail = detail
        item.batch_date = self.batch_date  # batch info, maintained by the framework itself
        yield item  # automatic batched storage
        yield self.update_task_batch(request.task_id, 1)  # update the task state


if __name__ == "__main__":
    spider = DetailSpider(
        redis_key="feapder:lagou_detail",  # root key in Redis for tasks and related info
        task_table="lagou_job_detail_task",  # task table in MySQL
        task_keys=["id", "url"],  # fields to fetch from the task table; more can be added
        task_state="state",  # task state field in MySQL
        batch_record_table="lagou_detail_batch_record",  # batch record table in MySQL
        batch_name="detail spider (weekly)",  # batch name
        batch_interval=7,  # batch interval in days; for hours write e.g. 1 / 24
    )

    # the two start functions below act as master and worker; run them separately
    # spider.start_monitor_task()  # issue and monitor tasks
    spider.start()  # collect
We run spider.start_monitor_task() and spider.start() separately, and inspect the database once the crawler finishes.
Task table: lagou_job_detail_task
All tasks are finished. The framework has a re-issue mechanism for lost tasks, resending them until every task is done.
Data table: lagou_job_detail
Each record carries the batch time, which lets us partition the data by batch. The current batch is March 19; with a 7-day interval, the next batch will be March 26.
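The batch arithmetic is plain date addition. A sketch matching the dates in the text (the year is my assumption):

```python
from datetime import date, timedelta

# With batch_interval=7, the next batch starts 7 days after the current one.
batch_interval = 7
current_batch = date(2021, 3, 19)  # year assumed for illustration
next_batch = current_batch + timedelta(days=batch_interval)
print(next_batch.isoformat())  # 2021-03-26
```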
If the crawler is restarted within the same batch and there are no new tasks, it will not re-crawl:
spider.start_monitor_task()
spider.start()
Batch table: lagou_detail_batch_record
The batch table, specified in the startup parameters, is generated automatically. It records the crawl status of each batch in detail: the total number of tasks, the number done, the number failed, whether the batch is complete, and so on.
5. Integration
We now have both the list crawler and the detail crawler, but their entry points are spread across two files, which is messy to manage. feapder recommends consolidating them in main.py:
from feapder import ArgumentParser

from spiders import *


def crawl_list():
    """
    list crawler
    """
    spider = list_spider.ListSpider(redis_key="feapder:lagou_list")
    spider.start()


def crawl_detail(args):
    """
    detail crawler
    @param args: 1 / 2 / init
    """
    spider = detail_spider.DetailSpider(
        redis_key="feapder:lagou_detail",  # root key in Redis for tasks and related info
        task_table="lagou_job_detail_task",  # task table in MySQL
        task_keys=["id", "url"],  # fields to fetch from the task table; more can be added
        task_state="state",  # task state field in MySQL
        batch_record_table="lagou_detail_batch_record",  # batch record table in MySQL
        batch_name="detail spider (weekly)",  # batch name
        batch_interval=7,  # batch interval in days; for hours write e.g. 1 / 24
    )

    if args == 1:
        spider.start_monitor_task()
    elif args == 2:
        spider.start()


if __name__ == "__main__":
    parser = ArgumentParser(description="xxx crawler")

    parser.add_argument(
        "--crawl_list", action="store_true", help="list crawler", function=crawl_list
    )
    parser.add_argument(
        "--crawl_detail", type=int, nargs=1, help="detail crawler (1|2)", function=crawl_detail
    )

    parser.start()
View the start command:
> python3 main.py --help
usage: main.py [-h] [--crawl_list] [--crawl_detail CRAWL_DETAIL]

xxx crawler

optional arguments:
  -h, --help            show this help message and exit
  --crawl_list          list crawler
  --crawl_detail CRAWL_DETAIL
                        detail crawler (1|2)
Start the list crawler:
python3 main.py --crawl_list
Start the detail crawler master:
python3 main.py --crawl_detail 1
Start the detail crawler worker:
python3 main.py --crawl_detail 2
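feapder's ArgumentParser extends the standard library's argparse with the function= binding seen above. A plain-argparse sketch of the same dispatch, with the routing done manually and the handler bodies replaced by stubs:

```python
import argparse

def crawl_list():
    return "list crawler started"  # stub for the real spider launch

def crawl_detail(arg):
    # 1 starts the master (task issuing/monitoring), 2 starts a worker.
    return "master" if arg == 1 else "worker"

parser = argparse.ArgumentParser(description="xxx crawler")
parser.add_argument("--crawl_list", action="store_true", help="list crawler")
parser.add_argument("--crawl_detail", type=int, help="detail crawler (1|2)")

# Simulate: python3 main.py --crawl_detail 1
args = parser.parse_args(["--crawl_detail", "1"])
if args.crawl_list:
    print(crawl_list())
if args.crawl_detail is not None:
    print(crawl_detail(args.crawl_detail))  # master
```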
Summary
This article took a recruitment website as an example and walked through the whole process of collecting data with feapder, covering its three crawler classes: AirSpider, Spider, and BatchSpider.
- AirSpider is lightweight with a low learning cost. It suits small data volumes where neither resumable crawling nor distributed collection is needed.
- Spider is a Redis-based distributed crawler suited to massive data collection; it supports resumable crawling, crawler alarms, automatic data storage, and more.
- BatchSpider is a distributed batch crawler. For data that must be collected periodically, it is the first choice.
Besides browser rendering, feapder also supports custom pipelines, which users can write to hook up other databases.
Rich alarms are built into the framework, so we are notified promptly when something goes wrong with a crawler, safeguarding data freshness:
- Calculates crawl speed in real time, estimates the remaining time, and predicts whether the crawl will finish within the configured cycle
- Alarm when a crawler stalls
- Alarm when too many tasks fail, which may indicate a site template change or blocking
- Download status monitoring