1 Introduction
For anyone writing crawlers, mastering at least one crawler framework is an essential skill, so for beginners I recommend Scrapy.
I won't belabor what Scrapy is or what it does (its documentation and countless introductions cover that); time is precious, so let's get straight to the practical part: a hands-on case that walks you through using Scrapy.
Our target for this walkthrough is Bilibili (often nicknamed "Station B").
2. Scrapy hands-on introduction
1. Environmental preparation
Install scrapy
pip install scrapy
Scrapy can be installed directly with the command above.
2. Create a Scrapy project
scrapy startproject Bili
The command above creates a crawler project named Bili.
Run it from whatever directory you want the project to live in (here, the desktop).
Project structure
Bili
├── Bili
│ ├── __init__.py
│ ├── items.py
│ ├── middlewares.py
│ ├── pipelines.py
│ ├── __pycache__
│ ├── settings.py
│ └── spiders
│ ├── __init__.py
│ └── __pycache__
└── scrapy.cfg
The role of each file
- scrapy.cfg: the project's overall configuration file; it usually needs no modification.
- Bili: the project's Python module; the program imports your code from here.
- Bili/items.py: defines the Item classes used by the project. An Item is essentially a DTO (Data Transfer Object) that declares the fields to be scraped; you define it yourself.
- Bili/pipelines.py: the project's pipeline file, responsible for processing the scraped items; you write this yourself.
- Bili/settings.py: the project's configuration file, where project-related settings are made.
- Bili/spiders: the directory that holds the project's spiders, which do the actual crawling.
3. Decide what to crawl
https://search.bilibili.com/all?keyword=%E8%AF%BE%E7%A8%8B&page=2
Taking the link above (a Bilibili search page) as an example, we will crawl each video's title and link (url).
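Before writing the spider, it helps to see how the search URL is structured. A quick sketch using Python's standard urllib.parse shows that the percent-encoded keyword is the Chinese word 课程 ("course") and that pagination is driven by the page query parameter:

```python
from urllib.parse import urlsplit, parse_qs

url = "https://search.bilibili.com/all?keyword=%E8%AF%BE%E7%A8%8B&page=2"

# Split the URL into components and decode its query string
parts = urlsplit(url)
query = parse_qs(parts.query)

print(query["keyword"][0])  # decoded search keyword: 课程
print(query["page"][0])     # page number: 2
```

Since the page number lives in a single query parameter, crawling further pages is just a matter of incrementing it.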
4. Define each class in the project
Items class
import scrapy

class BiliItem(scrapy.Item):
    # define the fields for your item here, like:
    # name = scrapy.Field()
    # Video title
    title = scrapy.Field()
    # Video link
    url = scrapy.Field()
The fields to scrape are the video's title and link (url), so we declare the two fields title and url.
Define spider class
The spider class defines the rules for parsing web pages (a freshly created Scrapy project has no spider; you create it yourself).
Scrapy provides the scrapy genspider command for creating spiders. The syntax of the command is as follows:
scrapy genspider [options] <name> <domain>
From the command line, change into the Bili project directory, then run the following command to create a spider:
scrapy genspider lyc "bilibili.com"
After running the command above, you will find a lyc.py file in the Bili/spiders directory of the Bili project.
Edit lyc.py
import scrapy
from Bili.items import BiliItem

class LycSpider(scrapy.Spider):
    name = 'lyc'
    allowed_domains = ['bilibili.com']
    start_urls = ['https://search.bilibili.com/all?keyword=课程&page=2']

    # Parsing method called for each downloaded response
    def parse(self, response):
        # Match each video entry in the search result list
        for jobs_primary in response.xpath('//*[@id="all-list"]/div[1]/ul/li'):
            item = BiliItem()
            item['title'] = jobs_primary.xpath('./a/@title').extract()
            item['url'] = jobs_primary.xpath('./a/@href').extract()
            # Use yield rather than return, so every item is emitted
            yield item
Modify the pipeline class
This class performs the final processing of the scraped items and is typically responsible for writing them to a file or database.
Here we simply print them to the console.
from itemadapter import ItemAdapter

class BiliPipeline:
    def process_item(self, item, spider):
        print("title:", item['title'])
        print("url:", item['url'])
        # Returning the item lets any later pipelines keep processing it
        return item
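To persist the data instead of printing it, the same process_item hook can write each item as one line of JSON. A minimal sketch (the class name JsonLinesPipeline and the file name bili.jl are illustrative; to activate it, register it in ITEM_PIPELINES in settings.py just like BiliPipeline):

```python
import json

class JsonLinesPipeline:
    """Write each scraped item to a file as one JSON object per line."""

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open("bili.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for both Scrapy Items and plain dicts
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

The open_spider/close_spider hooks are part of Scrapy's standard pipeline interface, so the file is opened and closed exactly once per crawl.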
Modify settings.py
BOT_NAME = 'Bili'
SPIDER_MODULES = ['Bili.spiders']
NEWSPIDER_MODULE = 'Bili.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Bili (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure default request headers so requests look like a normal browser
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Bili.pipelines.BiliPipeline': 300,
}
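As an alternative to a custom pipeline, recent Scrapy versions (2.1+) can export items directly through the FEEDS setting in settings.py; a sketch (the output filename is arbitrary):

```python
# Feed export: write every yielded item to a JSON file (Scrapy 2.1+)
FEEDS = {
    "bili_items.json": {
        "format": "json",
        "encoding": "utf8",
    },
}
```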
The skeleton of a simple Scrapy project is now complete; let's run it and see.
Run the project
scrapy crawl lyc
This only crawls a single page, though; we can also follow the next page.
Update lyc.py with the following code:
import scrapy
from Bili.items import BiliItem

class LycSpider(scrapy.Spider):
    name = 'lyc'
    allowed_domains = ['bilibili.com']
    start_urls = ['https://search.bilibili.com/all?keyword=课程&page=2']

    # Parsing method called for each downloaded response
    def parse(self, response):
        # Match each video entry in the search result list
        for jobs_primary in response.xpath('//*[@id="all-list"]/div[1]/ul/li'):
            item = BiliItem()
            item['title'] = jobs_primary.xpath('./a/@title').extract()
            item['url'] = jobs_primary.xpath('./a/@href').extract()
            # Use yield rather than return, so every item is emitted
            yield item
        # Get the URL of the current page
        url = response.request.url
        # Increment the page number (note: this assumes a single-digit page)
        new_link = url[0:-1] + str(int(url[-1]) + 1)
        # Send another request to fetch the next page
        yield scrapy.Request(new_link, callback=self.parse)
Crawling the next page
If you run the spider again, it will now crawl page after page.
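Note that the string-slice increment above only works while the page number is a single digit (once the URL ends in page=10, taking the last character breaks it). A more robust sketch that rewrites the page query parameter with the standard urllib.parse (the helper name next_page is my own):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def next_page(url: str) -> str:
    """Return the same URL with its 'page' query parameter incremented."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # Default to page 1 if the parameter is missing
    page = int(query.get("page", ["1"])[0])
    query["page"] = [str(page + 1)]
    # doseq=True re-encodes the {key: [values]} mapping from parse_qs
    new_query = urlencode(query, doseq=True)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, new_query, parts.fragment))

print(next_page("https://search.bilibili.com/all?keyword=course&page=9"))
# → https://search.bilibili.com/all?keyword=course&page=10
```

In the spider, the last two lines of parse would then become `yield scrapy.Request(next_page(response.request.url), callback=self.parse)` (you would also want a stopping condition, e.g. a maximum page number, so the crawl does not run forever).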
3. Summary
1. Through the actual case "Station B", create a scrapy project by hand, parse the web page, and finally successfully crawl the data and print (save)
2. Suitable for beginners to start scrapy, welcome to collect, analyze, and learn