Table of contents
- A complete case application of Scrapy
- The basic syntax of CSS selectors
Create a new project
First install Scrapy:
pip install scrapy
The scrapy startproject command first creates a folder named after the project, and then scaffolds a project inside that folder. This step is the starting point for all the code that follows.
scrapy startproject <project-name>
Create the new project:
scrapy startproject my_scrapy
Create the first Scrapy crawler file, pm:
scrapy genspider pm imspm.com
To run project commands, you must first enter the my_scrapy folder, so that the commands operate in the project directory:
cd my_scrapy
A file named pm.py now appears in the spiders folder, with the following content:
import scrapy


class PmSpider(scrapy.Spider):
    name = 'pm'
    allowed_domains = ['imspm.com']
    start_urls = ['http://imspm.com/']

    def parse(self, response):
        pass
Test that the Scrapy crawler runs
Use the command scrapy crawl <spider>, where spider is the crawler file name generated above. If output like the following appears, the crawler has loaded correctly:
2022-11-12 15:27:02 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: my_scrapy)
How to use Scrapy
Scrapy's workflow is very simple:
- Collect the source code of the first page;
- Parse the source code of the first page and get the link to the next page;
- Request the source code of the next page;
- Parse that source code and get the link to the following page;
- […]
- Whenever target data is extracted during this process, save it.
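The loop above can be sketched in plain Python. This is only an illustration of the workflow: the fetch helper and its page map below are hypothetical stand-ins for Scrapy's downloader and your parsing logic, not part of Scrapy.

```python
def fetch(url):
    """Pretend to download and parse a page; returns (items, next_url)."""
    pages = {
        "/page/1": (["item-a"], "/page/2"),
        "/page/2": (["item-b"], None),   # last page: no next link
    }
    return pages[url]

def crawl(start_url):
    url = start_url
    while url:                     # request a page ...
        items, url = fetch(url)    # ... parse it, extract data and the next link
        yield from items           # emit (save) the extracted data

print(list(crawl("/page/1")))  # -> ['item-a', 'item-b']
```

The real Scrapy version of this loop is the parse callback yielding items plus a new Request, as shown later in this article.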
A complete case application of Scrapy
> scrapy startproject my_project 爬虫
> cd 爬虫
> scrapy genspider pm imspm.com
The resulting project structure is as follows:
- scrapy.cfg: configuration file path and deployment configuration;
- items.py: the structure of the target data;
- middlewares.py: middleware file;
- pipelines.py: pipeline file;
- settings.py: configuration information.
Running the code issues 7 requests, because the generated pm.py does not include www in the domain by default. With www added, the number of requests becomes 4.
The pm.py file now looks like this:
import scrapy


class PmSpider(scrapy.Spider):
    name = 'pm'
    allowed_domains = ['www.imspm.com']
    start_urls = ['http://www.imspm.com/']

    def parse(self, response):
        print(response.text)
Here parse is the callback invoked after a response is received for the addresses in start_urls; it outputs the page source directly through the .text attribute of the response parameter.
After obtaining the source code, it needs to be parsed and stored.
Before storing, a data structure must be defined. This is done in the items.py file, where the class name in the generated code is changed from MyProjectItem to ArticleItem.
import scrapy


class ArticleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()   # article title
    url = scrapy.Field()     # article URL
    author = scrapy.Field()  # author
Next, modify the parse function in pm.py and add the web-page parsing logic. The operations are similar to those in pyquery, and you can pick them up by reading the code directly.
def parse(self, response):
    # print(response.text)
    list_item = response.css('.list-item-default')
    # print(list_item)
    for item in list_item:
        title = item.css('.title::text').extract_first()        # get the text directly
        url = item.css('.a_block::attr(href)').extract_first()  # get the attribute value
        author = item.css('.author::text').extract_first()      # get the text directly
        print(title, url, author)
The response.css method returns a list of selectors, which can be iterated, and the css method can be invoked on each object in it.
- item.css('.title::text'): gets the text inside the tag;
- item.css('.a_block::attr(href)'): gets the value of the tag's href attribute;
- extract_first(): extracts the first item of the result list;
- extract(): extracts the whole list.
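The difference between extract() and extract_first() is just "whole list" versus "first item or None". The tiny helper below is a hypothetical stand-in, not Scrapy's real implementation, written only to show that semantic:

```python
def extract_first(results, default=None):
    """Return the first matched result, or a default when nothing matched."""
    return results[0] if results else default

matches = ["First title", "Second title"]   # what extract() would return
print(extract_first(matches))  # -> First title
print(extract_first([]))       # -> None (no IndexError on an empty match)
```

This is why extract_first() is the safer choice inside a loop: a selector that matches nothing yields None instead of raising an error.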
In pm.py, import the ArticleItem class from items.py, then modify the code as follows:
def parse(self, response):
    # print(response.text)
    list_item = response.css('.list-item-default')
    # print(list_item)
    for i in list_item:
        item = ArticleItem()
        title = i.css('.title::text').extract_first()        # get the text directly
        url = i.css('.a_block::attr(href)').extract_first()  # get the attribute value
        author = i.css('.author::text').extract_first()      # get the text directly
        # print(title, url, author)
        # assign values to the item
        item['title'] = title
        item['url'] = url
        item['author'] = author
        yield item
When the Scrapy crawler is run now, item output appears in the logs.
At this point, a single-page crawler is complete.
Next, modify the parse function again so that after parsing the first page, it can also parse the data of the second page.
def parse(self, response):
    # print(response.text)
    list_item = response.css('.list-item-default')
    # print(list_item)
    for i in list_item:
        item = ArticleItem()
        title = i.css('.title::text').extract_first()        # get the text directly
        url = i.css('.a_block::attr(href)').extract_first()  # get the attribute value
        author = i.css('.author::text').extract_first()      # get the text directly
        # print(title, url, author)
        # assign values to the item
        item['title'] = title
        item['url'] = url
        item['author'] = author
        yield item
    next = response.css('.nav a:nth-last-child(2)::attr(href)').extract_first()  # get the next-page link
    # print(next)
    # generate another request
    yield scrapy.Request(url=next, callback=self.parse)
In the code above, the variable next holds the address of the next page, obtained through the response.css function. Focus on learning the CSS selector syntax here.
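One caveat worth knowing: if the extracted href is relative, it must be joined against the current page URL before it can be requested; Scrapy responses provide response.urljoin for exactly this. The snippet below shows the underlying standard-library behavior with illustrative URLs:

```python
from urllib.parse import urljoin

# A relative next-page href, as a site might emit it
next_href = "/page/2"
# Join it against the current page URL (response.urljoin does the same)
absolute = urljoin("http://www.imspm.com/page/1", next_href)
print(absolute)  # -> http://www.imspm.com/page/2
```

If the site in this tutorial emits absolute links, the join is a no-op, so it is safe to apply either way.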
Introduction to CSS selectors
- A selector in CSS is a pattern for picking the elements to be styled. CSS applies one-to-one, one-to-many, or many-to-one control over the elements of an HTML page, and all of this is done through CSS selectors.
The basic syntax of CSS selectors
- Class selector: matches by the element's class attribute; for example, an element with class="box" is selected by the class box;
- ID selector: matches by the element's id attribute; for example, an element with id="box" is selected by the id box;
- Element selector: selects document elements directly; for example, p selects all p elements and div selects all div elements;
- Attribute selector: selects elements that carry a certain attribute; for example, *[title] selects all elements with a title attribute, and a[href] selects all a elements with an href attribute;
- Descendant selector: selects elements that are descendants of another element; for example, li a selects all a elements under any li;
- Child element selector: selects elements that are direct children of another element; for example, h1 > strong selects all strong elements whose parent is h1;
- Adjacent sibling selector: selects an element that immediately follows another element sharing the same parent; for example, h1 + p selects every p element immediately following an h1.
How to use CSS in Scrapy
Taking the a element as an example:
- response.css('a'): returns the selector objects;
- response.css('a').extract(): returns the matched a tags as strings;
- response.css('a::text').extract_first(): returns the text of the first a tag;
- response.css('a::attr(href)').extract_first(): returns the href attribute value of the first a tag;
- response.css('a[href*=image]::attr(href)').extract(): returns the href values of all a tags whose href contains "image";
- response.css('a[href*=image] img::attr(src)').extract(): returns the src attributes of the img tags under those a tags.
yield scrapy.Request(url=next, callback=self.parse)
This creates a new request whose callback is the parse function itself, so the crawler keeps following next-page links.
If you want to save the results of a run, just use the following command:
scrapy crawl pm -o pm.json
To store each item as a single line instead, use scrapy crawl pm -o pm.jl.
The generated file also supports the csv, xml, marshal, and pickle formats; you can try them yourself.
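The .jl (JSON Lines) format stores one JSON object per line, which makes it easy to stream. The snippet below shows how such a file would be read back; the sample line is inlined so it runs without the actual pm.jl, and its field values are illustrative:

```python
import json

# One line of a .jl export, as produced by `scrapy crawl pm -o pm.jl`
sample = '{"title": "A sample article", "url": "http://www.imspm.com/a", "author": "someone"}\n'

# Parse each non-empty line as one item
items = [json.loads(line) for line in sample.splitlines() if line]
print(items[0]["title"])  # -> A sample article
```

To read a real export, replace sample.splitlines() with iteration over open("pm.jl", encoding="utf-8").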
Next, let's use the item pipeline: open the pipelines.py file, change the class name from MyProjectPipeline to TitlePipeline, and then write the following code:
from scrapy.exceptions import DropItem


class TitlePipeline:
    def process_item(self, item, spider):
        # remove whitespace around the title
        if item["title"]:
            item["title"] = item["title"].strip()
            return item
        else:
            raise DropItem("invalid data")
This code strips the leading and trailing spaces from the title, and drops items whose title is empty.
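The pipeline's logic can be checked on its own, outside Scrapy. In the sketch below DropItem is stubbed as a plain exception so the snippet is self-contained; the item dict and its values are illustrative:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""
    pass

class TitlePipeline:
    def process_item(self, item, spider):
        if item["title"]:
            item["title"] = item["title"].strip()  # trim surrounding spaces
            return item
        raise DropItem("invalid data")

cleaned = TitlePipeline().process_item({"title": "  Hello  "}, spider=None)
print(cleaned["title"])  # -> Hello
```

An item with an empty title raises DropItem, which is how Scrapy pipelines discard bad records.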
After writing it, you need to enable the ITEM_PIPELINES configuration in the settings.py file:
ITEM_PIPELINES = {
    'my_project.pipelines.TitlePipeline': 300,
}
The value 300 is the pipeline's running priority (lower values run first) and can be modified as needed. Run the crawler again, and you will find that the spaces around the titles have been removed.
At this point, a basic crawler of scrapy has been written.