Table of contents
1. Introduction
2. Using Scrapy
3. Scrapy shell
4. CrawlSpider
5. Data storage
6. Log information and log levels
7. Scrapy POST requests
8. Proxies
1. Introduction
1.1. What is Scrapy
Scrapy is an application framework written to crawl websites and extract structured data. It can be used for a wide range of purposes, including data mining, information processing, and archiving historical data.
1.2. Structured data
Data that shares the same structure is called structured data, as shown in the figure below.
1.3. Installation
2. Using Scrapy
2.1. Create a scrapy project
1) In the terminal, run: scrapy startproject <project name>
Note: the project name cannot start with a digit and cannot contain Chinese characters.
2) A new Scrapy project will then appear in the current directory.
2.2. Create a crawler file
1) Move into the spiders folder and create a spider file: scrapy genspider <spider name> <domain>
scrapy genspider baidu www.baidu.com
Note: 1) The spider file must be created inside the spiders folder.
2) The domain does not need the http protocol prefix; Scrapy adds it automatically.
2) Created successfully:
3) Contents of the baidu.py file
Note: if the requested page ends with html, the trailing "/" in start_urls must be removed.
import scrapy

class BaiduSpider(scrapy.Spider):
    # spider name: used when running the spider
    name = 'baidu'
    # domains the spider is allowed to visit
    allowed_domains = ['www.baidu.com']
    # starting URLs, i.e. the first pages to be requested:
    # start_urls = 'http://' + allowed_domains + '/'
    start_urls = ['http://www.baidu.com/']

    # response is the object returned after the page is crawled,
    # similar to: response = urllib.request.urlopen(request)
    def parse(self, response):
        pass
2.3. Run the crawler code
1) Comment out the ROBOTSTXT_OBEY setting in the settings.py file (or set it to False), so the spider is not blocked by the site's robots.txt.
2) Run: scrapy crawl <spider name>
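In settings.py the relevant line looks like the excerpt below (a sketch; commenting the line out or setting it to False both disable robots.txt checking):

```python
# settings.py (excerpt)
# Obey robots.txt rules -- comment this out (or set it to False)
# so the spider can fetch pages the robots.txt would otherwise block:
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False
```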
2.4. Hands-on practice
2.4.1, scrapy project structure
The structure of a Scrapy project:
--project name
  --project name
    --spiders folder (stores the spider files)
      --__init__.py
      --tc.py (custom spider file; the core logic lives here)
    --__init__.py
    --items.py (defines the data structure of the crawled data)
    --middlewares.py (middleware, e.g. proxies)
    --pipelines.py (pipelines that process the downloaded data)
    --settings.py (configuration file: robots protocol, UA definition, etc.)
2.4.2. The attributes and methods of response
method | effect
response.text | get the response as a string
response.body | get the response as binary data (bytes)
response.xpath() | parse the content of the response directly with XPath
response.extract() | extract the data attribute of each selector object
response.extract_first() | extract the first item of the selector list
2.4.3. Composition of the Scrapy architecture
2.4.4. How Scrapy works
3. Scrapy shell
3.1. What is the Scrapy shell
3.2. Installation
3.3. Application
3.3.1. Entering the Scrapy shell
(1) scrapy shell www.baidu.com
(2) scrapy shell http://www.baidu.com
(3) scrapy shell "http://www.baidu.com"
(4) scrapy shell "www.baidu.com"
3.3.2. Syntax
response object:
response.body
response.text
response.url
response.status
Response parsing:
response.xpath
response.extract_first()
response.css()
4. CrawlSpider
4.1. Introduction
4.2. Practical operation
5. Data storage
6. Log information and log level
7. Scrapy POST requests
8. Proxies