This tutorial covers creating and using a Python Scrapy project: running Scrapy code, fetching data with the Scrapy framework, and using XPath in Scrapy.

To install Scrapy, open cmd and enter:

pip install scrapy
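To confirm the installation, you can run Scrapy's version command, which prints the installed version (2.4.1 in the logs below):

scrapy version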

Create the project:

scrapy startproject tencent

Create the spider:

scrapy genspider hr tencent.com
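
The genspider command creates a spider skeleton at tencent/spiders/hr.py that looks roughly like this (the exact template can vary between Scrapy versions):

import scrapy


class HrSpider(scrapy.Spider):
    name = 'hr'                        # the name used with "scrapy crawl hr"
    allowed_domains = ['tencent.com']  # requests outside this domain are filtered
    start_urls = ['http://tencent.com/']

    def parse(self, response):
        # parse the downloaded response here and yield items or new requests
        pass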

items.py explained

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class PachongItem(scrapy.Item):
    # define the fields for your item here, like:
    # name = scrapy.Field()
    pass

Field() inherits from dict, so an item's fields are naturally stored as {key: value} pairs.
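
As a minimal sketch (the field name here is my own illustration, chosen to match the data crawled later, not part of the generated project):

import scrapy


class PachongItem(scrapy.Item):
    name = scrapy.Field()   # one declared field per piece of data you scrape


# an Item is then used exactly like a dictionary:
# item = PachongItem()
# item['name'] = '52倍人生——戴锦华大师电影课'
# print(item['name'])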

middlewares.py

This file holds the spider middleware and the downloader middleware, encapsulated in two classes:

class PachongSpiderMiddleware:    # spider middleware
	pass

class PachongDownloaderMiddleware:    # downloader middleware
	pass


pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class PachongPipeline:
    def process_item(self, item, spider):
        # called for every item the spider yields; must return the item
        return item

If you want a pipeline to run, don't forget to enable it in the ITEM_PIPELINES setting, as shown below.
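
For example, a pipeline that appends every item to a JSON-lines file might look like this (a sketch built on the standard open_spider/close_spider/process_item hooks; the filename is my own choice):

import json


class JsonWriterPipeline:
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # called for every item; write it out and pass it along
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

To enable it, uncomment ITEM_PIPELINES in settings.py and register the class (the number sets the running order; lower runs first):

ITEM_PIPELINES = {
    'pachong.pipelines.JsonWriterPipeline': 300,
}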

settings.py

Finally, let's look at this last file. A brief walkthrough:

# Scrapy project name
BOT_NAME = 'pachong'    # the name of the crawler project I created

SPIDER_MODULES = ['pachong.spiders']    # my spiders live in the pachong/spiders directory
NEWSPIDER_MODULE = 'pachong.spiders'

# whether to obey robots.txt
ROBOTSTXT_OBEY = True    # the robots protocol setting; it can be commented out

# Configure maximum concurrent requests performed by Scrapy (default: 16);
# to change it, uncomment the line below and adjust the number (e.g. 32)
CONCURRENT_REQUESTS = 32

# Download delay; commented out by default (no delay).
# Uncomment it to slow the crawl down if it is too fast.
#DOWNLOAD_DELAY = 3


# The next setting is fairly important.

# Override the default request headers:
# This is where we add the request headers for the site we are crawling.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}


# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'pachong.middlewares.PachongSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'pachong.middlewares.PachongDownloaderMiddleware': 543,
}


# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'pachong.pipelines.PachongPipeline': 300,
#}

Now let's run the spider. In the terminal, enter:

scrapy crawl hr	

and we get output like the following:

2021-03-24 09:51:05 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: pachong)

09:51:05 ------------ start time
Scrapy 2.4.1 -------- Scrapy version
(bot: pachong) ------ project name

2021-03-24 09:51:05 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.2.1, Platform Windows-10-10.0.18362-SP0

lxml 4.4.1.0 ------ library version
Twisted 20.3.0 ---- library version
Python 3.7.4 ------ Python version
Windows-10-10.0.18362-SP0 ---------- OS version
 
2021-03-24 09:51:05 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-03-24 09:51:05 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'pachong',
 'NEWSPIDER_MODULE': 'pachong.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['pachong.spiders']}

'ROBOTSTXT_OBEY': True ----- here we can see the robots protocol is set to True
 
 
2021-03-24 09:51:05 [scrapy.extensions.telnet] INFO: Telnet Password: 28a8213a0b08cff4
2021-03-24 09:51:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2021-03-24 09:51:05 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'pachong.middlewares.PachongDownloaderMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-03-24 09:51:05 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'pachong.middlewares.PachongSpiderMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-03-24 09:51:05 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-03-24 09:51:05 [scrapy.core.engine] INFO: Spider opened

Spider opened --- here we see "opened": this is our spider; it starts loading and crawling below


2021-03-24 09:51:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-03-24 09:51:05 [hr] INFO: Spider opened: hr
2021-03-24 09:51:05 [hr] INFO: Spider opened: hr
2021-03-24 09:51:05 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-03-24 09:51:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://tencent.com/robots.txt> from <GET http://tencent.com/robots.txt>

Redirecting (302) ---- the robots.txt request was redirected from http to https

2021-03-24 09:51:06 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://tencent.com/robots.txt> (referer: None)

Crawled (404) ---- the robots.txt request failed: the file was not found

2021-03-24 09:51:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://tencent.com/> from <GET http://tencent.com/>

DEBUG: Redirecting (302) ---- redirected again, from http to https

2021-03-24 09:51:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tencent.com/> (referer: None)

DEBUG: Crawled (200) ---- the request succeeded

2021-03-24 09:51:07 [scrapy.core.engine] INFO: Closing spider (finished)


We can see that the request for the site's robots.txt failed, while the request to the main site itself succeeded, so we need to adjust the crawler settings.


Closing spider (finished) ------ the crawl ends here. Skip the stats block below for now and continue reading after it.


2021-03-24 09:51:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 864,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 1185,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 2,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 1.168874,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 3, 24, 1, 51, 7, 78523),
 'log_count/DEBUG': 4,
 'log_count/INFO': 12,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2021, 3, 24, 1, 51, 5, 909649)}
2021-03-24 09:51:07 [scrapy.core.engine] INFO: Spider closed (finished)

Let's modify the settings.py file

ROBOTSTXT_OBEY = False    # change this to False

and add more request headers (note that the runs below crawl https://www.douban.com/ rather than tencent.com; the spider's start URL has evidently been changed):

DEFAULT_REQUEST_HEADERS = {
    'Connection': 'keep-alive',
    'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Referer': 'https://www.douban.com/',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

Run the spider again and the log now looks like this:

2021-03-24 10:28:12 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: pachong)
2021-03-24 10:28:12 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.2.1, Platform Windows-10-10.0.18362-SP0
2021-03-24 10:28:12 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-03-24 10:28:12 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'pachong',
 'NEWSPIDER_MODULE': 'pachong.spiders',
 'SPIDER_MODULES': ['pachong.spiders']}
2021-03-24 10:28:12 [scrapy.extensions.telnet] INFO: Telnet Password: fc28b3411aa3a1b4
2021-03-24 10:28:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2021-03-24 10:28:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',   
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',     
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',   
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-03-24 10:28:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-03-24 10:28:12 [scrapy.middleware] INFO: Enabled item pipelines:
[]


2021-03-24 10:28:12 [scrapy.core.engine] INFO: Spider opened
2021-03-24 10:28:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-03-24 10:28:12 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-03-24 10:28:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.douban.com/> (referer: https://www.douban.com/)

DEBUG: Crawled (200) --- with the robots protocol turned off, the robots.txt request is gone; with the request headers added, we can now access the site.

****************************************************************************************************
<200 https://www.douban.com/>
****************************************************************************************************
2021-03-24 10:28:13 [scrapy.core.engine] INFO: Closing spider (finished)




2021-03-24 10:28:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 666,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 18450,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.872716,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 3, 24, 2, 28, 13, 404170),
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2021, 3, 24, 2, 28, 12, 531454)}
2021-03-24 10:28:13 [scrapy.core.engine] INFO: Spider closed (finished)

The request succeeded. To view the results more conveniently, let's silence the redundant logging.

Add one line to the settings.py file:

LOG_LEVEL='WARNING'


Then run the code again; without the log noise we get the data directly, as follows:

****************************************************************************************************
<200 https://www.douban.com/>
<class 'scrapy.http.response.html.HtmlResponse'>
{'name': '了不起的文明现场——一线考古队长带你探秘历史'}
<class 'str'>
{'name': '人人听得懂用得上的法律课'}
<class 'str'>
{'name': '如何读透一本书——12堂阅读写作训练课'}
<class 'str'>
{'name': '52倍人生——戴锦华大师电影课'}
<class 'str'>
{'name': '我们的女性400年——文学里的女性主义简史'}
<class 'str'>
{'name': '用性别之尺丈量世界——18堂思想课解读女性问题'}
<class 'str'>
{'name': '哲学闪耀时——不一样的西方哲学史'}
<class 'str'>
{'name': '读梦——村上春树长篇小说指南'}
<class 'str'>
{'name': '拍张好照片——跟七七学生活摄影'}
<class 'str'>
{'name': '白先勇细说红楼梦——从小说角度重解“红楼”'}
<class 'str'>
****************************************************************************************************

Next, let's talk about how the data is actually extracted. You may have noticed that I already crawled some of the data above.

Yes, the parser used is Scrapy's built-in response.xpath() method for locating elements and taking values; it works the same way as XPath in lxml.

For example:

html=response.xpath('//ul[@class="time-list"]/li/a[2]/text()')

But the value we get back looks like this:

[<Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='了不起的文明现场——一线考古队长带你探秘历史'>,
 <Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='人人听得懂用得上的法律课'>,
 <Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='如何读透一本书——12堂阅读写作训练课'>,
 <Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='52倍人生——戴锦华大师电影课'>,
 <Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='我们的女性400年——文学里的女性主义简史'>,
 <Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='用性别之尺丈量世界——18堂思想课解读女性问题'>,
 <Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='哲学闪耀时——不一样的西方哲学史'>,
 <Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='读梦——村上春树长篇小说指南'>,
 <Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='拍张好照片——跟七七学生活摄影'>,
 <Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='白先勇细说红楼梦——从小说角度重解“红楼”'>]

The value extracted with lxml is a plain list. Here we also get a list, but not a simple one: each element is a Selector object made up of three parts.

<Selector xpath='XXX' data='xxx'>

The first part is the object type, the second is the XPath expression we used to take the value, and the third (data) is the value we want.

In fact, Scrapy encapsulates a number of methods in the Selector object, such as xpath() and css().

The Selector object also gives us several methods for extracting the data:

# old methods
extract_first() ----> returns a single value
extract() ----------> returns a list
# new methods
get() --------------> returns a single value
getall() -----------> returns a list

Both the new methods and the old methods can be used. When nothing matches, extract_first() and get() both return None (or a default you supply).
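
A quick illustration of the no-match behavior (the XPath here is a deliberately non-matching example of my own):

# nothing matches this path, so:
response.xpath('//div[@class="no-such-class"]/text()').get()             # -> None
response.xpath('//div[@class="no-such-class"]/text()').get(default='')   # -> ''
response.xpath('//div[@class="no-such-class"]/text()').getall()          # -> []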

Let's try it with the new get()

html=response.xpath('//ul[@class="time-list"]/li/a[2]/text()').get()

which returns:
了不起的文明现场——一线考古队长带你探秘历史
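
Putting it together, a parse() method along the following lines would produce the {'name': ...} output shown earlier (a sketch I reconstructed from the XPath and output above; the author's exact spider code isn't shown):

import scrapy


class HrSpider(scrapy.Spider):
    name = 'hr'
    start_urls = ['https://www.douban.com/']

    def parse(self, response):
        print('*' * 100)
        print(response)
        print(type(response))
        # getall() returns every matched text node as a plain string
        for name in response.xpath('//ul[@class="time-list"]/li/a[2]/text()').getall():
            print({'name': name})
            print(type(name))
        print('*' * 100)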

Good, the value came back. Now let's talk about how to run the spider from a script.

The first method is to create a .py file in the same directory as scrapy.cfg,

with the following code:

from scrapy import cmdline

# equivalent to typing "scrapy crawl hr" in the terminal
cmdline.execute('scrapy crawl hr'.split())

Running this file runs our Scrapy crawler.
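
Another option, which avoids shelling out to the CLI entirely, is Scrapy's CrawlerProcess API (a standard approach from the Scrapy documentation, sketched here):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load the project's settings.py so middlewares and pipelines still apply
process = CrawlerProcess(get_project_settings())
process.crawl('hr')   # the spider name, same as "scrapy crawl hr"
process.start()       # blocks until the crawl finishes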
