Getting started with Python Scrapy: write a crawler in 10 minutes


Before TensorFlow became popular, many people learned Python because they wanted to write crawlers. With its rich ecosystem of third-party libraries, Python is indeed well suited to this kind of work.
Scrapy is a crawler framework that is easy to learn and easy to use. The Internet is complex and ever-changing, so many crawlers still need a fair amount of custom code, but starting from a reasonably complete and well-balanced framework saves a lot of work.

Installing the framework

It would be impolite to use someone else's website as the crawl target, so let's start from scratch and use this site as the example for a simple crawler.
Out of habit, this article uses Python 2 as the working environment. Note that Scrapy 2.0 and later run only on Python 3, where the print statements in the code below must be written as print() calls.
Installing the Scrapy framework is simple. Provided you already have the pip package manager, it takes a single command:

pip install scrapy

Creating a crawler project

A crawler project can contain multiple spiders, so a single project is usually enough for most people.
Creating a project also takes only one command:

# scrapy startproject <project name>, for example:
scrapy startproject formoon

After this command runs, a formoon folder is created in the current directory, containing a crawler project generated from the basic template.
Running scrapy without any arguments prints its help; use scrapy <subcommand> --help to see help for a specific subcommand.

Add a crawler to the project

First enter the project directory:

cd formoon

Then you can build the first crawler in the project:

# scrapy genspider <spider name> <domain the spider applies to>, for example:
scrapy genspider pages formoon.github.io

The above command creates a Python file named pages.py under <working directory>/formoon (the project directory)/formoon/spiders/, with the following default content:

# -*- coding: utf-8 -*-
import scrapy

class PagesSpider(scrapy.Spider):
    name = 'pages'
    allowed_domains = ['formoon.github.io']
    start_urls = ['http://formoon.github.io/']
    
    def parse(self, response):
        pass

Writing the crawler

Suppose the requirement is this: the crawler walks the entire https://formoon.github.io website, finds every article, and lists each article's title, link, and publication date.
As usual, the finished code is given first, with detailed explanations in the comments:

# -*- coding: utf-8 -*-
import scrapy

class PagesSpider(scrapy.Spider):
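    name = 'pages'  # the spider's name; do not change it
    allowed_domains = ['formoon.github.io'] # allowed domain
    start_urls = ['https://formoon.github.io/'] # the crawl starts from this URL; note the template default is http, changed here to https
    # Scrapy does not rewrite the links found in a page, so we add a class variable to turn relative addresses into absolute ones.
    baseurl='https://formoon.github.io'

    def parse(self, response):
        # The main difficulty in a Scrapy spider is using XPath and CSS selectors; search online for related resources to learn them.
        # The spider uses these selectors to locate the nodes it needs in the HTML and to extract their data.
        for course in response.xpath('//ul/li'):
            # get the article link
            href = self.baseurl+course.xpath('a/@href').extract()[0]
            # get the article title
            title = course.css('.card-title').xpath('text()').extract()[0]
            # get the article's publication date
            date = course.css('.card-type.is-notShownIfHover').xpath('text()').extract()[0]
            # print the result
            print title,href,date
        for btn in response.css('.container--call-to-action').xpath('a'):
            href = btn.xpath('@href').extract()[0]
            name = btn.xpath('button/text()').extract()[0]
            # if the page has a "next page" button (labelled 下一页), follow it and parse that page the same way
            if name == u"下一页":  # note: in Python 2, Chinese text needs an explicit 'u' prefix to mark it as a unicode string
                yield scrapy.Request(self.baseurl+href,callback=self.parse)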
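
To make the selector logic less abstract, here is a rough sketch of the same node-locating idea using only Python's standard library. Everything in it is an assumption for illustration: the markup is a made-up, well-formed fragment (the real page's HTML may differ; only the class names are taken from the spider), list_articles is a hypothetical helper, and xml.etree.ElementTree supports only a small XPath subset, so this is not Scrapy's selector API:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup shaped like one listing page.
# Only the class names come from the spider; the structure is assumed.
PAGE = """
<html><body>
  <ul>
    <li>
      <a href="/2018/04/04/daheng-camera/">
        <h2 class="card-title">First article</h2>
        <span class="card-type is-notShownIfHover">2018-04-04</span>
      </a>
    </li>
    <li>
      <a href="/2018/03/30/surf-feature/">
        <h2 class="card-title">Second article</h2>
        <span class="card-type is-notShownIfHover">2018-03-30</span>
      </a>
    </li>
  </ul>
</body></html>
"""


def list_articles(markup):
    """Return (title, href, date) tuples for every <li> under a <ul>."""
    root = ET.fromstring(markup.strip())
    rows = []
    for li in root.findall(".//ul/li"):   # same idea as //ul/li
        link = li.find("a")               # direct child <a>, as in a/@href
        # Exact attribute match; a CSS class selector is more tolerant.
        title = li.find(".//*[@class='card-title']")
        date = li.find(".//*[@class='card-type is-notShownIfHover']")
        if link is None or title is None or date is None:
            continue  # skip list items that are not article cards
        rows.append((title.text, link.get("href"), date.text))
    return rows


articles = list_articles(PAGE)
for row in articles:
    print(row)
```

In the spider itself, response.xpath and response.css play the role that findall and find play here, and they also cope with real-world HTML that is not well-formed XML.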

Running the crawler

To run the crawler, use the following command:

scrapy crawl pages

The results obtained are as follows:

2018-04-16 16:26:14 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: formoon)
2018-04-16 16:26:14 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 2.7.14 (default, Mar  9 2018, 23:57:12) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-04-16 16:26:14 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'formoon.spiders', 'SPIDER_MODULES': ['formoon.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'formoon'}
2018-04-16 16:26:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-04-16 16:26:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-16 16:26:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-16 16:26:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-16 16:26:15 [scrapy.core.engine] INFO: Spider opened
2018-04-16 16:26:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-16 16:26:15 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-16 16:26:16 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://formoon.github.io/robots.txt> (referer: None)
2018-04-16 16:26:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://formoon.github.io/> (referer: None)
大恒工业相机多实例使用 https://formoon.github.io/2018/04/04/daheng-camera/ 2018-04-04
图像识别基本算法之SURF https://formoon.github.io/2018/03/30/surf-feature/ 2018-03-30
macOS的OpenCL高性能计算 https://formoon.github.io/2018/03/23/mac-opencl/ 2018-03-23
量子计算及量子计算的模拟 https://formoon.github.io/2018/03/20/dlib-quantum-computing/ 2018-03-20
iPhone多次输入错误密码锁机后恢复 https://formoon.github.io/2018/03/18/IOS-Password-Recovery/ 2018-03-18
Mac版AppStore无法下载、升级错误处理 https://formoon.github.io/2018/03/18/appstore-item-temporarily-unavailabel/ 2018-03-18
在Mac上使用vs-code快速上手c语言学习 https://formoon.github.io/2018/03/10/vscode-on-mac/ 2018-03-10
在Mac上使用远程X11应用 https://formoon.github.io/2018/03/09/remote-xwindows/ 2018-03-09
Docker for mac上使用Kubernetes https://formoon.github.io/2018/03/07/docker-for-mac/ 2018-03-07
那些令人惊艳的TensorFlow扩展包和社区贡献模型 https://formoon.github.io/2018/03/03/TensorFlow-models/ 2018-03-03
swift异步调用和对象间互动 https://formoon.github.io/2018/03/02/macos-thread-and-appdelegate/ 2018-03-02
将dylib库嵌入macOS应用的方法 https://formoon.github.io/2018/02/27/macos-app-embed-dylib/ 2018-02-27
macOS使用内置驱动加载可读写NTFS分区 https://formoon.github.io/2018/02/19/macos-mount-ntfs-as-read-write/ 2018-02-19
mac应用启动时卡死在“验证...” https://formoon.github.io/2018/02/16/macos-stuck-verifying-app/ 2018-02-16
CrossOver和wine https://formoon.github.io/2018/02/16/crossover-wine-copy/ 2018-02-16
2018-04-16 16:26:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://formoon.github.io/pages/2/> (referer: https://formoon.github.io/)
Mark https://formoon.github.io/2018/02/09/hello-world/ 2018-02-09
GreenPlum无法远程访问解决 https://formoon.github.io/2018/02/08/greenplum-on-centos/ 2018-02-08
rinetd:轻量级Linux端口转发工具 https://formoon.github.io/2018/02/06/linux-port-forward-tools/ 2018-02-06
Ubuntu16包依赖故障解决 https://formoon.github.io/2018/02/05/ubuntu-apt-error-of-package-depend/ 2018-02-05
iNode环境Windows 10配置固定IP地址 https://formoon.github.io/2018/02/02/win10-inode-2-ipaddress/ 2018-02-02
Ubuntu 16.04.03 LTS 安装CUDA/CUDNN/TensorFlow+GPU流水账 https://formoon.github.io/2018/01/31/ubuntu-cuda-cudnn-tensorflow-setting/ 2018-01-31
resource fork, Finder information, or similar detritus not allowed https://formoon.github.io/2018/01/29/xcode-compile-error-1/ 2018-01-29
macOS webview编程 https://formoon.github.io/2018/01/29/mac-webview-program/ 2018-01-29
新麦装机问题汇 https://formoon.github.io/2018/01/24/new-mac-install/ 2018-01-24
比特币核心算法ECDSA电子签名在线演示 https://formoon.github.io/2018/01/22/bitcoin-and-ecdsa/ 2018-01-22
从锅炉工到AI专家(11)(END) https://formoon.github.io/2018/01/18/tensorFlow-series-11/ 2018-01-18
gem update 升级错误解决 https://formoon.github.io/2018/01/18/gem-update-error-solve/ 2018-01-18
比特币核心概念及算法 https://formoon.github.io/2018/01/18/bitcoin-and-blockchain/ 2018-01-18
从锅炉工到AI专家(10) https://formoon.github.io/2018/01/17/tensorFlow-series-10/ 2018-01-17
Python2中文处理纪要 https://formoon.github.io/2018/01/17/python2-chn-process/ 2018-01-17
2018-04-16 16:26:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://formoon.github.io/pages/3/> (referer: https://formoon.github.io/pages/2/)
从锅炉工到AI专家(9) https://formoon.github.io/2018/01/16/tensorFlow-series-9/ 2018-01-16
从锅炉工到AI专家(8) https://formoon.github.io/2018/01/15/tensorFlow-series-8/ 2018-01-15
从锅炉工到AI专家(7) https://formoon.github.io/2018/01/12/tensorFlow-series-7/ 2018-01-12
从锅炉工到AI专家(6) https://formoon.github.io/2018/01/11/tensorFlow-series-6/ 2018-01-11
从锅炉工到AI专家(5) https://formoon.github.io/2018/01/11/tensorFlow-series-5/ 2018-01-11
从锅炉工到AI专家(4) https://formoon.github.io/2018/01/10/tensorFlow-series-4/ 2018-01-10
Octave Fontconfig报错解决 https://formoon.github.io/2018/01/10/octave-fontconfig-warning/ 2018-01-10
5分钟搭建一个quic服务器 https://formoon.github.io/2018/01/10/5mins-support-quic/ 2018-01-10
从锅炉工到AI专家(3) https://formoon.github.io/2018/01/09/tensorFlow-series-3/ 2018-01-09
从锅炉工到AI专家(2) https://formoon.github.io/2018/01/08/tensorFlow-series-2/ 2018-01-08
从锅炉工到AI专家(1) https://formoon.github.io/2018/01/08/tensorFlow-series-1/ 2018-01-08
解决本博客在手机浏览器拖动卡顿问题 https://formoon.github.io/2018/01/04/solve-mobile-browser-pull-problem/ 2018-01-04
OpenCV中的照片剪裁 https://formoon.github.io/2018/01/04/opencv-photo-crop/ 2018-01-04
OpenCV中的亮度对比度调整及其自动均衡 https://formoon.github.io/2018/01/04/opencv-brightness-and-contrast/ 2018-01-04
Mac电脑C语言开发的入门帖 https://formoon.github.io/2018/01/03/c-hello-world-for-mac/ 2018-01-03
2018-04-16 16:26:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://formoon.github.io/pages/4/> (referer: https://formoon.github.io/pages/3/)
如何看到微信小程序的源码 https://formoon.github.io/2018/01/02/wechat-mini-app-rd/ 2018-01-02
使用人工辅助点达成更优白平衡 https://formoon.github.io/2018/01/02/opencv-whitebalance-with-point-confirm/ 2018-01-02
不使用插件建立jekyll网站sitemap https://formoon.github.io/2017/12/29/sitemap_of_jekyll/ 2017-12-29
safari11如何访问自签名https网站 https://formoon.github.io/2017/12/29/safari-self-signed-https/ 2017-12-29
赶个时髦,给自己的博客添加一个微信二维码 https://formoon.github.io/2017/12/29/add-wechat-qrcode-on-your-blog/ 2017-12-29
被Docker/VMWare宠坏的孩子们,还记得QEMU吗? https://formoon.github.io/2017/12/28/qemu-on-mac/ 2017-12-28
在网页显示数学公式 https://formoon.github.io/2017/12/28/mathjax-in-page/ 2017-12-28
使用SDL2显示一张图片 https://formoon.github.io/2017/12/28/hello-world-sdl2/ 2017-12-28
如何规范的把进程放到Linux后台运行 https://formoon.github.io/2017/12/27/selinux-run-app-in-background/ 2017-12-27
两种方法操作其它mac应用的窗口 https://formoon.github.io/2017/12/27/move-other-app-window-on-mac/ 2017-12-27
自己动手,装一个液晶电视 https://formoon.github.io/2017/12/25/lcd-tv-diy/ 2017-12-25
半小时完成一个湿度温度计 https://formoon.github.io/2017/12/25/arduino-hygrothermograph/ 2017-12-25
MacPro4,1升级到MacPro5,1 https://formoon.github.io/2017/12/22/macpro41-upgrade/ 2017-12-22
CameraBox个人讲台客户端使用说明 https://formoon.github.io/2017/12/22/camerabox-manual/ 2017-12-22
一段使用Educast抠像混屏直播的视频展示 https://formoon.github.io/2017/12/21/streaming-mix/ 2017-12-21
2018-04-16 16:26:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://formoon.github.io/pages/5/> (referer: https://formoon.github.io/pages/4/)
七牛对象存储的使用 https://formoon.github.io/2017/12/21/qiniu-storage/ 2017-12-21
Educast视频直播控制台使用说明 https://formoon.github.io/2017/12/21/educast-manual/ 2017-12-21
批量自动重命名音乐文件 https://formoon.github.io/2017/12/20/mp3-m4a-rename/ 2017-12-20
把Markdown文本发布到微信公众号文章 https://formoon.github.io/2017/12/20/markdown-to-html-and-wechat/ 2017-12-20
Javascript已加入AppleScript全家桶 https://formoon.github.io/2017/12/19/jxa-appscript/ 2017-12-19
分享一个很通用的Makefile https://formoon.github.io/2017/12/19/Makefile-skill/ 2017-12-19
在Mac电脑编译c51程序 https://formoon.github.io/2017/12/18/c51-on-mac/ 2017-12-18
golang子进程的启动和停止 https://formoon.github.io/2017/12/16/ubuntu-golang-stop-child-process/ 2017-12-16
Ubuntu16.04LTS appstreamcli报错的处理 https://formoon.github.io/2017/12/15/ubuntu-appstreamcli-error/ 2017-12-15
AngularJS2+调用原有的js脚本 https://formoon.github.io/2017/12/14/angular4-ts-and-local-js/ 2017-12-14
在国内使用golang的小技巧 https://formoon.github.io/2017/12/14/use-golang-in-china/ 2017-12-14
Angular2+的两个小技巧 https://formoon.github.io/2017/12/14/angular4-hotkeys-and-detect-browser/ 2017-12-14
Unix程序员的Win10二三事 https://formoon.github.io/2017/12/14/Unix%E7%A8%8B%E5%BA%8F%E5%91%98%E7%9A%84win10%E4%BA%8C%E4%B8%89%E4%BA%8B/ 2017-12-14
在Ubuntu上搭建kindle gtk开发环境 https://formoon.github.io/2017/12/13/hello-world-for-kindle/ 2017-12-13
苹果手机上下载的文件在哪里? https://formoon.github.io/2017/12/13/download-on-ios/ 2017-12-13
2018-04-16 16:26:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://formoon.github.io/pages/6/> (referer: https://formoon.github.io/pages/5/)
K60平台智能车开发工作随手记 https://formoon.github.io/2017/12/11/smart-car-k60-develope/ 2017-12-11
使用Jekyll和github搭建自己的个人博客 https://formoon.github.io/2017/12/11/setting-your-own-jekyll-blog/ 2017-12-11
使用ffmpeg做简单的音视频剪辑 https://formoon.github.io/2017/12/11/ffmpeg-auido-video-edit/ 2017-12-11
安装Homebrew https://formoon.github.io/2017/12/08/install-homebrew-on-mac/ 2017-12-08
在Mac上安装ffmpeg https://formoon.github.io/2017/12/08/install-ffmpeg-on-mac/ 2017-12-08
Hello World https://formoon.github.io/2017/12/08/hello-world/ 2017-12-08
2018-04-16 16:26:19 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-16 16:26:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1779,
 'downloader/request_count': 7,
 'downloader/request_method_count/GET': 7,
 'downloader/response_bytes': 57926,
 'downloader/response_count': 7,
 'downloader/response_status_count/200': 6,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 4, 16, 8, 26, 19, 71963),
 'log_count/DEBUG': 8,
 'log_count/INFO': 7,
 'memusage/max': 50831360,
 'memusage/startup': 50827264,
 'request_depth_max': 5,
 'response_received_count': 7,
 'scheduler/dequeued': 6,
 'scheduler/dequeued/memory': 6,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'start_time': datetime.datetime(2018, 4, 16, 8, 26, 15, 15007)}
2018-04-16 16:26:19 [scrapy.core.engine] INFO: Spider closed (finished)
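
The referer chain in the log shows each listing page leading to the next one. A toy model of that behaviour, in plain Python with made-up page data, may help to see how yielding a request from parse keeps the crawl going. The names PAGES, parse and crawl are illustrative only, and real Scrapy schedules requests asynchronously and filters duplicate URLs, which this sketch does not attempt:

```python
# Toy model of the crawl: each "page" lists its articles and, optionally,
# the URL of the next page. The data is made up for illustration.
PAGES = {
    "/": {"articles": ["a1", "a2"], "next": "/pages/2/"},
    "/pages/2/": {"articles": ["a3"], "next": "/pages/3/"},
    "/pages/3/": {"articles": ["a4"], "next": None},
}


def parse(url):
    """Yield the articles on one page, then a follow-up request if any."""
    page = PAGES[url]
    for article in page["articles"]:
        yield ("item", article)
    if page["next"]:
        yield ("request", page["next"])


def crawl(start_url):
    """Minimal scheduler: run parse on each queued URL and track depth."""
    queue = [(start_url, 0)]
    items, max_depth = [], 0
    while queue:
        url, depth = queue.pop(0)
        max_depth = max(max_depth, depth)
        for kind, value in parse(url):
            if kind == "item":
                items.append(value)
            else:
                # A request found at depth d is processed at depth d + 1.
                queue.append((value, depth + 1))
    return items, max_depth


items, max_depth = crawl("/")
print(items, max_depth)
```

Three pages linked in a row give a maximum depth of 2, in the same way that the six pages in the log above give a request_depth_max of 5.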

As the results show, the crawler ran and produced the correct output. If you don't want to see the log during execution, add the --nolog option:

> scrapy crawl pages --nolog
大恒工业相机多实例使用 https://formoon.github.io/2018/04/04/daheng-camera/ 2018-04-04
图像识别基本算法之SURF https://formoon.github.io/2018/03/30/surf-feature/ 2018-03-30
macOS的OpenCL高性能计算 https://formoon.github.io/2018/03/23/mac-opencl/ 2018-03-23
量子计算及量子计算的模拟 https://formoon.github.io/2018/03/20/dlib-quantum-computing/ 2018-03-20
iPhone多次输入错误密码锁机后恢复 https://formoon.github.io/2018/03/18/IOS-Password-Recovery/ 2018-03-18
Mac版AppStore无法下载、升级错误处理 https://formoon.github.io/2018/03/18/appstore-item-temporarily-unavailabel/ 2018-03-18
在Mac上使用vs-code快速上手c语言学习 https://formoon.github.io/2018/03/10/vscode-on-mac/ 2018-03-10
在Mac上使用远程X11应用 https://formoon.github.io/2018/03/09/remote-xwindows/ 2018-03-09
Docker for mac上使用Kubernetes https://formoon.github.io/2018/03/07/docker-for-mac/ 2018-03-07
那些令人惊艳的TensorFlow扩展包和社区贡献模型 https://formoon.github.io/2018/03/03/TensorFlow-models/ 2018-03-03
swift异步调用和对象间互动 https://formoon.github.io/2018/03/02/macos-thread-and-appdelegate/ 2018-03-02
将dylib库嵌入macOS应用的方法 https://formoon.github.io/2018/02/27/macos-app-embed-dylib/ 2018-02-27
macOS使用内置驱动加载可读写NTFS分区 https://formoon.github.io/2018/02/19/macos-mount-ntfs-as-read-write/ 2018-02-19
mac应用启动时卡死在“验证...” https://formoon.github.io/2018/02/16/macos-stuck-verifying-app/ 2018-02-16
CrossOver和wine https://formoon.github.io/2018/02/16/crossover-wine-copy/ 2018-02-16
Mark https://formoon.github.io/2018/02/09/hello-world/ 2018-02-09
GreenPlum无法远程访问解决 https://formoon.github.io/2018/02/08/greenplum-on-centos/ 2018-02-08
rinetd:轻量级Linux端口转发工具 https://formoon.github.io/2018/02/06/linux-port-forward-tools/ 2018-02-06
Ubuntu16包依赖故障解决 https://formoon.github.io/2018/02/05/ubuntu-apt-error-of-package-depend/ 2018-02-05
iNode环境Windows 10配置固定IP地址 https://formoon.github.io/2018/02/02/win10-inode-2-ipaddress/ 2018-02-02
Ubuntu 16.04.03 LTS 安装CUDA/CUDNN/TensorFlow+GPU流水账 https://formoon.github.io/2018/01/31/ubuntu-cuda-cudnn-tensorflow-setting/ 2018-01-31
resource fork, Finder information, or similar detritus not allowed https://formoon.github.io/2018/01/29/xcode-compile-error-1/ 2018-01-29
macOS webview编程 https://formoon.github.io/2018/01/29/mac-webview-program/ 2018-01-29
新麦装机问题汇 https://formoon.github.io/2018/01/24/new-mac-install/ 2018-01-24
比特币核心算法ECDSA电子签名在线演示 https://formoon.github.io/2018/01/22/bitcoin-and-ecdsa/ 2018-01-22
从锅炉工到AI专家(11)(END) https://formoon.github.io/2018/01/18/tensorFlow-series-11/ 2018-01-18
gem update 升级错误解决 https://formoon.github.io/2018/01/18/gem-update-error-solve/ 2018-01-18
比特币核心概念及算法 https://formoon.github.io/2018/01/18/bitcoin-and-blockchain/ 2018-01-18
从锅炉工到AI专家(10) https://formoon.github.io/2018/01/17/tensorFlow-series-10/ 2018-01-17
Python2中文处理纪要 https://formoon.github.io/2018/01/17/python2-chn-process/ 2018-01-17
从锅炉工到AI专家(9) https://formoon.github.io/2018/01/16/tensorFlow-series-9/ 2018-01-16
从锅炉工到AI专家(8) https://formoon.github.io/2018/01/15/tensorFlow-series-8/ 2018-01-15
从锅炉工到AI专家(7) https://formoon.github.io/2018/01/12/tensorFlow-series-7/ 2018-01-12
从锅炉工到AI专家(6) https://formoon.github.io/2018/01/11/tensorFlow-series-6/ 2018-01-11
从锅炉工到AI专家(5) https://formoon.github.io/2018/01/11/tensorFlow-series-5/ 2018-01-11
从锅炉工到AI专家(4) https://formoon.github.io/2018/01/10/tensorFlow-series-4/ 2018-01-10
Octave Fontconfig报错解决 https://formoon.github.io/2018/01/10/octave-fontconfig-warning/ 2018-01-10
5分钟搭建一个quic服务器 https://formoon.github.io/2018/01/10/5mins-support-quic/ 2018-01-10
从锅炉工到AI专家(3) https://formoon.github.io/2018/01/09/tensorFlow-series-3/ 2018-01-09
从锅炉工到AI专家(2) https://formoon.github.io/2018/01/08/tensorFlow-series-2/ 2018-01-08
从锅炉工到AI专家(1) https://formoon.github.io/2018/01/08/tensorFlow-series-1/ 2018-01-08
解决本博客在手机浏览器拖动卡顿问题 https://formoon.github.io/2018/01/04/solve-mobile-browser-pull-problem/ 2018-01-04
OpenCV中的照片剪裁 https://formoon.github.io/2018/01/04/opencv-photo-crop/ 2018-01-04
OpenCV中的亮度对比度调整及其自动均衡 https://formoon.github.io/2018/01/04/opencv-brightness-and-contrast/ 2018-01-04
Mac电脑C语言开发的入门帖 https://formoon.github.io/2018/01/03/c-hello-world-for-mac/ 2018-01-03
如何看到微信小程序的源码 https://formoon.github.io/2018/01/02/wechat-mini-app-rd/ 2018-01-02
使用人工辅助点达成更优白平衡 https://formoon.github.io/2018/01/02/opencv-whitebalance-with-point-confirm/ 2018-01-02
不使用插件建立jekyll网站sitemap https://formoon.github.io/2017/12/29/sitemap_of_jekyll/ 2017-12-29
safari11如何访问自签名https网站 https://formoon.github.io/2017/12/29/safari-self-signed-https/ 2017-12-29
赶个时髦,给自己的博客添加一个微信二维码 https://formoon.github.io/2017/12/29/add-wechat-qrcode-on-your-blog/ 2017-12-29
被Docker/VMWare宠坏的孩子们,还记得QEMU吗? https://formoon.github.io/2017/12/28/qemu-on-mac/ 2017-12-28
在网页显示数学公式 https://formoon.github.io/2017/12/28/mathjax-in-page/ 2017-12-28
使用SDL2显示一张图片 https://formoon.github.io/2017/12/28/hello-world-sdl2/ 2017-12-28
如何规范的把进程放到Linux后台运行 https://formoon.github.io/2017/12/27/selinux-run-app-in-background/ 2017-12-27
两种方法操作其它mac应用的窗口 https://formoon.github.io/2017/12/27/move-other-app-window-on-mac/ 2017-12-27
自己动手,装一个液晶电视 https://formoon.github.io/2017/12/25/lcd-tv-diy/ 2017-12-25
半小时完成一个湿度温度计 https://formoon.github.io/2017/12/25/arduino-hygrothermograph/ 2017-12-25
MacPro4,1升级到MacPro5,1 https://formoon.github.io/2017/12/22/macpro41-upgrade/ 2017-12-22
CameraBox个人讲台客户端使用说明 https://formoon.github.io/2017/12/22/camerabox-manual/ 2017-12-22
一段使用Educast抠像混屏直播的视频展示 https://formoon.github.io/2017/12/21/streaming-mix/ 2017-12-21
七牛对象存储的使用 https://formoon.github.io/2017/12/21/qiniu-storage/ 2017-12-21
Educast视频直播控制台使用说明 https://formoon.github.io/2017/12/21/educast-manual/ 2017-12-21
批量自动重命名音乐文件 https://formoon.github.io/2017/12/20/mp3-m4a-rename/ 2017-12-20
把Markdown文本发布到微信公众号文章 https://formoon.github.io/2017/12/20/markdown-to-html-and-wechat/ 2017-12-20
Javascript已加入AppleScript全家桶 https://formoon.github.io/2017/12/19/jxa-appscript/ 2017-12-19
分享一个很通用的Makefile https://formoon.github.io/2017/12/19/Makefile-skill/ 2017-12-19
在Mac电脑编译c51程序 https://formoon.github.io/2017/12/18/c51-on-mac/ 2017-12-18
golang子进程的启动和停止 https://formoon.github.io/2017/12/16/ubuntu-golang-stop-child-process/ 2017-12-16
Ubuntu16.04LTS appstreamcli报错的处理 https://formoon.github.io/2017/12/15/ubuntu-appstreamcli-error/ 2017-12-15
AngularJS2+调用原有的js脚本 https://formoon.github.io/2017/12/14/angular4-ts-and-local-js/ 2017-12-14
在国内使用golang的小技巧 https://formoon.github.io/2017/12/14/use-golang-in-china/ 2017-12-14
Angular2+的两个小技巧 https://formoon.github.io/2017/12/14/angular4-hotkeys-and-detect-browser/ 2017-12-14
Unix程序员的Win10二三事 https://formoon.github.io/2017/12/14/Unix%E7%A8%8B%E5%BA%8F%E5%91%98%E7%9A%84win10%E4%BA%8C%E4%B8%89%E4%BA%8B/ 2017-12-14
在Ubuntu上搭建kindle gtk开发环境 https://formoon.github.io/2017/12/13/hello-world-for-kindle/ 2017-12-13
苹果手机上下载的文件在哪里? https://formoon.github.io/2017/12/13/download-on-ios/ 2017-12-13
K60平台智能车开发工作随手记 https://formoon.github.io/2017/12/11/smart-car-k60-develope/ 2017-12-11
使用Jekyll和github搭建自己的个人博客 https://formoon.github.io/2017/12/11/setting-your-own-jekyll-blog/ 2017-12-11
使用ffmpeg做简单的音视频剪辑 https://formoon.github.io/2017/12/11/ffmpeg-auido-video-edit/ 2017-12-11
安装Homebrew https://formoon.github.io/2017/12/08/install-homebrew-on-mac/ 2017-12-08
在Mac上安装ffmpeg https://formoon.github.io/2017/12/08/install-ffmpeg-on-mac/ 2017-12-08
Hello World https://formoon.github.io/2017/12/08/hello-world/ 2017-12-08
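
One detail worth a second look is how the links are made absolute. The spider concatenates baseurl and the href, which works as long as every href starts with a slash. A slightly more general approach, sketched here with the standard library's urljoin, resolves each href against the URL of the page it was found on. The sample hrefs are assumptions (the /pages/3/ form is inferred from the URLs in the crawl log), and absolute_links is a hypothetical helper:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2


def absolute_links(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


page = "https://formoon.github.io/pages/2/"
links = absolute_links(page, [
    "/pages/3/",   # root-relative: resolved against the site root
    "next/",       # page-relative (hypothetical): resolved against the page
    "https://formoon.github.io/2018/02/09/hello-world/",  # already absolute
])
for link in links:
    print(link)
```

Plain concatenation would turn the second href into a broken URL, while urljoin handles all three forms. Scrapy's response objects also provide a urljoin method that resolves a link against the response's own URL.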

Going further: items and pipelines

For most users, the steps above already cover the basic needs. Two further mechanisms, however, make a crawler cleaner, smoother, and more powerful.
An item is the basic unit of data that Scrapy processes. The parse method of a spider is really expected to return item objects, each representing one unit of data.
To use items, first edit <project directory>/formoon/items.py to define our own data structure:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class FormoonItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # the lines above come from the template; below are the 3 fields we add ourselves
    title = scrapy.Field()
    link = scrapy.Field()
    date = scrapy.Field()    
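
A scrapy.Item behaves much like a dictionary, except that it only accepts the fields declared on the class; assigning to an undeclared field raises KeyError. The following toy class imitates that one behaviour in plain Python so it can be tried without Scrapy. DeclaredFieldsRecord is a hypothetical stand-in for illustration, not Scrapy's implementation:

```python
class DeclaredFieldsRecord(dict):
    """Toy stand-in for an item: a dict that accepts only declared keys."""

    fields = ("title", "link", "date")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s"
                           % (type(self).__name__, key))
        dict.__setitem__(self, key, value)


record = DeclaredFieldsRecord()
record["title"] = "Hello World"
record["link"] = "https://formoon.github.io/2017/12/08/hello-world/"
record["date"] = "2017-12-08"

try:
    record["titel"] = "typo"   # misspelled field name
    rejected = False
except KeyError:
    rejected = True

print(sorted(record.items()))
print("typo rejected:", rejected)
```

Catching a misspelled field name at assignment time is one practical reason to declare fields instead of passing plain dictionaries around.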

Using items to represent the basic data units has many advantages. One of the most important is that you can use Scrapy's built-in pipeline mechanism. A pipeline provides three hooks: one called before the spider starts working, one called after it finishes, and one called for each data unit. This divides the program into clearly separated parts and makes it easier to connect to more complex downstream processing.
Edit the <project directory>/formoon/pipelines.py file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

class FormoonPipeline(object):
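    total = 0   # our own variable, used to count the total number of articles
    # open_spider is called before the spider starts working; typically used to initialise the environment, open a database, open a file, and so on
    def open_spider(self, spider):
        # here we only print one line of text as an example
        print "open spider ..."
    # The most basic method: it is called every time the spider's parse method returns an item, to process that one data unit
    def process_item(self, item, spider):
        self.total += 1 # count the articles
        # print the basic data; this method is typically where data is saved to a database, analysis is triggered, and so on
        print("%s %s %s"%(item['date'],item['title'],item['link']))
        return item
    # Called when all links have been processed and the spider finishes; typically used to close the database, close files, and so on.
    def close_spider(self, spider):
        # as an example, just print the final count
        print u"共",self.total,u"篇文章"  # prints "共 N 篇文章" (N articles in total)
        print "close spider ..."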
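
A pipeline is an ordinary Python class, so its three hooks can be exercised by hand. As a sketch of the "save to a file" use mentioned in the comments, the following hypothetical JsonLinesPipeline writes one JSON object per line. The class name, the articles.jl file name and the manual driver at the bottom are assumptions for illustration, and the code is written for Python 3:

```python
import json
import os
import tempfile


class JsonLinesPipeline(object):
    """Write every item as one JSON object per line (hypothetical example)."""

    def __init__(self, path="articles.jl"):
        self.path = path      # output file name is an assumption
        self.file = None
        self.total = 0

    def open_spider(self, spider):
        # Called once before crawling starts: acquire resources here.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called once per item; dict(item) accepts dicts and dict-like items.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        self.total += 1
        return item           # hand the item on to the next pipeline

    def close_spider(self, spider):
        # Called once after crawling ends: release resources here.
        self.file.close()


# Manual driver, standing in for what the framework would do.
demo_path = os.path.join(tempfile.mkdtemp(), "articles.jl")
pipeline = JsonLinesPipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"title": "Hello World",
     "link": "https://formoon.github.io/2017/12/08/hello-world/",
     "date": "2017-12-08"},
    spider=None,
)
pipeline.close_spider(spider=None)

with open(demo_path, encoding="utf-8") as handle:
    written = [json.loads(line) for line in handle]
print(pipeline.total, written)
```

Inside a project, Scrapy would call the three methods itself once the class is listed in the ITEM_PIPELINES setting; the driver lines are only there to show the call order.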

With these two definitions in place, the item and the pipeline still need to be connected. This is configured in settings.py, where the relevant setting is commented out by default, which means pipelines are not enabled. Enable it by removing the comment markers:

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'formoon.pipelines.FormoonPipeline': 300,
}
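
The number 300 is an ordering value: when several pipelines are enabled, each item passes through them from the lowest number to the highest, and the values are conventionally chosen between 0 and 1000. A small sketch of that ordering, in which DropUndatedPipeline is a hypothetical second pipeline that does not exist in this project:

```python
# Hypothetical settings: DropUndatedPipeline is made up for illustration.
ITEM_PIPELINES = {
    'formoon.pipelines.FormoonPipeline': 300,
    'formoon.pipelines.DropUndatedPipeline': 100,
}

# Items pass through the pipelines in ascending order of these numbers.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

With these values, an item would reach the hypothetical filter first and FormoonPipeline second, regardless of the order in which the entries are written in the dictionary.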

The last step is to modify the spider so that, instead of printing data directly, it returns standard item objects. To allow comparison with the original spider, we add a second spider that uses the new mechanism:

scrapy genspider pagesnew formoon.github.io

As before, this creates a pagesnew.py file in the <project directory>/formoon/spiders/ directory for the new spider. Edit this file:

# -*- coding: utf-8 -*-
import scrapy
from formoon.items import FormoonItem   # import our custom item
class PagesnewSpider(scrapy.Spider):
    name = 'pagesnew'
    allowed_domains = ['formoon.github.io']
    start_urls = ['https://formoon.github.io/']

    baseurl='https://formoon.github.io'

    def parse(self, response):
        for course in response.xpath('//ul/li'):
            href = self.baseurl+course.xpath('a/@href').extract()[0]
            title = course.css('.card-title').xpath('text()').extract()[0]
            date = course.css('.card-type.is-notShownIfHover').xpath('text()').extract()[0]
            # the difference starts here: instead of printing the data, we create an empty item and fill it in
            item = FormoonItem()
            item['date']=date
            item['link']=href
            item['title']=title
            yield item  # return the data
        for btn in response.css('.container--call-to-action').xpath('a'):
            href = btn.xpath('@href').extract()[0]
            name = btn.xpath('button/text()').extract()[0]
            if name == u"下一页":
                yield scrapy.Request(self.baseurl+href,callback=self.parse)

As you can see, this mechanism keeps the structure of the spider simple and clear.
Try running it:

> scrapy crawl pagesnew --nolog
open spider ...
2018-04-04 大恒工业相机多实例使用 https://formoon.github.io/2018/04/04/daheng-camera/
2018-03-30 图像识别基本算法之SURF https://formoon.github.io/2018/03/30/surf-feature/
2018-03-23 macOS的OpenCL高性能计算 https://formoon.github.io/2018/03/23/mac-opencl/
2018-03-20 量子计算及量子计算的模拟 https://formoon.github.io/2018/03/20/dlib-quantum-computing/
2018-03-18 iPhone多次输入错误密码锁机后恢复 https://formoon.github.io/2018/03/18/IOS-Password-Recovery/
2018-03-18 Mac版AppStore无法下载、升级错误处理 https://formoon.github.io/2018/03/18/appstore-item-temporarily-unavailabel/
2018-03-10 在Mac上使用vs-code快速上手c语言学习 https://formoon.github.io/2018/03/10/vscode-on-mac/
2018-03-09 在Mac上使用远程X11应用 https://formoon.github.io/2018/03/09/remote-xwindows/
2018-03-07 Docker for mac上使用Kubernetes https://formoon.github.io/2018/03/07/docker-for-mac/
2018-03-03 那些令人惊艳的TensorFlow扩展包和社区贡献模型 https://formoon.github.io/2018/03/03/TensorFlow-models/
2018-03-02 swift异步调用和对象间互动 https://formoon.github.io/2018/03/02/macos-thread-and-appdelegate/
2018-02-27 将dylib库嵌入macOS应用的方法 https://formoon.github.io/2018/02/27/macos-app-embed-dylib/
2018-02-19 macOS使用内置驱动加载可读写NTFS分区 https://formoon.github.io/2018/02/19/macos-mount-ntfs-as-read-write/
2018-02-16 mac应用启动时卡死在“验证...” https://formoon.github.io/2018/02/16/macos-stuck-verifying-app/
2018-02-16 CrossOver和wine https://formoon.github.io/2018/02/16/crossover-wine-copy/
2018-02-09 Mark https://formoon.github.io/2018/02/09/hello-world/
2018-02-08 GreenPlum无法远程访问解决 https://formoon.github.io/2018/02/08/greenplum-on-centos/
2018-02-06 rinetd:轻量级Linux端口转发工具 https://formoon.github.io/2018/02/06/linux-port-forward-tools/
2018-02-05 Ubuntu16包依赖故障解决 https://formoon.github.io/2018/02/05/ubuntu-apt-error-of-package-depend/
2018-02-02 iNode环境Windows 10配置固定IP地址 https://formoon.github.io/2018/02/02/win10-inode-2-ipaddress/
2018-01-31 Ubuntu 16.04.03 LTS 安装CUDA/CUDNN/TensorFlow+GPU流水账 https://formoon.github.io/2018/01/31/ubuntu-cuda-cudnn-tensorflow-setting/
2018-01-29 resource fork, Finder information, or similar detritus not allowed https://formoon.github.io/2018/01/29/xcode-compile-error-1/
2018-01-29 macOS webview编程 https://formoon.github.io/2018/01/29/mac-webview-program/
2018-01-24 新麦装机问题汇 https://formoon.github.io/2018/01/24/new-mac-install/
2018-01-22 比特币核心算法ECDSA电子签名在线演示 https://formoon.github.io/2018/01/22/bitcoin-and-ecdsa/
2018-01-18 从锅炉工到AI专家(11)(END) https://formoon.github.io/2018/01/18/tensorFlow-series-11/
2018-01-18 gem update 升级错误解决 https://formoon.github.io/2018/01/18/gem-update-error-solve/
2018-01-18 比特币核心概念及算法 https://formoon.github.io/2018/01/18/bitcoin-and-blockchain/
2018-01-17 从锅炉工到AI专家(10) https://formoon.github.io/2018/01/17/tensorFlow-series-10/
2018-01-17 Python2中文处理纪要 https://formoon.github.io/2018/01/17/python2-chn-process/
2018-01-16 从锅炉工到AI专家(9) https://formoon.github.io/2018/01/16/tensorFlow-series-9/
2018-01-15 从锅炉工到AI专家(8) https://formoon.github.io/2018/01/15/tensorFlow-series-8/
2018-01-12 从锅炉工到AI专家(7) https://formoon.github.io/2018/01/12/tensorFlow-series-7/
2018-01-11 从锅炉工到AI专家(6) https://formoon.github.io/2018/01/11/tensorFlow-series-6/
2018-01-11 从锅炉工到AI专家(5) https://formoon.github.io/2018/01/11/tensorFlow-series-5/
2018-01-10 从锅炉工到AI专家(4) https://formoon.github.io/2018/01/10/tensorFlow-series-4/
2018-01-10 Octave Fontconfig报错解决 https://formoon.github.io/2018/01/10/octave-fontconfig-warning/
2018-01-10 5分钟搭建一个quic服务器 https://formoon.github.io/2018/01/10/5mins-support-quic/
2018-01-09 从锅炉工到AI专家(3) https://formoon.github.io/2018/01/09/tensorFlow-series-3/
2018-01-08 从锅炉工到AI专家(2) https://formoon.github.io/2018/01/08/tensorFlow-series-2/
2018-01-08 从锅炉工到AI专家(1) https://formoon.github.io/2018/01/08/tensorFlow-series-1/
2018-01-04 解决本博客在手机浏览器拖动卡顿问题 https://formoon.github.io/2018/01/04/solve-mobile-browser-pull-problem/
2018-01-04 OpenCV中的照片剪裁 https://formoon.github.io/2018/01/04/opencv-photo-crop/
2018-01-04 OpenCV中的亮度对比度调整及其自动均衡 https://formoon.github.io/2018/01/04/opencv-brightness-and-contrast/
2018-01-03 Mac电脑C语言开发的入门帖 https://formoon.github.io/2018/01/03/c-hello-world-for-mac/
2018-01-02 如何看到微信小程序的源码 https://formoon.github.io/2018/01/02/wechat-mini-app-rd/
2018-01-02 使用人工辅助点达成更优白平衡 https://formoon.github.io/2018/01/02/opencv-whitebalance-with-point-confirm/
2017-12-29 不使用插件建立jekyll网站sitemap https://formoon.github.io/2017/12/29/sitemap_of_jekyll/
2017-12-29 safari11如何访问自签名https网站 https://formoon.github.io/2017/12/29/safari-self-signed-https/
2017-12-29 赶个时髦,给自己的博客添加一个微信二维码 https://formoon.github.io/2017/12/29/add-wechat-qrcode-on-your-blog/
2017-12-28 被Docker/VMWare宠坏的孩子们,还记得QEMU吗? https://formoon.github.io/2017/12/28/qemu-on-mac/
2017-12-28 在网页显示数学公式 https://formoon.github.io/2017/12/28/mathjax-in-page/
2017-12-28 使用SDL2显示一张图片 https://formoon.github.io/2017/12/28/hello-world-sdl2/
2017-12-27 如何规范的把进程放到Linux后台运行 https://formoon.github.io/2017/12/27/selinux-run-app-in-background/
2017-12-27 两种方法操作其它mac应用的窗口 https://formoon.github.io/2017/12/27/move-other-app-window-on-mac/
2017-12-25 自己动手,装一个液晶电视 https://formoon.github.io/2017/12/25/lcd-tv-diy/
2017-12-25 半小时完成一个湿度温度计 https://formoon.github.io/2017/12/25/arduino-hygrothermograph/
2017-12-22 MacPro4,1升级到MacPro5,1 https://formoon.github.io/2017/12/22/macpro41-upgrade/
2017-12-22 CameraBox个人讲台客户端使用说明 https://formoon.github.io/2017/12/22/camerabox-manual/
2017-12-21 一段使用Educast抠像混屏直播的视频展示 https://formoon.github.io/2017/12/21/streaming-mix/
2017-12-21 七牛对象存储的使用 https://formoon.github.io/2017/12/21/qiniu-storage/
2017-12-21 Educast视频直播控制台使用说明 https://formoon.github.io/2017/12/21/educast-manual/
2017-12-20 批量自动重命名音乐文件 https://formoon.github.io/2017/12/20/mp3-m4a-rename/
2017-12-20 把Markdown文本发布到微信公众号文章 https://formoon.github.io/2017/12/20/markdown-to-html-and-wechat/
2017-12-19 Javascript已加入AppleScript全家桶 https://formoon.github.io/2017/12/19/jxa-appscript/
2017-12-19 分享一个很通用的Makefile https://formoon.github.io/2017/12/19/Makefile-skill/
2017-12-18 在Mac电脑编译c51程序 https://formoon.github.io/2017/12/18/c51-on-mac/
2017-12-16 golang子进程的启动和停止 https://formoon.github.io/2017/12/16/ubuntu-golang-stop-child-process/
2017-12-15 Ubuntu16.04LTS appstreamcli报错的处理 https://formoon.github.io/2017/12/15/ubuntu-appstreamcli-error/
2017-12-14 AngularJS2+调用原有的js脚本 https://formoon.github.io/2017/12/14/angular4-ts-and-local-js/
2017-12-14 在国内使用golang的小技巧 https://formoon.github.io/2017/12/14/use-golang-in-china/
2017-12-14 Angular2+的两个小技巧 https://formoon.github.io/2017/12/14/angular4-hotkeys-and-detect-browser/
2017-12-14 Unix程序员的Win10二三事 https://formoon.github.io/2017/12/14/Unix%E7%A8%8B%E5%BA%8F%E5%91%98%E7%9A%84win10%E4%BA%8C%E4%B8%89%E4%BA%8B/
2017-12-13 在Ubuntu上搭建kindle gtk开发环境 https://formoon.github.io/2017/12/13/hello-world-for-kindle/
2017-12-13 苹果手机上下载的文件在哪里? https://formoon.github.io/2017/12/13/download-on-ios/
2017-12-11 K60平台智能车开发工作随手记 https://formoon.github.io/2017/12/11/smart-car-k60-develope/
2017-12-11 使用Jekyll和github搭建自己的个人博客 https://formoon.github.io/2017/12/11/setting-your-own-jekyll-blog/
2017-12-11 使用ffmpeg做简单的音视频剪辑 https://formoon.github.io/2017/12/11/ffmpeg-auido-video-edit/
2017-12-08 安装Homebrew https://formoon.github.io/2017/12/08/install-homebrew-on-mac/
2017-12-08 在Mac上安装ffmpeg https://formoon.github.io/2017/12/08/install-ffmpeg-on-mac/
2017-12-08 Hello World https://formoon.github.io/2017/12/08/hello-world/
共 81 篇文章
close spider ...

Reference links

Scrapy documentation (Chinese)
XPath tutorial
CSS selector reference
