Crawling papers with Python Scrapy, and solving the "Unhandled error in Deferred" problem

Foreword

I recently needed to survey some of the latest research in machine learning, which meant looking through a number of relevant papers, so I wrote a simple crawler script for it. It is very simple and uses the Scrapy framework.

Along the way I ran into the "Unhandled error in Deferred" error. Most answers on the Internet say it is caused by a pypiwin32 problem, but my pypiwin32 was fine and I kept getting "Unhandled error in Deferred" with no further detail. It turned out that I had been silencing the log output myself by running:

scrapy crawl arxiv --nolog

Because of that I could not locate where the error actually came from. So if you run into the same situation, look at the log first:

scrapy crawl arxiv

In my case the error was finally traced to the pipelines.py file; the specific cause is explained below. In short, if you run into the "Unhandled error in Deferred" problem, there may well be an error in pipelines.py, and that is where you should look first.

This article is aimed at students who are just getting started.

It proceeds step by step towards the final goal, and along the way compares the role each file plays in a Scrapy project.

All Code: https://github.com/Mryangkaitong/python_Crawler/tree/master/MolMachineLearning

Recommended reading:

Introductory blog post: https://www.runoob.com/w3cnote/scrapy-detail.html

Scrapy English documentation: https://doc.scrapy.org/en/latest/topics/spiders.html

Scrapy Chinese documentation: https://oner-wv.gitbooks.io/scrapy_zh/content/?q=

First, an overview of the whole structure:

items.py: the data structure, similar to a dictionary

arxiv.py under the spiders folder: the spider file; the name can be chosen according to your needs

pipelines.py: the pipeline file, responsible for processing the crawled data, including filtering, saving, downloading, and so on

settings.py: the configuration file; its most important role here is to activate the pipeline file


The site to crawl is https://arxiv.org/list/cs.LG/recent

Installation is very simple; here I used Anaconda:

conda install Scrapy

After installation, open cmd and type scrapy to test that it works:

Now let's start our little project.

The first stage

First, create a project called MolMachineLearning:

scrapy startproject MolMachineLearning

Then go into the newly created project folder:

cd MolMachineLearning

Its directory structure looks something like this:

Then we generate the spider file:

scrapy genspider arxiv "arxiv.org"

You can see that the spider file has been generated in the spiders folder.

Write the data structure: items.py
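As a rough sketch (the actual code is shown in the original screenshots), the item at this stage might look like the following; the class name MolmachinelearningItem is what Scrapy's template would generate for this project name, and a single name field holds the paper title:

# MolMachineLearning/items.py
import scrapy

class MolmachinelearningItem(scrapy.Item):
    # title of the paper
    name = scrapy.Field()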

 

Write the spider file: arxiv.py

Of course, we can create more spiders.

We can use

scrapy list

to see which spiders have been created.

Here we manually change start_urls to our target URL:

Then we rewrite the parse part:

As for how to write the XPath for the elements, you can look at a tutorial, or use another approach: open the Chrome browser developer tools,

right-click the element whose XPath you want, and choose Copy > Copy XPath.
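A minimal sketch of what the spider might look like at this stage; the XPath for the titles is only an assumption (copied from the browser as described above) and may need adjusting for the current page layout:

# MolMachineLearning/spiders/arxiv.py
import scrapy
from MolMachineLearning.items import MolmachinelearningItem

class ArxivSpider(scrapy.Spider):
    name = 'arxiv'
    allowed_domains = ['arxiv.org']
    # start_urls changed by hand to the listing page we want to crawl
    start_urls = ['https://arxiv.org/list/cs.LG/recent']

    def parse(self, response):
        # each paper title sits in a "list-title" div on the listing page (assumption)
        for title in response.xpath('//div[@class="list-title mathjax"]/text()').extract():
            title = title.strip()
            if not title:
                continue
            item = MolmachinelearningItem()
            item['name'] = title
            yield item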

 

With that, our spider file is written!

Next, write the pipelines.py file

I ran into two problems in this part:

At first I wrote it like this (wrong):

There are two differences between the before and after versions: first, the path passed to open() was missing the r prefix,

and second, utf8 encoding was used.

These led to the following errors:

Without the r prefix, the backslashes in the path are interpreted as escape characters.

As for the second problem, after using utf8 I found that the generated txt file was empty, i.e. the data was never written to it. Removing the encoding fixed it; the exact cause is still unclear to me.
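For reference, a minimal sketch of the corrected write-out, with the r prefix on the path and without the explicit encoding; the path itself is just a placeholder for wherever you want the txt file:

# MolMachineLearning/pipelines.py (sketch)
class MolmachinelearningPipeline(object):
    def process_item(self, item, spider):
        # raw string so the backslashes in the Windows path are not treated as escapes
        with open(r'C:\papers\papers.txt', 'a') as f:
            f.write(item['name'] + '\n')
        return item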

Finally, configure settings.py

The purpose of this is to make the process_item function in pipelines.py actually run; without this configuration, pipelines.py will not be executed at all.
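A sketch of the relevant setting, assuming the pipeline class keeps the default name MolmachinelearningPipeline:

# MolMachineLearning/settings.py (excerpt)
ITEM_PIPELINES = {
    'MolMachineLearning.pipelines.MolmachinelearningPipeline': 300,
}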

Finally, just run it:

scrapy crawl arxiv --nolog

 

Note that cmd is opened and the command is run in the following directory:

Result:

The second stage:

In the first stage we extracted the paper titles; the next goal is to download those papers. The overall framework is the same as above, so I will just point out the differences:

1: Add a url field

items.py and arxiv.py are modified as follows:
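Roughly, the changes amount to the following sketch; the XPaths are assumptions, and the url value is deliberately wrapped in a list for the reason explained in point 2 below:

# items.py: add a url field
class MolmachinelearningItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()    # link to the paper's pdf

# arxiv.py: inside parse(), pair each title with its pdf link
titles = response.xpath('//div[@class="list-title mathjax"]/text()').extract()
hrefs = response.xpath('//a[@title="Download PDF"]/@href').extract()
for title, href in zip(titles, hrefs):
    item = MolmachinelearningItem()
    item['name'] = title.strip()
    item['url'] = [response.urljoin(href)]   # note the enclosing list
    yield item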

2: Use the file downloader FilesPipeline

pipelines.py is modified as follows:

It should be emphasized that a new class, DownloadPapersPipeline, is written here, which subclasses FilesPipeline. At the same time, note this line:

for paper_url in item['url']:

In this line, item['url'] is not, as one might expect, a collection of many urls; it is a single one. That is why, when extracting the url field in arxiv.py, I deliberately wrapped it in a list. Without the [], for a url such as https://arxiv.org/pdf/1904.05876.pdf, for paper_url in item['url'] would iterate over the characters and the first thing extracted would just be 'h'; wrapped as [https://arxiv.org/pdf/1904.05876.pdf], the correct url is extracted. The underlying reason is that pipelines.py is a pipeline file: it does not wait for the spider (arxiv) to finish with all the data stored in the Item; instead, as soon as the spider has processed one record, that record is sent to the pipeline, so what arrives is a single record at a time (my understanding here may be off, corrections welcome). In short, the point is: add the [].
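A sketch of what the download pipeline might look like, assuming it only overrides get_media_requests to issue one download request per url in item['url']:

# MolMachineLearning/pipelines.py (sketch)
import scrapy
from scrapy.pipelines.files import FilesPipeline

class DownloadPapersPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        # item['url'] is a list holding one pdf url, so this loop
        # yields exactly one download request per paper
        for paper_url in item['url']:
            yield scrapy.Request(paper_url)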

3: Set up the configuration file

Activate the corresponding class in the pipeline, and set the path where the pdf files are saved.
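In settings.py this amounts to registering the new pipeline class and pointing FILES_STORE at a download directory; the path here is just an example, and the priority numbers are assumed from the description in the third stage below:

# MolMachineLearning/settings.py (excerpt)
ITEM_PIPELINES = {
    'MolMachineLearning.pipelines.MolmachinelearningPipeline': 300,
    'MolMachineLearning.pipelines.DownloadPapersPipeline': 1,
}
# downloaded pdfs end up under this directory (in a "full" subfolder by default)
FILES_STORE = r'C:\papers'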

Running result:

You will see an extra folder named full; open it:

Our files have been downloaded, but there is still one problem: the file names are not in the form we want.

We can use the name field we extracted to name the files. Note that on Windows, file names may not contain the characters / \ : * ? " < > |, and as you can see from the example above, some of our paper titles clearly contain colons, so we have to strip these special characters. There are many ways to do this; I used the re module. I will not explain it further, since it is not the focus of this article; interested readers can look it up.

Modify pipelines.py as follows:
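One possible way to do the renaming, as a sketch: pass the title along with each download request and override FilesPipeline's file_path so the saved file is named after the cleaned title; the regular expression simply strips the characters Windows forbids:

# MolMachineLearning/pipelines.py (sketch)
import re
import scrapy
from scrapy.pipelines.files import FilesPipeline

class DownloadPapersPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        for paper_url in item['url']:
            # carry the title along so file_path can use it
            yield scrapy.Request(paper_url, meta={'name': item['name']})

    def file_path(self, request, response=None, info=None, *, item=None):
        # strip characters that are not allowed in Windows file names
        clean_name = re.sub(r'[/\\:*?"<>|]', '', request.meta['name'])
        # returned path is relative to FILES_STORE
        return clean_name + '.pdf'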

Running result:

 

The third stage:

If you just want to see how machine learning is applied in various fields, or what machine learning papers have been published recently, the example above is already enough. But what if we only want machine learning papers in one particular direction? Then we have to do some filtering. Suppose we only want papers related to deep learning: we can filter for titles containing "Deep", and of course, to widen the net, we can add more keywords; for example CNN, RNN, LSTM and so on usually indicate deep learning papers. Next, let's do exactly that:

pipelines.py is modified as follows:

As you can see, only the part in the red box has been added: only items that satisfy the condition are returned; the rest are dropped.
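A sketch of the filtering logic, assuming items whose titles do not match one of the keywords are dropped with DropItem; the keyword list itself is only an example:

# MolMachineLearning/pipelines.py (sketch)
import re
from scrapy.exceptions import DropItem

class MolmachinelearningPipeline(object):
    # keywords that mark a title as deep-learning related (example list)
    KEYWORDS = re.compile(r'Deep|CNN|RNN|LSTM', re.IGNORECASE)

    def process_item(self, item, spider):
        if self.KEYWORDS.search(item['name']):
            return item              # keep: passed on to the next pipeline
        raise DropItem('not a deep learning paper: %s' % item['name'])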

settings.py is modified as follows:

One thing to note: compared with before, the 300 and 1 here have been swapped to 1 and 300. The numbers themselves do not matter much; what matters is their relative size: the smaller the number, the higher the priority. Here MolmachinelearningPipeline gets a higher priority than DownloadPapersPipeline, so MolmachinelearningPipeline runs first. Since it filters the items internally, data that is filtered out is not passed on to the other pipelines, so the items reaching DownloadPapersPipeline are already the filtered ones, which is exactly what we want.
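In settings.py this just swaps the two numbers so the filter pipeline runs before the download pipeline:

# MolMachineLearning/settings.py (excerpt)
ITEM_PIPELINES = {
    'MolMachineLearning.pipelines.MolmachinelearningPipeline': 1,    # filter first
    'MolMachineLearning.pipelines.DownloadPapersPipeline': 300,      # then download
}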

Running result:

Now only deep learning papers are downloaded. Note that how precisely you can target the papers you want depends on how you use regular expressions with the re module; it has little to do with the crawler itself, so I will not go into it further here.

The fourth stage:

At this point we are actually not done yet: what we crawled above is only the data from a single page,

whereas we want to crawl all 347 recent papers. What do we do? The most common approach is recursive crawling,

that is, adding something like the following to the parse function in the spider file:

# requires: from scrapy import Request
urls = response.xpath('XPath of your pagination links/@href').extract()
for url in urls:
    yield Request(url, callback=self.parse)

But my case here is a bit special:

You will find that not all page links are shown; there is a ... in the middle, which parses to .., so that part of the papers cannot be reached. There are several ways around this. For one, the site is well designed: you can click "more" to show more entries per page, like this:

The other option is to click "all" to show everything; then look at the URL:

https://arxiv.org/list/cs.LG/pastweek?show=347

(The listing is in fact split up by month.)

The spider file is modified as follows:
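Roughly, the change just amounts to starting from the "show all" URL instead of the recent page, for example:

# arxiv.py: crawl the full listing instead of only the first page
start_urls = ['https://arxiv.org/list/cs.LG/pastweek?show=347']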

Running result:

Closing remarks:

In fact, even if we download everything, it is only around 300 papers, so there is no real need to filter; but once the amount of data grows, what we did in the fourth stage becomes quite necessary. The reason for doing all of it here is to show as much of Scrapy's usage as possible.

The main purpose of this article:

To show the basic framework through this demo; it is important enough to write out one more time:

items.py: the data structure, similar to a dictionary

arxiv.py under the spiders folder: the spider file; the name can be chosen according to your needs

pipelines.py: the pipeline file, responsible for processing the crawled data, including filtering, saving, downloading, and so on

settings.py: the configuration file; its most important role is to activate the pipeline file

 

 


Origin blog.csdn.net/weixin_42001089/article/details/89023114