How to use Xpath selectors in Scrapy to extract target information from HTML (two methods)

A while ago we explained how to start a Scrapy project and shared some introductory tips on Scrapy crawlers. Readers who have not caught up yet can check these articles:

A hands-on guide to creating your first project with the Scrapy crawler framework (Part I)

A hands-on guide to creating your first project with the Scrapy crawler framework (Part II)

Tips for running and debugging Scrapy crawler projects (Part I)

Tips for running and debugging Scrapy crawler projects (Part II)

Today we introduce how to use Xpath selectors in Scrapy to extract target information from HTML. Scrapy offers two ways to extract data: Xpath selectors and CSS selectors. This article focuses on Xpath selectors first, again using the Bole Online website as the example.

 

1, Open the website and pick any article to view, as shown below.

 

The information we need to extract is mainly the title, date, topic, comment count, body text, and so on.

2, Next we can start writing code. The base code is shown below. Note that the value of the start_urls parameter must be changed to the URL of a specific article; the rest of the code stays unchanged.

 

3, Go back to the article page, press F12 or right-click on the page and choose "Inspect (N)" to open the browser's developer tools panel, as shown below.

 

4, Click the small icon in the red box in the figure. It links the rendered page with its source, so selecting an element on the page highlights the matching markup and makes it easy to locate a tag.

 

5, As shown below, after clicking the icon above and then selecting the title on the page, the source panel automatically jumps to the matching position. You can see that the title sits inside an h1 tag.

 

 

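6, We can now write the Xpath expression for the title based on the page hierarchy shown in the figure above. Here is a rather clumsy method first: list every level from top to bottom, "/html/body/div[1]/div[3]/div[1]/div[1]/h1". Notice how tedious that is. A prominent item such as the title is still relatively easy to extract, but for information tucked away in a corner the expression is much harder to write, and this approach is error-prone and inefficient. Don't be discouraged, though: the browser offers a shortcut that lets us copy the Xpath expression directly. Right-click the title or the target information, choose "Copy", then "Copy Xpath" to copy the Xpath expression of that tag. The process is shown below.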

 

You can see that the copied Xpath expression is "//*[@id="post-113659"]/div[1]/h1", where id="post-113659" is an identifier that belongs to this article, as shown below.

 

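With this identifier we can locate the tag quickly. The copied expression is sometimes not the same as the one we wrote by hand with the clumsy method. Next, we output the content matched by each of the two Xpath expressions.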

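7, Write the Xpath expressions into the main Scrapy spider file, then debug the main.py file we defined earlier. You will get the output shown below. You can see that the data in selector1 and selector2 is the content from the web page, and that the two are identical.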

 

Afterwards, click the stop button to exit Debug mode.

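8, From the figure above we can see that the selector also extracted the h1 tags themselves, while what we want is only the data inside the tag. In that case, simply append the text() function to the end of the Xpath expression to extract the data it contains.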

 

 

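From this article we can see that although the Xpath expression we wrote ourselves and the one returned by the browser look different, they return the same data once the program runs. In other words, the Xpath expression for a given piece of target data is not unique. As long as it follows Xpath syntax, even a very short expression is fine; use whichever you prefer. In addition, in the Scrapy crawler framework the text() function is often combined with Xpath expressions to extract the data content of a node.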


Origin www.cnblogs.com/dcpeng/p/10990499.html