Crawling Xici proxies (xicidaili) with XPath

Since I'm working with Scrapy, where XPath is the more convenient way to extract page information, I wrote this article.

The XPath introduction I watched on Bilibili: https://www.bilibili.com/video/av30320885?from=search&seid=17721548966745663758

 

Understanding XPath

  1. What is XPath

    1. A language for navigating XML documents; HTML is not strictly XML, but it parses into the same kind of tree, so XPath is widely used to extract data from HTML

    2. Almost every programming language can use XPath, for example Java and C

    3. Besides XPath there are other ways to parse XML, such as BeautifulSoup, lxml, DOM, SAX, JSDOM, DOM4J, minixml, etc.

  2. XPath syntax

    XPath syntax really comes down to three categories:

    1. Hierarchy: / selects a direct child, // skips levels and matches descendants at any depth

    2. Attributes: @ accesses an attribute

    3. Functions: contains(), text(), etc.
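To make the first two categories concrete, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. The sample markup is made up for illustration. ElementTree only implements a subset of XPath: hierarchy steps and attribute predicates work, but functions such as contains() and text() need a full XPath engine (lxml, or Scrapy's selectors).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (ElementTree needs valid XML).
SAMPLE = """
<html>
  <body>
    <div id="main">
      <p>first</p>
      <section><p>nested</p></section>
    </div>
    <a href="/nn/1" class="page">1</a>
    <a class="page">no link</a>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Hierarchy with "/": each step only descends to direct children.
direct = [p.text for p in root.findall("./body/div/p")]

# Hierarchy with "//": matches <p> at any depth below the root.
anywhere = [p.text for p in root.findall(".//p")]

# Attributes with "@": keep only the <a> elements that have an href.
links = [a.get("href") for a in root.findall(".//a[@href]")]

print(direct)
print(anywhere)
print(links)
```

The direct-child query only sees the p directly under the div, while the // query also picks up the one nested inside section.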

 

Using XPath

  1. Using XPath in a browser

  This works much as the video describes, but since I want to crawl Xici proxies, I did the analysis directly on the Xici proxy site:

  https://www.xicidaili.com/nn/1

  

 

   A brief look at the page structure:

  

 

   The IPs turn up in two kinds of rows: some inside <tr class="odd"> ... </tr>, the others inside <tr class> ... </tr>

  Either way they are all tr nodes (apart from the first tr, which is not an IP row)

  The tr tags sit many levels below the root, though, so we use //tr

  

 

   You can see there are 101 matches: 100 of them are IP rows, and the extra one is the first tr, shown in the blue box.

  Next, a closer look inside a tr:

  

 

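   Each tr contains several td nodes: the IP is in the second td, the port in the third, and the type in the fourth. These are the fields we need. The video did not cover picking one node out of several siblings at the same level, so I looked it up: //tr/td[2] selects the IP.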

  

 

   You can see 100 search results here, that is, 100 IPs. The port and type can be obtained the same way.
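The positional step can be tried outside the browser too. Below is a small sketch with the standard-library ElementTree, which supports the td[2] form. The table is a made-up stand-in for the real page: the addresses are documentation-range placeholders, and the header row using th cells is my assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows mimicking the layout described above; the addresses are
# placeholders, and the <th> header row is an assumption about the page.
TABLE = """
<table>
  <tr><th>flag</th><th>IP</th><th>port</th><th>type</th></tr>
  <tr class="odd"><td></td><td>192.0.2.10</td><td>8080</td><td>HTTP</td></tr>
  <tr class=""><td></td><td>192.0.2.11</td><td>3128</td><td>HTTPS</td></tr>
</table>
"""

root = ET.fromstring(TABLE)

# //tr matches every row, including the header.
rows = root.findall(".//tr")

# td[N] picks the N-th td child of each tr (XPath positions start at 1).
# The header row has no td cells, so it drops out on its own.
ips = [td.text for td in root.findall(".//tr/td[2]")]
ports = [td.text for td in root.findall(".//tr/td[3]")]
types = [td.text for td in root.findall(".//tr/td[4]")]

# Pairing the columns like this assumes every data row has all three cells.
proxies = list(zip(ips, ports, types))
print(len(rows), len(ips))
print(proxies)
```

In this sample the row count is one higher than the IP count, the same pattern as the 101 versus 100 results seen in the browser.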

 

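  2. Using XPath in Scrapy

  With the expressions worked out, we collect and print the proxy IPs from the first page in Scrapy. In the spider that crawls the page, write: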

  

 

   You can see that 100 IP addresses are printed:

  

 

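   Of course, before crawling you need to set Scrapy's User-Agent and also configure the robots.txt handling; otherwise you will only get empty results.

  This is only the first page of IP addresses. Validating and storing the IPs, dealing with anti-crawling measures, and so on are not handled yet; I'll leave those for another post.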
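The post does not say which values were used, so the fragment below is only a guess at the two settings involved. USER_AGENT and ROBOTSTXT_OBEY are standard Scrapy settings; the browser string and the False value are my assumptions.

```python
# settings.py (fragment); both values are assumptions, not from the post.

# Present a browser-like User-Agent instead of Scrapy's default one.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/80.0.3987.122 Safari/537.36"
)

# Controls whether Scrapy fetches robots.txt and filters requests by it.
# Projects created with "scrapy startproject" set this to True.
ROBOTSTXT_OBEY = False
```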

 


Origin www.cnblogs.com/Cl0ud/p/12322760.html