Scrapy Selectors

Copyright: free to repost. Source: https://blog.csdn.net/qq_32662595/article/details/85113851

1. Several existing libraries are commonly used to extract data from web pages:

  • BeautifulSoup is a very popular web scraping library among Python programmers. It builds a Python object from the structure of the HTML code and handles bad markup reasonably well, but it has one drawback: it is slow.
  • lxml is an XML parsing library (it can also parse HTML) with a Pythonic API based on ElementTree. lxml is not part of the Python standard library.

2. Scrapy has its own mechanism for extracting data. These objects are called selectors, because they "select" certain parts of the HTML document, as specified by CSS or XPath expressions.

  • XPath is a language for selecting nodes in XML documents; it can also be used with HTML.
  • CSS is a language for applying styles to HTML documents. It defines selectors that associate those styles with specific HTML elements.

3. A Scrapy Selector instance is constructed from text or from a TextResponse object. It automatically chooses the best parsing rules (XML vs HTML) based on the input type.
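
Selector methods:
xpath(query)
	Finds the nodes matching the XPath query and returns the result as a SelectorList instance with all elements flattened. The list elements also implement the Selector interface.
extract()
	Serializes the matched nodes and returns them as a list of unicode strings. Percent-encoded content is unquoted.
css(query)
	Applies the given CSS selector and returns a SelectorList instance.
	query is a string containing the CSS selector.
re(regex)
	Applies the given regex and returns a list of unicode strings with the matches.
	regex can be either a compiled regular expression or a string that will be compiled with re.compile(regex).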

HTML Source:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

  • Constructing selectors

    from scrapy.selector import Selector
    from scrapy.http import HtmlResponse

    # body is assumed to hold the page's HTML as a str, e.g. the HTML source above
    response = HtmlResponse(url='http://baidu.com', body=body, encoding='utf8')
    Selector(response=response).xpath('//span/text()').extract()
    # The response object also exposes a selector through its .selector attribute
    response.selector.xpath('//span/text()').extract()
    
  • Using selectors

    response.selector.xpath('//title/text()')
        [<Selector (text) xpath=//title/text()>]
    response.css('title::text')
        [<Selector (text) xpath=//title/text()>]
    
  • Nesting selectors

    links = response.xpath('//a[contains(@href, "image")]')
    links.extract()
    # Result:
    [u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
     u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
     u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
     u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
     u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
    
  • Using selectors with regular expressions

    response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
    [u'My image 1',
     u'My image 2',
     u'My image 3',
     u'My image 4',
     u'My image 5']
    
