Scrapy framework: extracting data with Selector

The core task in extracting data from a page is parsing the HTML text, which is usually handled by one of the parsing libraries commonly used in Python:

  BeautifulSoup is a very popular parsing library with a simple API, but it parses slowly.

  lxml is an XML parsing library written in C; it parses quickly, but its API is relatively complex.

The Selector class in Scrapy is built on top of the lxml library and simplifies the API. To use it, first select the data to be extracted from the page with an XPath or CSS expression, and then extract it.
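
A minimal sketch of that workflow, assuming a made-up HTML string and variable names:

      from scrapy.selector import Selector

      html = '<html><body><h1>Hello</h1><ul><li>a</li><li>b</li></ul></body></html>'   # invented sample page
      selector = Selector(text=html)        # build a Selector from raw HTML text

      # step 1: select the nodes of interest (both calls return a SelectorList)
      selector.xpath('//li')
      selector.css('ul > li')

      # step 2: extract the data from the selection (methods described below)
      selector.xpath('//li/text()').extract()    # ['a', 'b']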

Extract data

The selected content can be extracted by calling the following methods of a Selector or SelectorList object:

  extract()    returns the Unicode string of the selected content.

  extract_first()    SelectorList only; returns the result of calling extract() on the first Selector object in the list. It is usually called when the SelectorList contains only one Selector object.

  re()     uses a regular expression to extract a part of the selection.

    for example

      selector.xpath('.//b/text()').extract()    # ['price: 99.00 yuan', 'price: 88.00 yuan', 'price: 88.00 yuan']

      selector.xpath('.//b/text()').re(r'\d+\.\d+')       # ['99.00', '88.00', '88.00']

  re_first()    SelectorList only; calls re() on the list and returns the first matching string.

      selector.xpath('.//b/text()').re_first(r'\d+\.\d+')   # '99.00'
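
Putting these together, a hedged, self-contained sketch (the HTML snippet is invented for illustration; raw strings are used for the regular expressions):

      from scrapy.selector import Selector

      html = '<div><b>price: 99.00 yuan</b><b>price: 88.00 yuan</b></div>'   # made-up sample markup
      selector = Selector(text=html)

      print(selector.xpath('.//b/text()').extract())               # ['price: 99.00 yuan', 'price: 88.00 yuan']
      print(selector.xpath('.//b/text()').extract_first())         # 'price: 99.00 yuan'
      print(selector.xpath('.//b/text()').re(r'\d+\.\d+'))         # ['99.00', '88.00']
      print(selector.xpath('.//b/text()').re_first(r'\d+\.\d+'))   # '99.00'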

 

In actual development there is almost no need to create a Selector object manually: the Response object builds one from its own body and exposes xpath() and css() shortcuts that delegate to it.

       response.xpath('.//h1/text()').extract()        # ['song', 'shi', 'chao']

       response.css('li::text').extract()                # [ 'song','shi','chao']
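
For instance, inside a spider callback the shortcuts can be used directly on the response; the spider name, start URL and field names below are only placeholders:

       import scrapy

       class DemoSpider(scrapy.Spider):
           name = 'demo'                        # placeholder name
           start_urls = ['http://example.com']  # placeholder URL

           def parse(self, response):
               # response.xpath()/response.css() delegate to the Selector built from the response body
               title = response.xpath('//h1/text()').extract_first()
               items = response.css('li::text').extract()
               yield {'title': title, 'items': items}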

XPath selector

  XPath (XML Path Language) is a language for addressing parts of an XML document. An XML document (HTML can be handled the same way) is a tree composed of a series of nodes.

Basic syntax

    /       Selects the document root; describes an absolute path starting from the root.

    ./      Selects from the current node (for example, when an expression is run on a Selector that was itself extracted from part of the page; without the leading "./" the expression is applied to the whole document).

.       Selects the current node; describes a relative path.

..      Selects the parent of the current node; describes a relative path.

ELEMENT     Selects all ELEMENT element nodes among the child nodes.

//ELEMENT   Selects all ELEMENT element nodes among the descendant nodes.

*       Selects all element child nodes.

text()    Selects all text child nodes.

@ATTR   Selects the attribute node named ATTR.

@*   Selects all attribute nodes.

[predicate]   A predicate is used to find a specific node or a node containing a specific value.

 

Example

  response.xpath('/html/body/div')     # select all divs under body

  response.xpath('//a')                # select all a elements in the document

  response.xpath('/html/body//div')    # select every div under body, no matter how deeply nested

  response.xpath('//a/text()')         # select the text of all a elements

  response.xpath('/html/div/*')        # select all element child nodes of div

  response.xpath('//div/*/img')        # select all img elements that are grandchildren of a div

  response.xpath('//img/@src')         # select the src attribute of all img elements

  response.xpath('//a[1]/img/@*')      # select all attributes of the img under the first a

  response.xpath('//a[2]')             # the second a element

  response.xpath('//a[last()]')        # the last a element

  response.xpath('//a[last()-1]')      # the second-to-last a element

  response.xpath('//a[position()<=3]') # use the position() function to select the first three a elements

  response.xpath('//div[@id]')         # select all divs that have an id attribute

  response.xpath('//div[@id="song"]')  # select all divs whose id attribute is "song"

  response.xpath('//p[contains(@class, "song")]')   # select p elements whose class attribute contains "song"

  response.xpath('//div/a | //div/p')  # union: the target may be either an a or a p under a div
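
A few of these expressions in a runnable sketch against an invented page:

  from scrapy.selector import Selector

  html = '''
  <html><body>
    <div id="song"><a href="https://a.cn/1">one</a><a href="https://a.cn/2">two</a></div>
    <div class="song list"><p>hello</p></div>
  </body></html>
  '''                                   # made-up sample markup
  sel = Selector(text=html)

  print(sel.xpath('//a/@href').extract())                            # ['https://a.cn/1', 'https://a.cn/2']
  print(sel.xpath('//a[last()]/text()').extract_first())             # 'two'
  print(sel.xpath('//div[@id="song"]/a[1]/text()').extract_first())  # 'one'
  print(sel.xpath('//div[contains(@class, "song")]/p/text()').extract())  # ['hello']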

 

CSS selector

  CSS stands for Cascading Style Sheets. CSS selectors are not as expressive as XPath; when selecting, Scrapy translates the CSS expression into an equivalent XPath expression and then calls the xpath method.
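
As a sketch of that equivalence (the HTML is invented for the example), the same node can be reached either way; the handwritten XPath below reaches the same node but is not necessarily the exact generated expression:

  from scrapy.selector import Selector

  html = '<div id="container"><ul><li><a href="/a">x</a></li></ul></div>'   # made-up markup
  sel = Selector(text=html)

  print(sel.css('#container li a::text').extract())                  # ['x']
  print(sel.xpath('//*[@id="container"]//li//a/text()').extract())   # ['x']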

*        select all nodes

#container         selects the node whose id is container

.container          selects the node whose class contains container

 li a    selects all a nodes under all li

ul + p  selects the p element that immediately follows each ul (adjacent sibling)

#container > ul     selects ul nodes that are direct children of the node whose id is container

a[class]   selects all a elements with class attribute

a[href="http://b.com"]   selects a elements whose href attribute is "http://b.com"

a[href*='job']   selects a elements whose href contains 'job'

a[href^='https']  selects a elements whose href starts with 'https'

a[href$='cn']    selects a elements whose href ends with 'cn'

 

response.css('div a::text').extract()        # the text of all a elements under all divs

response.css('div a::attr(href)').extract()  # the href values of those a elements

response.css('div>a:nth-child(1)')           # select the first a of each div; '>' restricts matching to child nodes, excluding grandchildren

response.css('div:not(#container)')          # select all divs whose id is not container

response.css('div:first-child>a:last-child') # the last a child inside the first div
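
A short sketch combining a few of these CSS extensions; the sample HTML and URLs are invented:

from scrapy.selector import Selector

html = '''
<div>
  <a class="nav" href="https://example.cn">home</a>
  <a class="nav" href="http://example.com/job/list">jobs</a>
</div>
'''                                   # made-up sample markup
sel = Selector(text=html)

print(sel.css('div a::text').extract())              # ['home', 'jobs']
print(sel.css('div a::attr(href)').extract())        # both href values
print(sel.css('a[href^="https"]::text').extract())   # ['home']  (href starts with https)
print(sel.css('a[href*="job"]::text').extract())     # ['jobs']  (href contains 'job')
print(sel.css('a[href$="cn"]::text').extract())      # ['home']  (href ends with cn)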

 
