Scrapy in practice: extracting values with the built-in XPath, CSS and re

The crawl target is Bole Online (blog.jobbole.com); the "latest posts" page lists all of its articles.

 
Generally, Scrapy extracts data with XPath or CSS selectors. Define def parse(self, response) in spiders/jobbole.py:
import scrapy
 
 
class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/']
 
    def parse(self, response):
        re_selector = response.xpath('//*[@id="post-110287"]/div[1]/h1/text()')
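To run the spider (assuming a standard project layout created with scrapy startproject), execute the crawl command with the name defined above from the project root:

scrapy crawl jobbole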

 

 
Note: Because jQuery can generate extra markup, the code shown in the page source may differ from the code after the page has loaded, so do not build the path level by level; it is best to locate by id, or else by class.
 
Tips:
1) When locating an element by class, press Ctrl+F in the F12 developer tools to check that the class name is unique on the page
2) The XPath of an element can also be copied directly by right-clicking it in the developer tools
 
 
 
 
I. Conventional XPath usage
 
1. The common rules are as follows
 
//      Selects descendant nodes from the current node; with nothing in front of it, it matches from the whole document
/       Selects direct child nodes of the current node
.       Selects the current node
..      Selects the parent of the current node
@       Selects an attribute
//*     Matches all nodes in the whole HTML text
 
 
Example 1
<html><body><div>
<ul>
<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div></body></html>
 
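The expressions in the rest of this section can be tested offline against this snippet with Scrapy's Selector class, without requesting a live page (a minimal sketch; html_doc holds the HTML above):

from scrapy.selector import Selector

html_doc = '''<html><body><div>
<ul>
<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div></body></html>'''

response = Selector(text=html_doc)  # supports .xpath() and .css() just like a response
print(response.xpath('//a[@href="link4.html"]/../@class').extract())  # ['item-1']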

 

1. Getting a parent node's attribute
First select the <a> node whose href attribute is link4.html, then move up to its parent node, then get the parent's class attribute:
result1 = response.xpath('//a[@href="link4.html"]/../@class')
We can also reach the parent node with parent::
result2 = response.xpath('//a[@href="link4.html"]/parent::*/@class')
Note:
//a matches every <a> node in the HTML; several of them have an href attribute, and the [] performs attribute matching, finding the <a> node whose href is link4.html.
 
 
2. Getting the text inside a node
Get the text of the <a> under the <li> nodes whose class is item-0:
result3 = response.xpath('//li[@class="item-0"]/a/text()')
The return value is ['fifth item']. Note that in the HTML above 'first item' is wrapped in a <span>, so a/text() does not reach it; use //li[@class="item-0"]/a//text() to include descendant text and get both items.

 

 
3. Getting attributes
Get the href attribute of the <a> nodes under all <li> nodes:
result4 = response.xpath('//li/a/@href')
The return value is ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']

 

 
4. Selecting by position
result = response.xpath('//li[1]/a/text()')               # select the first li node
result = response.xpath('//li[last()]/a/text()')          # select the last li node
result = response.xpath('//li[position() < 3]/a/text()')  # select li nodes at positions less than 3, i.e. nodes 1 and 2
result = response.xpath('//li[last() - 2]/a/text()')      # select the third-to-last li node

 

 
5. Node axes
1) Return all ancestor nodes of the first li node, including html, body, div and ul:
result = response.xpath('//li[1]/ancestor::*')

2) Return only the <div> ancestor of the first li node:
result = response.xpath('//li[1]/ancestor::div')

3) Return all attribute values of the first li node:
result = response.xpath('//li[1]/attribute::*')

4) Return the child nodes of the first li node, adding a condition to select only the <a> node whose href is link1.html:
result = response.xpath('//li[1]/child::a[@href="link1.html"]')

5) Return all descendant nodes of the first li node, with the condition of keeping only <span> nodes:
result = response.xpath('//li[1]/descendant::span')

6) The following axis returns all nodes after the current node; although * matches everything, the added index selects only the second following node, i.e. the <a> node inside the second <li>:
result = response.xpath('//li[1]/following::*[2]')

7) The following-sibling axis returns all sibling nodes after the current one, i.e. all subsequent <li> nodes:
result = response.xpath('//li[1]/following-sibling::*')

 

 
6. Multi-valued attribute matching
<li class="li li-first"><a href="link.html">first item</a></li>

result5 = response.xpath('//li[@class="li"]/a/text()')
This returns an empty result, because the li node's class attribute here has two values, li and li-first; plain attribute matching no longer works, and the contains() function is needed instead.

The correct way is:
result5 = response.xpath('//li[contains(@class, "li")]/a/text()')
In contains(), the first argument is the attribute name and the second is a value; the match succeeds as long as the named attribute contains that value.
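A quick check of both forms with an offline Selector (a sketch, same idea as the test harness above):

from scrapy.selector import Selector

sel = Selector(text='<li class="li li-first"><a href="link.html">first item</a></li>')
print(sel.xpath('//li[@class="li"]/a/text()').extract())             # [] -- exact match fails
print(sel.xpath('//li[contains(@class, "li")]/a/text()').extract())  # ['first item']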

 

 
7. Multi-attribute matching (node axes are not discussed further here; this is conventional XPath usage)
Sometimes several attributes are needed to determine a node; in that case we match multiple attributes and connect the conditions with the and operator:
from lxml import etree
text = '''
<li class = "li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result6 = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result6)
The li node here has two attributes, class and name; to select it, both conditions are connected with the and operator and placed inside the brackets as the filter.
 
 
 
 
 
II. Debug commands
Running the command below in cmd opens an interactive debugging command line; it requests the original article page, so commands tested successfully here can be used directly in the def parse() function.

Start debugging:
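The shell is started with scrapy shell followed by the page URL. The exact article URL is not given in the original; the id 110287 below is inferred from the //*[@id="post-110287"] XPath earlier, so substitute the real URL:

scrapy shell http://blog.jobbole.com/110287/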
1. Getting the article title
>>> title = response.xpath('//div[@class="entry-header"]/h1/text()')
>>> title
[<Selector xpath='//div[@class="entry-header"]/h1/text()' data='2016 Tencent software development interview questions (part)'>]
>>> title.extract()
['2016 Tencent software development interview questions (part)']
>>> title.extract()[0]
'2016 Tencent software development interview questions (part)'
>>> title.extract_first()
'2016 Tencent software development interview questions (part)'
Explanation
1) The extract() method converts a Selector object into a list
2) When extract() returns multiple values, extract()[1] takes the second one
3) extract_first() returns the first value as a string. extract_first(default='') returns the default value when nothing is matched
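For example, with a selector that matches nothing (the class name below is invented), extract()[0] would raise an IndexError, while extract_first() degrades gracefully:

>>> response.xpath('//div[@class="no-such-class"]/text()').extract_first()
>>> response.xpath('//div[@class="no-such-class"]/text()').extract_first(default='')
''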
 
 
2. Getting the publication date
>>> response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip().replace("·","").strip()
'2017/02/18'
 

 

3. The vote (praise) count. The <span> tag has many class names; pick one that looks unique, test it, and then use the contains() function to simplify the expression:
>>> response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()
['2']
>>> response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0]
'2'
>>> int(response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0])
2

 

 
4. The bookmark count, extracted with a regular expression (the re module; regex support is also built into Scrapy). Be sure to use a non-greedy match, otherwise only the 8 is captured, as shown after the code below.
>>> response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0]
' 28 收藏'
>>> string = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0]
>>> import re
>>> pattern = re.match(".*?(\d+).*", string)
>>> pattern.group(1)
'28'
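To see why the non-greedy .*? matters, compare the two patterns directly (收藏 means "bookmark"; the greedy .* backtracks only far enough for (\d+) to match a single digit):

>>> re.match(".*(\d+).*", ' 28 收藏').group(1)
'8'
>>> re.match(".*?(\d+).*", ' 28 收藏').group(1)
'28'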

Using the selector's built-in .re() method, it can be abbreviated as

>>> response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").re('.*?(\d+).*')
['28']
>>> response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").re('.*?(\d+).*')[0]
'28'

 

 

5. Use a list comprehension to keep only some of the tags, here extracting the words "workplace" and "interview". The same code also works for articles that have no comments tag.
Notice that the element we do not want is the one ending with "comments":
>>> response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
['workplace', '9 comments', 'interview']
>>> tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
>>> [element for element in tag_list if not element.strip().endswith("comments")]
['workplace', 'interview']
>>> tag_choose = [element for element in tag_list if not element.strip().endswith("comments")]
>>> tags = ",".join(tag_choose)
>>> tags
'workplace,interview'

 

Basic syntax of join(): 'sep'.join(seq), which combines all the elements of seq into a new string with sep as the separator.
sep is the separator and may be an empty string;
seq is the sequence to join and may be a list, tuple, string, or dictionary (its elements must be strings)
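A couple of quick examples:

>>> ",".join(["workplace", "interview"])
'workplace,interview'
>>> "/".join(("2017", "02", "18"))
'2017/02/18'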
 
 
 
 
III. CSS extraction methods
 
1. Several CSS selectors
 
a li
Selects all <li> nodes inside all <a> nodes
ul + p
Selects the first <p> element immediately after a <ul>; ul and p are siblings
div#container>ul
Selects <ul> elements that are direct children of the <div> whose id is container
ul ~ p
Selects all <p> siblings that follow a <ul>
a[title]
Selects all <a> elements that have a title attribute
a::attr(href)
Gets the href attribute value of all <a> elements (a Scrapy extension)
a[href="http://jobbole.com"]
Selects all <a> elements whose href value is exactly http://jobbole.com
a[href*="jobbole"]
Selects all <a> elements whose href value contains jobbole
a[href^="http"]
Selects all <a> elements whose href value begins with http
a[href$=".jpg"]
Selects all <a> elements whose href value ends with .jpg
input[type=radio]:checked
Selects the radio input element that is checked
div:not(#container)
Selects all <div> elements whose id is not container
li:nth-child(3)
Selects the <li> element that is the third child of its parent
tr:nth-child(2n)
Selects even-numbered <tr> elements
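These selectors can be verified offline with Scrapy's Selector class, just like the XPath examples (a sketch; the HTML snippet is invented for illustration):

from scrapy.selector import Selector

sel = Selector(text='<ul><li><a href="http://jobbole.com/a.jpg" title="pic">photo</a></li></ul>')
print(sel.css('a[href$=".jpg"]::attr(href)').extract())  # ['http://jobbole.com/a.jpg']
print(sel.css('a[title]::text').extract())               # ['photo']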
 
 
 
2. Extracting data with CSS in the scrapy shell
 
1) To extract the title, the CSS pseudo-element ::text is needed
>>> response.css(".entry-header h1").extract()
['<h1>2016 Tencent software development interview questions (part)</h1>']
>>> response.css(".entry-header h1::text").extract()[0]
'2016 Tencent software development interview questions (part)'
 

 

2) Article creation time
>>> response.css("p.entry-meta-hide-on-mobile::text").extract()[0].strip().replace(" ·","")
'2017/02/18'
Note: there is no space between p and the class name; p.entry-meta-hide-on-mobile means the <p> element whose class is entry-meta-hide-on-mobile
 
 
3) The vote (praise) count; CSS makes matching multi-valued class attributes convenient
>>> response.css(".vote-post-up h10::text").extract()[0]
'2'

 

 
4) The bookmark count; mind the backslash escapes in the regular expression
>>> response.css(".bookmark-btn::text").extract()[0]
' 28 收藏'
>>> string = response.css(".bookmark-btn::text").extract()[0]
>>> tag=re.match(".*?(\d+).*", string)
>>> tag.group(1)
'28'

 

In fact, since regex support is also built into Scrapy selectors, this can be abbreviated as follows

>>> response.css(".bookmark-btn::text").re('.*?(\d+).*')
['28']
>>> response.css(".bookmark-btn::text").re('.*?(\d+).*')[0]
'28'

 

 

 
5) Extracting the article content; the HTML formatting is generally extracted along with it
response.css("div.entry").extract()[0]

 

 
6) Getting the "workplace", "9 comments", "interview" words
>>> response.css("p.entry-meta-hide-on-mobile a::text").extract()
['workplace', '9 comments', 'interview']
