The crawl target is Jobbole Online at blog.jobbole.com; the "latest posts" page lists every article.
In Scrapy, data is usually extracted with XPath or CSS selectors inside the parse(self, response) method defined in spiders/jobbole.py:
import scrapy

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/']

    def parse(self, response):
        re_selector = response.xpath('//*[@id="post-110287"]/div[1]/h1/text()')
Note: jQuery can inject extra markup at runtime, so the code shown in "view source" may differ from the DOM after the page loads. Rather than writing a long step-by-step hierarchy path, prefer locating by id, or failing that by class.
Tips:
1) When locating by class, open the developer tools with F12 and use Ctrl+F to check that the class name is unique on the page
2) You can also right-click a node in the developer tools and copy its XPath directly
I. Common XPath usage
1. The common rules are as follows:
// selects descendant nodes of the current node; at the start of an expression it selects from the whole document
/ selects direct children of the current node
. selects the current node
.. selects the parent of the current node
@ selects an attribute
//* selects every node in the HTML document
Example 1
<html><body><div>
<ul>
<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div></body></html>
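The XPath examples in this section call response.xpath as they would run inside a spider or the scrapy shell. To experiment outside Scrapy, the same expressions can be run against this sample HTML with lxml (the library used later in this article); a minimal sketch:

```python
from lxml import etree

# The sample HTML from above, as a string
html_text = '''<html><body><div><ul>
<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div></body></html>'''

# etree.HTML parses the text into a tree that supports .xpath() queries
html = etree.HTML(html_text)

# Every href under the li nodes, in document order
print(html.xpath('//li/a/@href'))
# -> ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
```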
1. Getting a parent node's attribute
First select the <a> node whose href is link4.html, step up to its parent, then read the parent's class attribute:
result1 = response.xpath('//a[@href="link4.html"]/../@class')
The parent can also be reached through the parent:: axis:
result2 = response.xpath('//a[@href="link4.html"]/parent::*/@class')
Note: //a matches every <a> node in the document, and most of them carry an href attribute; the brackets [] perform attribute matching, here picking out the node whose href is link4.html.
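A self-contained check, again with lxml, that the two spellings reach the same parent attribute; the tiny HTML fragment here is made up for the demo:

```python
from lxml import etree

html = etree.HTML('<ul><li class="item-1"><a href="link4.html">fourth item</a></li></ul>')

# ".." and parent::* land on the same <li>, so both reads agree
via_dots = html.xpath('//a[@href="link4.html"]/../@class')
via_axis = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(via_dots, via_axis)  # ['item-1'] ['item-1']
```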
2. Getting a node's text
Get the text of the <a> nodes inside the <li> nodes whose class is item-0:
result3 = response.xpath('//li[@class="item-0"]/a/text()')
which returns ['first item', 'fifth item']. (Strictly, in the sample HTML above the first item's text sits inside a <span>, so a/text() would only yield 'fifth item'; select a/span/text() to reach the first one.)
3. Getting attributes
Get the href attribute of the <a> nodes under all <li> nodes:
result4 = response.xpath('//li/a/@href')
which returns ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
4. Selecting by position
result = response.xpath('//li[1]/a/text()')              # select the first li node
result = response.xpath('//li[last()]/a/text()')         # select the last li node
result = response.xpath('//li[position() < 3]/a/text()') # select li nodes whose position is less than 3, i.e. nodes 1 and 2
result = response.xpath('//li[last() - 2]/a/text()')     # select the third li node from the end
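The positional predicates above, verified on a made-up four-item list with lxml:

```python
from lxml import etree

html = etree.HTML(
    '<ul><li><a>one</a></li><li><a>two</a></li>'
    '<li><a>three</a></li><li><a>four</a></li></ul>'
)

print(html.xpath('//li[1]/a/text()'))               # ['one']   first node
print(html.xpath('//li[last()]/a/text()'))          # ['four']  last node
print(html.xpath('//li[position() < 3]/a/text()'))  # ['one', 'two']
print(html.xpath('//li[last() - 2]/a/text()'))      # ['two']   third from the end
```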
5. Node axes
1) Return all ancestor nodes of the first <li>, including html, body, div and ul:
result = response.xpath('//li[1]/ancestor::*')
2) Return only the <div> ancestor of the first <li>:
result = response.xpath('//li[1]/ancestor::div')
3) Return all attribute values of the first <li>:
result = response.xpath('//li[1]/attribute::*')
4) Return the children of the first <li>, adding a condition to keep only the <a> node whose href is link1.html:
result = response.xpath('//li[1]/child::a[@href="link1.html"]')
5) Return all descendant nodes of the first <li>, adding the condition that only <span> nodes are kept:
result = response.xpath('//li[1]/descendant::span')
6) The following axis returns every node after the current one; although * matches everything, the index [2] keeps only the second following node, i.e. the <a> inside the second <li>:
result = response.xpath('//li[1]/following::*[2]')
7) following-sibling returns all later siblings of the current node, i.e. all subsequent <li> nodes:
result = response.xpath('//li[1]/following-sibling::*')
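A few of these axes exercised against a small made-up fragment with lxml; note that etree.HTML wraps the fragment in <html><body>, which is why those tags show up among the ancestors:

```python
from lxml import etree

html = etree.HTML(
    '<div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '</ul></div>'
)

# Ancestors of the first <li>, in document order
print([node.tag for node in html.xpath('//li[1]/ancestor::*')])
# -> ['html', 'body', 'div', 'ul']

# child:: restricted by an attribute condition
print(html.xpath('//li[1]/child::a[@href="link1.html"]/text()'))  # ['first item']

# All later siblings of the first <li>
print([node.get('class') for node in html.xpath('//li[1]/following-sibling::*')])
# -> ['item-1']
```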
6. Matching multi-valued attributes
<li class="li li-first"><a href="link.html">first item</a></li>
result5 = response.xpath('//li[@class="li"]/a/text()')
returns empty, because this <li> has two class values, li and li-first, and the exact attribute match used before no longer applies. The contains() function handles this correctly:
result5 = response.xpath('//li[contains(@class, "li")]/a/text()')
contains() takes the attribute name as its first argument and the attribute value as its second; any node whose attribute contains that value is matched.
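The failure and the fix side by side, runnable with lxml (keep in mind that contains() is a plain substring test, so contains(@class, "li") would also match a class such as "liquid"):

```python
from lxml import etree

html = etree.HTML('<ul><li class="li li-first"><a href="link.html">first item</a></li></ul>')

# Exact match fails: the attribute value is the full string "li li-first"
print(html.xpath('//li[@class="li"]/a/text()'))             # []
# contains() succeeds, since "li li-first" contains "li"
print(html.xpath('//li[contains(@class, "li")]/a/text()'))  # ['first item']
```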
7. Matching multiple attributes (only the common XPath usage is covered here)
Sometimes several attributes are needed to pin down a node; multiple attribute conditions can be joined with the and operator:
from lxml import etree

text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result6 = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result6)
Here the <li> node has both a class and a name attribute; the two conditions are joined with the and operator and placed together inside the brackets.
II. Debugging in the scrapy shell
The extraction below was already confirmed working in the parse function of the original code; the same page can be tested interactively. Run the following in cmd to enter the debug command line:
scrapy shell http://blog.jobbole.com/110287
Debugging can then begin.
1. Obtain the article title
>>> title = response.xpath('//div[@class="entry-header"]/h1/text()')
>>> title
[<Selector xpath='//div[@class="entry-header"]/h1/text()' data='2016 Tencent software development interview questions (part)'>]
>>> title.extract()
['2016 Tencent software development interview questions (part)']
>>> title.extract()[0]
'2016 Tencent software development interview questions (part)'
>>> title.extract_first()
'2016 Tencent software development interview questions (part)'
Explanation
1) extract() converts data of Selector type into a list of plain values
2) when extract() returns several values, extract()[1] takes the second one
3) extract_first() returns the first value as a string; extract_first(default='') returns the default value when nothing is matched
2. Get the publication date
>>> response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip().replace("·","").strip()
'2017/02/18'
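The strip/replace chain above, traced on a made-up raw string (the real page surrounds the date with whitespace and a '·' separator):

```python
raw = '\r\n            2017/02/18 ·  '

# strip() removes the outer whitespace, replace() drops the separator,
# and the second strip() cleans up the space the separator leaves behind
date = raw.strip().replace('·', '').strip()
print(date)  # 2017/02/18
```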
3. Upvote count. The <span> tag carries many class names; pick one that looks unique, test it, then use the contains() function to simplify the expression
>>> response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()
['2']
>>> response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0]
'2'
>>> int(response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0])
2
4. Bookmark count. Use a regular expression; the re module is also built into Scrapy. Note the non-greedy match, otherwise only the trailing '8' of '28' would be captured
>>> response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0]
' 28 收藏'
>>> string = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0]
>>> import re
>>> pattern = re.match(".*?(\d+).*", string)
>>> pattern.group(1)
'28'
It can be abbreviated as
>>> response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").re('.*?(\d+).*')
['28']
>>> response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").re('.*?(\d+).*')[0]
'28'
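The greedy vs non-greedy distinction that the note above warns about can be checked in plain Python:

```python
import re

string = ' 28 收藏'  # sample of the scraped text

# Non-greedy ".*?" gives up characters as early as possible, so the
# digit group starts at the first digit and captures the whole number
print(re.match(r'.*?(\d+).*', string).group(1))  # '28'

# Greedy ".*" swallows as much as it can and backtracks only enough
# for the group to match one digit, so only the trailing '8' is captured
print(re.match(r'.*(\d+).*', string).group(1))   # '8'
```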
5. Use a list comprehension to keep only some of the tags; below, the words "workplace" and "interview" are extracted. This also covers articles that have no comment tag.
Keep the elements that do not end with "comments":
>>> response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
['workplace', ' 9 comments', 'interview']
>>> tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
>>> [element for element in tag_list if not element.strip().endswith("comments")]
['workplace', 'interview']
>>> tag_choose = [element for element in tag_list if not element.strip().endswith("comments")]
>>> tags = ",".join(tag_choose)
>>> tags
'workplace,interview'
Basic syntax of join(): 'sep'.join(seq), which combines all the elements of seq into a new string separated by sep.
sep is the separator and may be an empty string;
seq is the data to be joined and may be a list, tuple, dictionary, or string
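The filter-then-join pattern from the shell session, as a plain script with the scraped values replaced by sample strings:

```python
tag_list = ['workplace', ' 9 comments', 'interview']

# Keep every element that does not end with "comments" once stripped
tag_choose = [element for element in tag_list if not element.strip().endswith('comments')]

# join() glues the survivors together with the separator string
tags = ','.join(tag_choose)
print(tags)  # workplace,interview
```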
III. CSS extraction
1. Common CSS selectors
li a - selects all <a> nodes inside <li> nodes
ul + p - selects the first <p> element directly after a <ul> (they are siblings)
div#container > ul - selects the <ul> direct children of the <div> whose id is container
ul ~ p - selects all <p> siblings that follow a <ul>
a[title] - selects all <a> elements that have a title attribute
a::attr(href) - gets the href attribute value of all <a> elements
a[href="http://jobbole.com"] - selects all <a> elements whose href is exactly http://jobbole.com
a[href*="jobbole"] - selects all <a> elements whose href contains jobbole
a[href^="http"] - selects all <a> elements whose href starts with http
a[href$=".jpg"] - selects all <a> elements whose href ends with .jpg
input[type=radio]:checked - selects the radio input that is checked
div:not(#container) - selects all <div> elements whose id is not container
li:nth-child(3) - selects the third child element if it is an <li>
tr:nth-child(2n) - selects the even-numbered <tr> elements
2. Using CSS to extract data in the scrapy shell
scrapy shell http://blog.jobbole.com/110287
1) Extract the title; the CSS pseudo-element ::text is needed
>>> response.css(".entry-header h1").extract()
['<h1>2016 Tencent software development interview questions (part)</h1>']
>>> response.css(".entry-header h1::text").extract()[0]
'2016 Tencent software development interview questions (part)'
2) Article creation time
>>> response.css("p.entry-meta-hide-on-mobile::text").extract()[0].strip().replace(" ·","")
'2017/02/18'
Note: there is no space between p and the class name; p.entry-meta-hide-on-mobile means a <p> element whose class is entry-meta-hide-on-mobile
3) Upvote count; CSS makes matching multi-valued class attributes convenient
>>> response.css(".vote-post-up h10::text").extract()[0]
'2'
4) Bookmark count; mind the backslashes when escaping in the regular expression
>>> response.css(".bookmark-btn::text").extract()[0]
' 28 收藏'
>>> string = response.css(".bookmark-btn::text").extract()[0]
>>> tag = re.match(".*?(\d+).*", string)
>>> tag.group(1)
'28'
Since the re functionality is also built into Scrapy's selectors, this can be abbreviated as follows
>>> response.css(".bookmark-btn::text").re('.*?(\d+).*')
['28']
>>> response.css(".bookmark-btn::text").re('.*?(\d+).*')[0]
'28'
5) Extract the article body; it is generally taken out together with its HTML formatting
response.css("div.entry").extract()[0]
6) Get the tag words: workplace, comments, interview
>>> response.css("p.entry-meta-hide-on-mobile a::text").extract()
['workplace', ' 9 comments', 'interview']