Basic use of XPath
I. Install the lxml package
pip install lxml
II. Usage
1. Basic usage:
from lxml import etree  # import the package
import requests

response = requests.get('https://www.baidu.com')
# Build an HTML object
# html = etree.parse(html_file, etree.HTMLParser())  # parameter: an HTML file
html = etree.HTML(response.text)  # parameter: a string of HTML text
div = html.xpath('XPath expression')  # returns a list
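To try the parse-then-query flow without a network request, here is a minimal sketch. The SAMPLE_HTML markup and the "post" class name are made up for illustration and stand in for response.text.

```python
from lxml import etree

# Hypothetical sample markup standing in for response.text
SAMPLE_HTML = """
<html><body>
  <div class="post"><p>first</p></div>
  <div class="post"><p>second</p></div>
</body></html>
"""

html = etree.HTML(SAMPLE_HTML)  # parse a string into an element tree

# An expression that selects elements returns a list of elements
posts = html.xpath('//div[@class="post"]')

# An expression that ends in text() returns a list of strings
paragraphs = html.xpath('//div[@class="post"]/p/text()')

print(len(posts))
print(paragraphs)
```

Each element in posts can itself be queried with a relative expression that starts with a dot, for example posts[0].xpath('./p/text()').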
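1. Get the outermost tag, traverse all of its child tags, and read the tag text
content_list = div.xpath('.//div[@class="d_post_content j_d_post_content "]/text()').extract()
# .extract() is the Scrapy selector call; with lxml, xpath() already returns a list of strings, so omit it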
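2. Strip all tags with a regular expression: compile <.*?> with re.compile() and call sub()
content_list = div.xpath('.//div[@class="d_post_content j_d_post_content "]')
pattern = re.compile(r'<.*?>', re.S)
# sub() takes a string, so serialize the first matched element before stripping tags
content = pattern.sub('', etree.tostring(content_list[0], encoding='unicode'))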
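The regular-expression approach can be sketched end to end as follows. This is only an illustration: the sample markup and the "post" class name are assumptions, and the element is serialized with etree.tostring before the pattern is applied.

```python
import re
from lxml import etree

# Hypothetical sample markup
SAMPLE_HTML = '<div class="post">Hello <b>bold</b> and <a href="#">link</a> text</div>'

root = etree.HTML(SAMPLE_HTML)
node = root.xpath('//div[@class="post"]')[0]

# Serialize the element to a str; with_tail=False leaves out any text
# that follows the element's closing tag
raw_html = etree.tostring(node, encoding='unicode', with_tail=False)

# Non-greedy match of anything between < and >; re.S lets . span newlines
tag_pattern = re.compile(r'<.*?>', re.S)
stripped = tag_pattern.sub('', raw_html)

print(stripped)
```

Note that the serialized markup keeps characters such as & escaped (as &amp;), so text recovered this way may still need unescaping.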
3. /text() gets only the tag's own text; //text() gets the text of the tag and of all its descendant tags
content_list = div.xpath('.//div[@class="d_post_content j_d_post_content "]//text()').extract()
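The difference between /text() and //text() is easiest to see on a small fragment. The markup below is a made-up example, not taken from any real page.

```python
from lxml import etree

# Hypothetical sample markup: text directly in the div plus text in a child span
SAMPLE_HTML = '<div class="post">outer <span>inner</span> tail</div>'
node = etree.HTML(SAMPLE_HTML).xpath('//div[@class="post"]')[0]

direct = node.xpath('./text()')   # text nodes that are direct children of the div
nested = node.xpath('.//text()')  # text nodes at any depth below the div

print(direct)
print(nested)
```

Here direct contains only the two pieces of text that sit directly inside the div, while nested also includes the text of the span, in document order.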
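4. Use xpath('string(.)') to get all of the text already joined into a single string
content_list=div.xpath('.//div[@class="d_post_content j_d_post_content "]').xpath('string(.)').extract()[0]+'\n'
# This chaining is Scrapy selector syntax; with lxml, call xpath('string(.)') on a single element and it returns the string directly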
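With plain lxml, the same idea might look like the sketch below. The sample markup is hypothetical; the point is that string(.) is evaluated on one element and yields one string rather than a list.

```python
from lxml import etree

# Hypothetical sample markup
SAMPLE_HTML = '<div class="post">outer <span>inner</span> tail</div>'
node = etree.HTML(SAMPLE_HTML).xpath('//div[@class="post"]')[0]

# string(.) concatenates every descendant text node of the current element
joined = node.xpath('string(.)')

print(repr(joined + '\n'))
```

Because the result is already a single string, no join step or [0] index is needed.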
After extracting the text, call print(content_list) to inspect it. The formatting can then be cleaned up as follows:
remove = re.compile(r'\s')
content = ''
for string in content_list:
    string = remove.sub('', string)
    content += string
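The cleanup loop can be tried on a few made-up fragments, as in the sketch below. The content_list values are assumptions that imitate what //text() tends to return. The second variant, which collapses whitespace instead of deleting it, is an alternative added here for comparison and is not part of the original method.

```python
import re

# Hypothetical fragments, similar to what //text() might return
content_list = ['\n   Hello ', 'wor', 'ld\t', '\r\n']

# Variant 1: delete every whitespace character, then concatenate
remove = re.compile(r'\s')
content = ''.join(remove.sub('', s) for s in content_list)

# Variant 2 (alternative): concatenate first, then collapse each
# whitespace run to a single space so that word boundaries survive
collapsed = re.sub(r'\s+', ' ', ''.join(content_list)).strip()

print(content)
print(collapsed)
```

Deleting all whitespace suits text without spaces between words, such as Chinese; for space-separated languages the collapsing variant is usually the safer choice.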
The string() method: content = div.xpath('string(.//div[@class="content"])').strip()  # gets all the text inside this div as a single string
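One detail worth knowing when a path is passed to string(): in XPath 1.0, if the path matches several nodes, only the first one in document order is converted. The sketch below uses made-up markup with two matching divs to show this, and loops over the matches when the text of all of them is wanted.

```python
from lxml import etree

# Hypothetical sample markup with two divs of the same class
SAMPLE_HTML = """
<div id="wrap">
  <div class="content">  first <b>block</b>  </div>
  <div class="content">second block</div>
</div>
"""
wrap = etree.HTML(SAMPLE_HTML).xpath('//div[@id="wrap"]')[0]

# string(path) converts only the first matching node to a string
first_only = wrap.xpath('string(.//div[@class="content"])').strip()

# To get every match, select the elements and call string(.) on each one
every = [d.xpath('string(.)').strip()
         for d in wrap.xpath('.//div[@class="content"]')]

print(first_only)
print(every)
```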