Basic use of XPath


I. Install the lxml package

pip install lxml

II. Use

1. Use:

from lxml import etree   # import the package
import requests

response = requests.get('https://www.baidu.com')
# Build an HTML object
# html = etree.parse(html_file)    # parameter: an HTML document (file)
html = etree.HTML(response.text)   # parameter: a string of HTML text
div = html.xpath('XPath expression')   # returns a list
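The same flow can be tried offline. The following is only a minimal sketch: the HTML string and the class name "post" are made up for illustration and stand in for response.text.

```python
from lxml import etree

# Hypothetical HTML snippet, standing in for response.text
SAMPLE_HTML = """
<html><body>
  <div class="post">first</div>
  <div class="post">second</div>
</body></html>
"""

html = etree.HTML(SAMPLE_HTML)              # parse a string into an element tree
divs = html.xpath('//div[@class="post"]')   # list of matching elements
texts = [d.text for d in divs]              # .text is the text directly inside each element
print(len(divs), texts)
```

Each item in the list is an element, so a further relative query such as divs[0].xpath('./text()') can be run on it.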





 

1. Select the outermost tag, traverse all of its inner child tags, and read each tag's text

content_list = div.xpath('.//div[@class="d_post_content j_d_post_content "]/text()')   # lxml already returns a list of strings; .extract() is only needed on Scrapy selectors

2. Use a regular expression to strip all tags: <.*?> with re.compile(...).sub()

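import re
content_list = div.xpath('.//div[@class="d_post_content j_d_post_content "]')
pattern = re.compile(r'<.*?>', re.S)
# sub() needs a string, so serialize the first matched element to HTML first
content = pattern.sub('', etree.tostring(content_list[0], encoding='unicode'))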
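To see what the tag-stripping pattern does in isolation, here is a small sketch that uses only the standard library. The raw_html string is a made-up stand-in for one serialized element.

```python
import re

# Hypothetical serialized HTML for one matched element
raw_html = '<div class="c">hello <a href="#">link</a><br/>\nworld</div>'

pattern = re.compile(r'<.*?>', re.S)   # non-greedy, so each match is a single tag
content = pattern.sub('', raw_html)    # drop every tag, keep the text between them
print(repr(content))
```

The pattern only deletes the tags themselves; entities such as &amp; are left as they are, so this approach is best treated as a quick cleanup rather than a full HTML-to-text conversion.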

3. /text() gets the tag's own text; //text() gets the text of the tag and of all its child tags

content_list = div.xpath('.//div[@class="d_post_content j_d_post_content "]//text()')   # a list of strings in lxml; no .extract() needed
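The difference between /text() and //text() is easiest to see side by side. A minimal sketch, assuming a made-up div that holds its own text plus one nested tag:

```python
from lxml import etree

# Hypothetical markup: a div with its own text plus a nested <a> tag
html = etree.HTML('<div class="c">hello <a href="#">link</a> world</div>')
div = html.xpath('//div[@class="c"]')[0]

own_text = div.xpath('./text()')    # text nodes that are direct children of the div
all_text = div.xpath('.//text()')   # text nodes of the div and of every descendant
print(own_text)
print(all_text)
```

Here own_text skips the word inside the <a> tag, while all_text includes it in document order.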

4. Use xpath('string(.)'), which returns all of the text already joined into a single string

content_list = div.xpath('.//div[@class="d_post_content j_d_post_content "]')[0].xpath('string(.)') + '\n'   # index the element list first; lxml has no .extract()
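A small sketch of string(.) on a single element, again with made-up markup. With lxml the first query returns a list of elements, so one element is picked out before string(.) is evaluated on it.

```python
from lxml import etree

# Hypothetical markup for illustration
html = etree.HTML('<div class="c">hello <a href="#">link</a> world</div>')

matches = html.xpath('//div[@class="c"]')        # list of elements
content = matches[0].xpath('string(.)') + '\n'   # one string: own text plus descendant text
print(repr(content))
```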

After extracting the text, call print(content_list) to inspect it. The whitespace can then be cleaned up as follows:

remove = re.compile(r'\s')
content = ''
for string in content_list:
    string = remove.sub('', string)
    content += string
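The cleanup loop can be tried on its own with the standard library. In this sketch the content_list values are invented to resemble the output of a //text() query.

```python
import re

# Hypothetical result of a //text() query, with stray whitespace
content_list = ['\n   hello ', 'link', '\t world\n']

remove = re.compile(r'\s')   # matches any single whitespace character
content = ''.join(remove.sub('', s) for s in content_list)
print(content)
```

Note that r'\s' also removes the spaces between words. That suits text without word spacing; for space-separated text, collapsing runs of whitespace with something like re.sub(r'\s+', ' ', s).strip() may be a better fit.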

 

The string() method:

content = div.xpath('string(.//div[@class="content"])').strip()   # gets all the text inside this div as a single string
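One detail worth knowing when string() is given a path: in XPath 1.0 it converts only the first matched node, in document order. A minimal sketch with made-up markup (the id "outer" is an assumption for illustration):

```python
from lxml import etree

# Hypothetical markup: two divs share the class "content"
html = etree.HTML(
    '<div id="outer">'
    '<div class="content">  first <b>block</b>  </div>'
    '<div class="content">second block</div>'
    '</div>'
)
outer = html.xpath('//div[@id="outer"]')[0]

# string() takes the first matching div only; strip() trims the surrounding whitespace
content = outer.xpath('string(.//div[@class="content"])').strip()
print(content)
```

To collect the text of every matching div, query the elements first and call xpath('string(.)') on each one.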


Origin www.cnblogs.com/Deaseyy/p/11266786.html