Scraping 58.com second-hand housing listings with XPath

First, some background.
XPath parsing: the most commonly used, most convenient, and most efficient way to parse scraped pages, and it is widely applicable.
- How XPath parsing works:
    - 1. Instantiate an etree object and load the page source to be parsed into it.
    - 2. Call the etree object's xpath method with an XPath expression to locate tags and capture their content.
- Environment setup:
    - pip install lxml
- How to instantiate an etree object: from lxml import etree
    - 1. Load the source of a local HTML document into the etree object:
        etree.parse(filePath)
      (etree.parse uses an XML parser by default; pass etree.HTMLParser() as the second argument for HTML that is not well-formed XML)
    - 2. Load source data fetched from the Internet into the object:
        etree.HTML(page_text)
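The two loading paths above can be sketched as follows. This is a minimal illustration, not the original author's code: the sample markup and the temporary file are assumptions made up for the demo.

```python
import os
import tempfile
from lxml import etree

# Hypothetical sample markup, used only for illustration.
page_text = '<html><body><div class="song"><p>one</p><p>two</p></div></body></html>'

# Way 2: source fetched from the Internet (here just a string in memory).
tree_from_string = etree.HTML(page_text)

# Way 1: source stored in a local HTML file (a temporary file stands in for it).
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as f:
    f.write(page_text)
    file_path = f.name
# HTMLParser tolerates real-world HTML that the default XML parser would reject.
tree_from_file = etree.parse(file_path, etree.HTMLParser())
os.remove(file_path)

# Both objects expose the same xpath method.
string_paragraphs = tree_from_string.xpath('//p/text()')
file_paragraphs = tree_from_file.xpath('//p/text()')
print(string_paragraphs, file_paragraphs)
```

Either way, the result supports xpath(), so the rest of the scraping code does not need to care where the source came from.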
- xpath('XPath expression') (the key part):
    - /: at the start of an expression it locates from the root node; between steps it means one level.
    - //: means multiple levels; matching can start from any position.
    - Attribute positioning:
        - //div[@class="song"], general form tag[@attrName="attrValue"]
    - Index positioning (indexes start from 1):
        - //div[@class="song"]/p[3]
    - Getting text:
        - /text() gets the tag's direct text content
        - //text() gets the non-direct text content as well (all text beneath the tag)
    - Getting attributes:
        - /@attrName, e.g. img/@src
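To make the expression rules above concrete, here is a small sketch run against a made-up page. The markup and variable names are assumptions for illustration only; the expressions are the ones listed above.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
page_text = '''
<html><body>
  <div class="song">
    <p>first</p>
    <p>second</p>
    <p>third<b>bold</b></p>
    <img src="cover.jpg"/>
  </div>
</body></html>
'''
tree = etree.HTML(page_text)

# / from the root, one level per step
root_divs = tree.xpath('/html/body/div')
# // from any position, combined with attribute positioning
song_divs = tree.xpath('//div[@class="song"]')
# Index positioning: p[3] is the third <p>, because indexes start from 1
third_p = tree.xpath('//div[@class="song"]/p[3]')[0]
# /text() returns only the direct text of the tag
direct_text = tree.xpath('//div[@class="song"]/p[3]/text()')
# //text() also returns the text of nested tags
all_text = tree.xpath('//div[@class="song"]/p[3]//text()')
# /@attrName returns attribute values
img_src = tree.xpath('//div[@class="song"]/img/@src')

print(direct_text, all_text, img_src)
```

Note that xpath() always returns a list, even for a single match, which is why results are usually indexed with [0].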
- Two ways to fix garbled Chinese text:
    - img_name.encode('ISO-8859-1').decode('GBK')
    - response.encoding = 'utf-8'
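The first fix can be tried without any network access. The sketch below assumes the page was really encoded as GBK but was decoded as ISO-8859-1, which is what requests typically falls back to for text responses that declare no charset. The sample string is made up for the demo.

```python
# Hypothetical sample text, used only for illustration.
original = '二手房'

# Simulate the problem: GBK bytes wrongly decoded as ISO-8859-1.
raw_bytes = original.encode('gbk')
garbled = raw_bytes.decode('iso-8859-1')

# Fix 1: undo the wrong decoding, then decode with the real encoding.
fixed = garbled.encode('iso-8859-1').decode('gbk')
print(fixed)

# Fix 2 (shown as a comment because it needs a live response object):
# response.encoding = 'utf-8'   # set this before reading response.text
```

Fix 1 works on individual strings after the fact, while fix 2 corrects the whole response up front, provided the page really is UTF-8.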
My own code follows.


```python
import requests
from lxml import etree

if __name__ == '__main__':
    # Fetch the page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36'
    }
    url = 'https://bj.58.com/ershoufang/?PGTID=0d100000-0000-16e7-6cab-37d675ef4780&ClickID=2'
    page_text = requests.get(url=url, headers=headers).text
    # Instantiate the etree object
    tree = etree.HTML(page_text)
    # Locate the <li> tag of each listing
    li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')
    # The with block closes the file even if an error occurs
    with open('./58.text', 'w', encoding='utf-8') as fp:
        for li in li_list:
            title = li.xpath('./div[2]/h2/a/text()')
            price = li.xpath('./div[3]/p[@class="sum"]/b/text()')
            # Skip items that lack a title or a price
            if not title or not price:
                continue
            print(title[0])
            print(price[0] + '万')
            # Separate title and price with a tab, one listing per line
            fp.write(title[0] + '\t' + price[0] + '万\n')
```

Friends who are looking to buy a house, take a look, haha!


Origin blog.csdn.net/qwerty1372431588/article/details/106086470