[python] Crawler notes (5): XPath data parsing

Focused crawler

Crawls specified content from a page.
Workflow:
specify the URL - initiate a request - obtain the response data - parse the data - persist the result
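The five-step workflow above can be sketched as one small function. This is a minimal illustration, not the post's own code: the URL, the output filename, and the choice of extracting `<title>` text are all placeholder assumptions.

```python
import requests
from lxml import etree

def crawl(url: str, out_path: str) -> list:
    """Run the full workflow: request -> response -> parse -> persist."""
    page_text = requests.get(url, timeout=10).text    # initiate request, obtain response data
    tree = etree.HTML(page_text)                      # load the page source into an etree object
    titles = tree.xpath("//title/text()")             # data parsing with an XPath expression
    with open(out_path, "w", encoding="utf-8") as f:  # persistent storage
        f.write("\n".join(titles))
    return titles

# usage (hypothetical URL): crawl("https://example.com/", "result.txt")
```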

Data parsing methods

  • Regular expressions
  • bs4 (BeautifulSoup)
  • xpath

Principle of data parsing: first locate the target tags, then extract the text or attribute values stored in them.

xpath parsing:

  • The most commonly used, convenient, and efficient parsing method
  • Instantiate an etree object and load the page source code into it
  • Call the etree object's xpath method with XPath expressions to locate tags and capture data

First:
pip install lxml
from lxml import etree

  • xpath('xpath expression')
    • tree = etree.parse('page.html') parses a local HTML file; use etree.HTML(page_text) when the source code is held in a string
    • tree.xpath('/html/head/title')
      • / positions from the root node, one level at a time; // spans multiple levels and can also mean "start positioning from anywhere in the document"
      • r = tree.xpath('//div[@class="sh"]/p[3]') attribute positioning: tag[@attrName="attrValue"]
      • Get text: /text() or //text() (the latter also returns text from all descendants)
      • Take an attribute value: /@attributeName, e.g. for <img src="baidu.com">, use //img/@src
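The expressions above can be tried against a small inline HTML string. The class name "sh" comes from the example above; the rest of the fragment is invented for illustration.

```python
from lxml import etree

html = """
<html>
  <body>
    <div class="sh">
      <p>first</p><p>second</p><p>third</p>
      <a href="https://www.baidu.com"><img src="logo.png"/></a>
    </div>
  </body>
</html>
"""
tree = etree.HTML(html)

print(tree.xpath("/html/body/div/p[1]/text()"))      # level by level from the root: ['first']
print(tree.xpath("//p/text()"))                      # // matches at any depth: ['first', 'second', 'third']
print(tree.xpath('//div[@class="sh"]/p[3]/text()'))  # attribute positioning: ['third']
print(tree.xpath('//div[@class="sh"]//text()'))      # all text under the div (includes whitespace nodes)
print(tree.xpath("//img/@src"))                      # take an attribute value: ['logo.png']
```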
 
from lxml import etree
import requests

if __name__ == "__main__":
    url = 'https://sh.58.com/ershoufang/'

    # Disguise the request as a normal browser visit (UA camouflage)
    ua = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3775.400 QQBrowser/10.6.4209.400"
    headers = {
        "User-Agent": ua,
    }

    # Fetch the page source
    page_text = requests.get(url=url, headers=headers).text

    # Load the source into an etree object and locate every listing <li>
    tree = etree.HTML(page_text)
    li_list = tree.xpath('/html/body/div[5]/div[5]/div[1]/ul/li')
    for li in li_list:
        # './' restricts the search to the current <li> element
        print(li.xpath('./div[2]/h2/a/text()'))
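The relative-XPath pattern used in the loop above, plus the final "persistent storage" step, can be shown offline. The HTML fragment below only imitates the li/div[2]/h2/a layout of the 58.com listing page; the real page structure may differ, and the output filename is an arbitrary choice.

```python
from lxml import etree

page = """
<ul>
  <li><div></div><div><h2><a>house A</a></h2></div></li>
  <li><div></div><div><h2><a>house B</a></h2></div></li>
</ul>
"""
tree = etree.HTML(page)

titles = []
for li in tree.xpath('//ul/li'):
    # './' makes the expression relative to the current <li> element
    titles.extend(li.xpath('./div[2]/h2/a/text()'))

# Persistent storage: write one title per line
with open('titles.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(titles))

print(titles)  # → ['house A', 'house B']
```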


Origin blog.csdn.net/Sgmple/article/details/112059825