Python crawler: XPath parsing

XPath parsing

XPath expressions

Requires the lxml library (pip install lxml)

An XPath expression is the location syntax used to select specific content under a target HTML node. It is mainly built from three parts: path expressions, predicates, and wildcards.

Path expressions

// Select matching descendant nodes at any depth below the current node
/ Select direct child nodes of the current node
. Select the current node
.. Select the parent node of the current node
@ Select attributes
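
As a minimal sketch of the five symbols above, the snippet below runs them against a small hand-written HTML string. The markup and the variable names are made up for illustration and do not come from any real site.

```python
from lxml import etree

# A tiny hand-written page, used only for illustration.
html = etree.HTML("""
<div id="box">
  <ul>
    <li><a href="/a">A</a></li>
    <li><a href="/b">B</a></li>
  </ul>
</div>
""")

ul = html.xpath('//ul')[0]        # // : search at any depth
children = ul.xpath('./li')       # .  : current node, / : its direct children
parent = ul.xpath('..')[0]        # .. : parent of the current node
links = ul.xpath('.//a/@href')    # @  : select an attribute

print(len(children))                  # 2
print(parent.tag, parent.get('id'))   # div box
print(links)                          # ['/a', '/b']
```

Note that xpath() always returns a list, even when only one node matches, which is why the first result is taken with [0].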

Predicates

A predicate selects a sibling node by its position, or a node with a given attribute value (the attribute form is used together with @).

[order] The sibling node at position order (counting starts at 1)
[last()] The last sibling node
[last()-order] The sibling node that is order positions before the last one; order is a number you choose (for example, [last()-1] is the second-to-last node)
[@class="dxd"] The node whose class attribute is dxd
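
The predicates above can be tried on a short list. This is only a sketch on invented markup; the class name dxd is reused from the table above.

```python
from lxml import etree

# Hand-written markup for illustration only.
html = etree.HTML("""
<ul>
  <li>one</li>
  <li class="dxd">two</li>
  <li>three</li>
  <li>four</li>
</ul>
""")

first = html.xpath('//li[1]/text()')                # positions start at 1, not 0
last = html.xpath('//li[last()]/text()')
second_last = html.xpath('//li[last()-1]/text()')
by_class = html.xpath('//li[@class="dxd"]/text()')

print(first)        # ['one']
print(last)         # ['four']
print(second_last)  # ['three']
print(by_class)     # ['two']
```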

Wildcards

@ Get an attribute value of the node
@href Get the link address (the value of the href attribute)
text() Get the text content of the node
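
A short sketch of @href and text() on an invented snippet. It also shows the difference between /text() (only the node's own text) and //text() (text of the node and all of its descendants), which matters in the examples later in this post.

```python
from lxml import etree

# Hand-written markup for illustration only.
html = etree.HTML('<p>Visit <a href="https://example.com">the site</a> now</p>')

href = html.xpath('//a/@href')        # attribute value
own_text = html.xpath('//p/text()')   # text directly inside <p>
all_text = html.xpath('//p//text()')  # text of <p> and everything below it

print(href)      # ['https://example.com']
print(own_text)  # ['Visit ', ' now']
print(all_text)  # ['Visit ', 'the site', ' now']
```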

Import

from lxml import etree  # load the third-party XPath library

How to handle tbody:

Browsers insert a tbody element when they normalize table markup, so it can show up in a path copied from the developer tools even when the page source does not contain it. When writing the path, we can simply leave tbody out.
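
One way to avoid depending on tbody at all is to put // between table and tr, so the path matches whether or not the source contains the element. A minimal sketch on two invented tables:

```python
from lxml import etree

# Two hand-written tables: one without <tbody>, one with it.
without_tbody = etree.HTML('<table><tr><td>x</td></tr></table>')
with_tbody = etree.HTML('<table><tbody><tr><td>x</td></tr></tbody></table>')

# // between table and tr skips over an optional tbody level.
path = '//table//tr/td/text()'

print(without_tbody.xpath(path))  # ['x']
print(with_tbody.xpath(path))     # ['x']
```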

Summary

// means any location in the document

* means any element

Text value: //*[text()='text value']

contains(): fuzzy match, //*[contains(@href,'baidu')]

starts-with(): matches values that start with xxx; useful when an id changes dynamically but keeps a fixed prefix
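
The three matching styles in the summary can be compared side by side. The ids and links below are invented for this sketch.

```python
from lxml import etree

# Hand-written markup for illustration only.
html = etree.HTML("""
<div>
  <a id="item_1001" href="https://www.baidu.com/s?wd=xpath">search</a>
  <a id="item_1002" href="https://example.com/">home</a>
  <span id="other_1">home</span>
</div>
""")

by_text = [el.tag for el in html.xpath('//*[text()="home"]')]      # exact text match
fuzzy = html.xpath('//*[contains(@href, "baidu")]/@href')          # substring match
prefixed = html.xpath('//*[starts-with(@id, "item_")]/@id')        # prefix match

print(by_text)   # ['a', 'span']
print(fuzzy)     # ['https://www.baidu.com/s?wd=xpath']
print(prefixed)  # ['item_1001', 'item_1002']
```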

Hands-on – Soufun.com – get the province, city, and city link for every entry

import requests
from lxml import etree
url = "https://esf.fang.com/newsecond/esfcities.aspx"
res = requests.get(url=url)
html = etree.HTML(res.text)  # XPath works on an HTML node tree, so the response string is first converted into an HTML object
details = html.xpath('.//*[@class="outCont"]/*/*')  # get every li node
items = []
for everyli in details:
    province = everyli.xpath("./strong/text()")
    city = everyli.xpath("./a/text()")
    cityUrl = everyli.xpath("./a/@href")
    item = {
        'province': province,
        'city': city,
        'cityUrl': cityUrl
    }
    items.append(item)
print(items)

Hands-on – Beijing new-home listings – related information

import requests
from lxml import etree
url = "https://newhouse.fang.com/house/s/?from=db"
res = requests.get(url=url)
html = etree.HTML(res.text)
details = html.xpath('.//*[@class="nl_con clearfix"]/*/*')
items = []
for li in details:
    house_name = li.xpath('.//*[@class="nlcd_name"]/a/text()')
    house_url = li.xpath('.//*[@class="nlcd_name"]/a/@href')
    house_type = li.xpath('.//*[@class="house_type clearfix"]/a/text()')  # .//*[@class="nl_con clearfix"]/*/*//*[@class="house_type clearfix"]/a/text()
    house_area = li.xpath('.//*[@class="address"]/a/text()')  # .//*[@class="nl_con clearfix"]/*/*//*[@class="address"]/a/text()
    housex_price = li.xpath('.//*[@class="nhouse_price"]//text()')  # .//*[@class="nl_con clearfix"]/*/*//*[@class="nhouse_price"]//text()
    house_price = [x.strip() for x in housex_price if x.strip() != '']
    phone = li.xpath('.//*[@class="tel"]//text()')  # .//*[@class="nl_con clearfix"]/*/*//*[@class="tel"]//text()
    phone_change = [x.strip() for x in phone if x.strip() != '']  # without the if, the list would contain empty strings
    if not house_name or not house_url:
        continue  # skip nodes without a listing name or link (avoids IndexError on [0])
    item = {
        "house_name": house_name[0].strip("\n\t"),
        "house_url": house_url[0],
        "house_type": house_type,
        "house_area": house_area,
        "house_price": house_price,
        "phone": phone_change
    }
    items.append(item)
print(items)
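
Because xpath() returns a list, indexing with [0] raises IndexError whenever a node has no match. One possible way to guard against this is a small helper that returns the first non-empty result or a default value. The helper name first_text and the sample markup are hypothetical, written only for this sketch, and are not part of lxml or of the site above.

```python
from lxml import etree


def first_text(node, path, default=""):
    """Return the first non-empty, stripped string result of an XPath query, or default."""
    for value in node.xpath(path):
        value = value.strip()
        if value:
            return value
    return default


# Hand-written markup: the second <li> has no nlcd_name node.
html = etree.HTML("""
<ul>
  <li><div class="nlcd_name"><a href="/h/1">
      Sample House
  </a></div></li>
  <li><div class="ad">banner</div></li>
</ul>
""")

names = [first_text(li, './/*[@class="nlcd_name"]/a/text()', default=None)
         for li in html.xpath('//li')]
print(names)  # ['Sample House', None]
```

The helper only makes sense for paths that return strings (text() or an attribute), since it calls strip() on each result.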

Tip – remove empty strings and whitespace-only elements (\n, \t, \r) from a list

  • Start with a list that contains empty and whitespace-only strings:

    list_eg = ['',' ','hello','\n','world','\t']
    print(list_eg)

  • The output is as follows:

    ['', ' ', 'hello', '\n', 'world', '\t']

  • After adding the list comprehension:

    list_eg = ['',' ','hello','\n','world','\t']
    list_eg_change = [x.strip() for x in list_eg if x.strip()!='']
    print(list_eg_change)

  • The output is as follows:

    ['hello', 'world']

  • The list comprehension is equivalent to the following loop:

    list_eg = ['',' ','hello','\n','world','\t']
    list_eg_change = []
    for i in list_eg:
        if i.strip() != '':
            i = i.strip()
            list_eg_change.append(i)
    print(list_eg_change)

  • The steps are:
    1. Traverse list_eg and call i.strip() on each element to remove leading and trailing whitespace (spaces, \n, \t, \r).
    2. If i.strip() is not an empty string, assign i.strip() to i.
    3. Append i to list_eg_change to collect the cleaned values.

    The syntax of a list comprehension is as follows:

    [expression for variable in iterable [if condition]]  # the [if condition] part is optional: it can be used or omitted
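
To see what the optional if part changes, the sketch below runs the same comprehension with and without it. The sample list is made up for illustration.

```python
values = [' a ', '', 'b\n', '\t']

# Without a condition: every element is stripped, but empty strings remain.
stripped = [v.strip() for v in values]

# With a condition: elements that are empty after stripping are dropped.
cleaned = [v.strip() for v in values if v.strip()]

print(stripped)  # ['a', '', 'b', '']
print(cleaned)   # ['a', 'b']
```

Writing `if v.strip()` is equivalent to `if v.strip() != ''` here, because an empty string is falsy in Python.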
    

Origin blog.csdn.net/jiuwencj/article/details/128961089