XPath parsing
XPath expressions
Requires the lxml library (pip install lxml)
An XPath expression is the locating syntax used to select specific nodes, attributes, or text under a target HTML node. An XPath expression is mainly built from three parts: a path expression, predicates, and wildcards.
Path expression
Expression | Meaning |
---|---|
// | Select descendant nodes of the current node, at any depth |
/ | Select child nodes of the current node |
. | Select the current node |
.. | Select the parent node of the current node |
@ | Select attributes |
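To make the table concrete, here is a minimal sketch that applies each path expression to a small HTML fragment. The fragment and the variable names are made up for illustration; only `etree.HTML` and `xpath` come from lxml.

```python
from lxml import etree

# Hypothetical HTML fragment, used only to illustrate the expressions.
html_text = """
<div id="box">
  <ul>
    <li><a href="/a">first</a></li>
    <li><a href="/b">second</a></li>
  </ul>
</div>
"""
root = etree.HTML(html_text)

# //  search descendants at any depth
links = root.xpath('//a')

# /   step down to direct children only
items = root.xpath('//ul/li')

# .   evaluate relative to the current node
first_li = items[0]
a_in_first = first_li.xpath('./a')

# ..  move up to the parent node
parent_tag = first_li.xpath('..')[0].tag

# @   select an attribute value
hrefs = root.xpath('//a/@href')

print(len(links), len(items), a_in_first[0].text, parent_tag, hrefs)
```

Note that `xpath` always returns a list here, even when only one node matches.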
Predicate
A predicate selects a node by its position among its siblings, or by the value of one of its attributes (the latter is used together with @).
Predicate | Meaning |
---|---|
[n] | The n-th sibling node (numbering starts at 1) |
[last()] | The last sibling node |
[last()-n] | The n-th sibling node counting back from the last; n is a number you choose, e.g. [last()-1] is the second-to-last node |
[@class="dxd"] | The node whose class attribute is dxd |
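The predicates above can be tried on a short list. This is a sketch on invented markup; the class name `dxd` is reused from the table, everything else is an assumption for illustration.

```python
from lxml import etree

# Hypothetical list with four sibling <li> nodes.
html_text = """
<ul>
  <li class="dxd">one</li>
  <li>two</li>
  <li>three</li>
  <li class="dxd">four</li>
</ul>
"""
root = etree.HTML(html_text)

first = root.xpath('//li[1]/text()')                  # positions start at 1, not 0
last = root.xpath('//li[last()]/text()')              # last sibling
second_to_last = root.xpath('//li[last()-1]/text()')  # one before the last
by_class = root.xpath('//li[@class="dxd"]/text()')    # filter by attribute value

print(first, last, second_to_last, by_class)
```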
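Wildcard
Expression | Meaning |
---|---|
* | Match any element node |
@ | Get an attribute value of the node |
@href | Get the link URL stored in the href attribute (for an image, use @src) |
text() | Get the text content of the node |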
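A short sketch of how attribute values and text are extracted in practice. The HTML string and URLs are placeholders chosen for illustration.

```python
from lxml import etree

# Hypothetical paragraph containing a link and an image.
html_text = (
    '<p>See <a href="https://example.com/page">the page</a> '
    'and <img src="/logo.png" alt="logo"></p>'
)
root = etree.HTML(html_text)

link_url = root.xpath('//a/@href')      # attribute value of the link
image_url = root.xpath('//img/@src')    # images keep their URL in src, not href
link_text = root.xpath('//a/text()')    # text directly inside <a>
all_text = root.xpath('//p//text()')    # every text node under <p>, nested ones included
child_tags = [el.tag for el in root.xpath('//p/*')]  # * matches any child element

print(link_url, image_url, link_text, child_tags)
```

The difference between `/text()` and `//text()` matters when a node has nested children: the first returns only the node's own text, the second also returns text from its descendants.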
Import
from lxml import etree  # load the third-party library that provides XPath support
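How to handle tbody
Browsers often insert a <tbody> element into a table when they build the page, even if the raw HTML sent by the server does not contain one. An XPath copied from the browser's developer tools can therefore include a tbody step that does not exist in the response text. When writing the path, simply leave tbody out.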
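The effect can be reproduced offline. The sketch below assumes a raw HTML table without `<tbody>`, as a server might send it; the table content is invented.

```python
from lxml import etree

# Hypothetical raw HTML: note that the source contains no <tbody>.
html_text = "<table><tr><td>1</td><td>2</td></tr></table>"
root = etree.HTML(html_text)

# Path as copied from browser developer tools: includes tbody, matches nothing here.
with_tbody = root.xpath('//table/tbody/tr/td/text()')

# Same path with the tbody step left out.
without_tbody = root.xpath('//table/tr/td/text()')

# Using // between table and tr works whether or not tbody is present.
either_way = root.xpath('//table//tr/td/text()')

print(with_tbody, without_tbody, either_way)
```

If a page really does ship `<tbody>` in its source, the `//table//tr` form is one way to write a path that tolerates both cases.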
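Summary
// selects nodes at any depth in the document
* matches any element
Text value: //*[text()='text value']
contains(): fuzzy match, e.g. //*[contains(@href,'baidu')]
starts-with(): matches values that begin with a given prefix, which is useful when an id changes, e.g. //*[starts-with(@id,'xxx')]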
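The three matching styles from the summary, run against invented markup. The ids and URLs below are assumptions made up for the example.

```python
from lxml import etree

# Hypothetical fragment: two links with generated ids and one unrelated span.
html_text = """
<div>
  <a id="item-1001" href="https://www.baidu.com/s?wd=xpath">search</a>
  <a id="item-1002" href="https://example.com/">home</a>
  <span id="other">search</span>
</div>
"""
root = etree.HTML(html_text)

# Exact text match on any element
exact_text = [el.tag for el in root.xpath('//*[text()="search"]')]

# Fuzzy match: href contains the substring "baidu"
baidu_links = root.xpath('//*[contains(@href, "baidu")]/@href')

# Prefix match: handy when only the tail of an id changes
dynamic_ids = root.xpath('//*[starts-with(@id, "item-")]/@id')

print(exact_text, baidu_links, dynamic_ids)
```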
Hands-on – SouFun.com – get every province, city, and city link
import requests
from lxml import etree

url = "https://esf.fang.com/newsecond/esfcities.aspx"
res = requests.get(url=url)
# XPath works on an HTML node tree, so convert the response text (a string) into an HTML object first
html = etree.HTML(res.text)
# Get every li node
details = html.xpath('.//*[@class="outCont"]/*/*')
items = []
for everyli in details:
    # Each xpath() call returns a list, relative to the current li
    province = everyli.xpath("./strong/text()")
    city = everyli.xpath("./a/text()")
    cityUrl = everyli.xpath("./a/@href")
    item = {
        'province': province,
        'city': city,
        'cityUrl': cityUrl
    }
    items.append(item)
print(items)
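Because every `xpath` call returns a list, each item above holds lists rather than single values. One possible way to flatten them into one record per city is sketched below. It runs offline on invented markup that only imitates the `outCont` structure; the real page may differ, and the province names and URLs are placeholders.

```python
from lxml import etree

# Hypothetical markup imitating the structure targeted above; not the real page.
html_text = """
<div class="outCont">
  <ul>
    <li><strong>Hebei</strong><a href="https://sjz.example.com">Shijiazhuang</a><a href="https://ts.example.com">Tangshan</a></li>
    <li><strong>Shanxi</strong><a href="https://ty.example.com">Taiyuan</a></li>
  </ul>
</div>
"""
root = etree.HTML(html_text)

rows = []
for li in root.xpath('.//*[@class="outCont"]/*/*'):
    province = li.xpath('./strong/text()')
    # Guard against an li without a <strong> child instead of indexing blindly
    province_name = province[0].strip() if province else ''
    names = li.xpath('./a/text()')
    urls = li.xpath('./a/@href')
    # Pair each city name with its link, one record per city
    for name, url in zip(names, urls):
        rows.append({'province': province_name, 'city': name.strip(), 'cityUrl': url})

print(rows)
```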
Hands-on – Beijing new house listings – related information
import requests
from lxml import etree

url = "https://newhouse.fang.com/house/s/?from=db"
res = requests.get(url=url)
html = etree.HTML(res.text)
details = html.xpath('.//*[@class="nl_con clearfix"]/*/*')
items = []
for li in details:
    house_name = li.xpath('.//*[@class="nlcd_name"]/a/text()')
    house_url = li.xpath('.//*[@class="nlcd_name"]/a/@href')
    # Full path from the root: .//*[@class="nl_con clearfix"]/*/*//*[@class="house_type clearfix"]/a/text()
    house_type = li.xpath('.//*[@class="house_type clearfix"]/a/text()')
    # Full path from the root: .//*[@class="nl_con clearfix"]/*/*//*[@class="address"]/a/text()
    house_area = li.xpath('.//*[@class="address"]/a/text()')
    # Full path from the root: .//*[@class="nl_con clearfix"]/*/*//*[@class="nhouse_price"]//text()
    housex_price = li.xpath('.//*[@class="nhouse_price"]//text()')
    house_price = [x.strip() for x in housex_price if x.strip() != '']
    # Full path from the root: .//*[@class="nl_con clearfix"]/*/*//*[@class="tel"]//text()
    phone = li.xpath('.//*[@class="tel"]//text()')
    # Without the if clause, the list would contain empty elements
    phone_change = [x.strip() for x in phone if x.strip() != '']
    item = {
        "house_name": house_name[0].strip("\n\t"),
        "house_url": house_url[0],
        "house_type": house_type,
        "house_area": house_area,
        "house_price": house_price,
        "phone": phone_change
    }
    items.append(item)
print(items)
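As an alternative to stripping each text node with a list comprehension, XPath 1.0 has a `normalize-space()` function that trims and collapses whitespace in one call. The sketch below compares both approaches on an invented price block; the class name `nhouse_price` is taken from the code above, while the markup and values are assumptions.

```python
from lxml import etree

# Hypothetical price block with the kind of whitespace a real page tends to contain.
html_text = """
<div class="nhouse_price">
    <span>52000</span>
    <em>yuan/m2</em>
</div>
"""
root = etree.HTML(html_text)

# Approach used above: collect every text node, then drop the whitespace-only ones
raw_parts = root.xpath('.//*[@class="nhouse_price"]//text()')
cleaned_parts = [x.strip() for x in raw_parts if x.strip() != '']

# Alternative: let XPath join and clean the text; the result is one string, not a list
price_text = root.xpath('normalize-space(.//*[@class="nhouse_price"])')

print(cleaned_parts)
print(price_text)
```

The list form keeps the pieces separate, which is convenient when number and unit should be stored in different fields. The `normalize-space()` form is shorter when a single display string is enough.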
Tip – remove empty elements and \n, \t, \r elements from a list
list_eg = ['',' ','hello','\n','world','\t']
print(list_eg)
The output is as follows:
['', ' ', 'hello', '\n', 'world', '\t']
After adding the list comprehension:
list_eg = ['',' ','hello','\n','world','\t']
list_eg_change = [x.strip() for x in list_eg if x.strip()!='']
print(list_eg_change)
The output is as follows:
['hello', 'world']
The list comprehension is equivalent to the following loop:
list_eg = ['',' ','hello','\n','world','\t']
list_eg_change = []
for i in list_eg:
    if i.strip() != '':
        i = i.strip()
        list_eg_change.append(i)
print(list_eg_change)
The steps are:
1. Traverse the list list_eg and call i.strip() on each element to remove the whitespace around it.
2. If i.strip() is not an empty string, assign i.strip() to i.
3. Call list_eg_change.append(i) to collect the desired data.
The syntax format of a list comprehension is as follows:
[expression for variable in iterable [if condition]]  # the [if condition] part is optional: it can be used or omitted
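To show what the optional `if` part changes, here is a small sketch that runs the same comprehension with and without it. The variable names are chosen for illustration.

```python
words = ['', ' ', 'hello', '\n', 'world', '\t']

# Without the if part: every element is transformed, nothing is filtered out
stripped_all = [w.strip() for w in words]

# With the if part: elements that are empty after stripping are skipped
non_empty = [w.strip() for w in words if w.strip()]

print(stripped_all)  # ['', '', 'hello', '', 'world', '']
print(non_empty)     # ['hello', 'world']
```

Writing `if w.strip()` relies on an empty string being falsy in Python, so it behaves the same as `if w.strip() != ''`.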