[Python Crawler Road, Day 4]: XPath basics && parsing data with lxml and XPath && crawling Douban

I. XPath: XPath is a language for finding information in XML and HTML documents. It can traverse the elements and attributes of XML and HTML documents.
Chrome plug-in: XPath Helper
Firefox plug-in: Try XPath

XPath syntax:
Predicates:
A predicate is used to find a specific node, or a node that contains a specified value.
Predicates are written inside square brackets.
Using XPath: use "//" to search elements across the whole page, then write the tag name, then write a predicate to narrow the selection. For example: //div[@class="abc"]
Notes:
1. "/" selects direct child nodes; "//" selects all descendant nodes.
2. The contains function: use it when an attribute holds several values. For example: //div[contains(@class,"abject")]
3. Predicate indices start from 1, not 0.
For more detail on XPath, see: https://www.w3school.com.cn/xpath/xpath_syntax.asp
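
The syntax notes above can be tried out on a small document. The following is only a minimal sketch: the HTML snippet and its class names are made up for illustration, and lxml (introduced in the next section) is used to evaluate the expressions.

```python
from lxml import etree

# Hypothetical HTML snippet, made up for illustration only.
SAMPLE = """
<html><body>
  <div class="abc"><p>first</p><p>second</p></div>
  <div class="job abject"><p>third</p></div>
</body></html>
"""

root = etree.HTML(SAMPLE)

# Predicate on an exact attribute value.
exact = root.xpath('//div[@class="abc"]')

# contains() also matches when the attribute holds several values.
partial = root.xpath('//div[contains(@class, "abject")]')

# "/" selects direct children only; "//" selects all descendants.
body_children = root.xpath('/html/body/p')   # no <p> sits directly under <body>
all_paragraphs = root.xpath('//p')

# Predicate positions start at 1, so p[1] is the first <p> of each parent.
first_p_texts = root.xpath('//div/p[1]/text()')

print(len(exact), len(partial), len(body_children), len(all_paragraphs))
print(first_p_texts)
```

Note that //div[@class="abc"] would not match the second div, because @class="..." compares the whole attribute value; that is the case contains is meant for.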

II. Parsing HTML code with the lxml library:
1. To parse an HTML string, use "lxml.etree.HTML". Sample code:

from lxml import etree
# text is assumed to hold the HTML source as a string
html_element=etree.HTML(text)
print(etree.tostring(html_element,encoding='utf-8').decode('utf-8'))

2. To parse an HTML file, use "lxml.etree.parse". Sample code:

html_element = etree.parse("lagou.html")
print(etree.tostring(html_element, encoding='utf-8').decode('utf-8'))

This function uses the XML parser by default, so non-standard HTML sometimes raises an error. In that case, create an HTML parser yourself. The code is as follows:

parser=etree.HTMLParser(encoding='utf-8')
html_element = etree.parse("lagou.html",parser=parser)
print(etree.tostring(html_element, encoding='utf-8').decode('utf-8'))
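
The difference between the two parsers can be seen without any file on disk. This is a rough sketch under assumptions: the broken markup is invented, and io.StringIO stands in for a file such as "lagou.html".

```python
import io
from lxml import etree

# Hypothetical non-standard HTML: <br> and <p> are never closed.
BROKEN = "<div><p>hello<br>world</div>"

# The default XML parser rejects this markup.
try:
    etree.parse(io.StringIO(BROKEN))
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# An explicit HTML parser repairs the markup instead of failing.
parser = etree.HTMLParser()
tree = etree.parse(io.StringIO(BROKEN), parser=parser)
texts = tree.xpath("//p/text()")
line_breaks = tree.xpath("//br")

print(xml_ok)
print(texts)
```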

Once the lxml library has parsed the HTML code, data can be extracted from it with XPath.
Notes on combining lxml with XPath:
1. To use XPath syntax, call "element.xpath". Sample code:

from lxml import etree
parser=etree.HTMLParser(encoding='utf-8')
html = etree.parse("tengxun.html",parser=parser)
div=html.xpath("//div[2]")[0]
print(etree.tostring(div, encoding='utf-8').decode('utf-8'))
print(div)

# The xpath function returns a list, which is why [0] is used above to take the first match.
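
Because the result is a list, indexing with [0] fails when nothing matches. A small sketch of one way to guard against that; the helper name first is hypothetical and not part of lxml:

```python
from lxml import etree

root = etree.HTML("<div><span>only one</span></div>")

def first(nodes, default=None):
    """Return the first item of an xpath() result, or default if it is empty."""
    return nodes[0] if nodes else default

spans = root.xpath("//span")     # list with one element
tables = root.xpath("//table")   # empty list, nothing matched

print(type(spans).__name__, len(spans), len(tables))
print(first(tables))             # default value instead of an IndexError
```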
2. Get a tag's attribute. Example:

href=html.xpath("//a/@href")
# Get the value of the href attribute of every a tag
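
A short sketch of what the attribute query returns, using an invented snippet. It also shows the element's get method as an alternative way to read one attribute:

```python
from lxml import etree

# Hypothetical markup: the second <a> has no href attribute.
root = etree.HTML('<p><a href="/a">A</a><a>no link</a><a href="/b">B</a></p>')

# "@href" yields one string per <a> that actually has the attribute.
hrefs = root.xpath("//a/@href")

# Reading the attribute from a single element.
first_link = root.xpath("//a")[0]
href_via_get = first_link.get("href")

print(hrefs)
print(href_via_get)
```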

3. Get the text, using the "text()" function in XPath. Example:
(Note: to get descendant elements under a given tag, add "." before the "/".)

# tr is a <tr> element selected earlier
address=tr.xpath("./td[4]/text()")[0]
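
The effect of the leading "." is easiest to see side by side. A minimal sketch, assuming a made-up two-row table:

```python
from lxml import etree

# Hypothetical table, made up for illustration.
TABLE = """
<table>
  <tr><td>Engineer</td><td>Tech</td><td>2</td><td>Shenzhen</td></tr>
  <tr><td>Designer</td><td>Design</td><td>1</td><td>Beijing</td></tr>
</table>
"""

root = etree.HTML(TABLE)
rows = root.xpath("//tr")

# "./td[4]" is relative to each row, so every row yields its own fourth cell.
addresses = [tr.xpath("./td[4]/text()")[0] for tr in rows]

# Without the leading ".", "//td[4]" searches the whole document again,
# even though it is called on a single row.
from_first_row = rows[0].xpath("//td[4]/text()")

print(addresses)
print(from_first_row)
```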

Example: crawling information about upcoming movies from Douban:

from lxml import etree
import requests
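url="https://movie.douban.com/"
# headers was not defined in the original notes; a browser-like User-Agent is assumed here
headers={"User-Agent":"Mozilla/5.0"}
response=requests.get(url,headers=headers)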
text=response.text
html=etree.HTML(text)
ul=html.xpath("//ul[@class='ui-slide-content']")[0]
#print(etree.tostring(ul,encoding='utf-8').decode('utf-8'))
lis=ul.xpath("./li")
movies=[]
for li in lis:
    #print(etree.tostring(li,encoding='utf-8').decode('utf-8'))
    title=li.xpath("@data-title")[0]
    year=li.xpath("@data-release")[0]
    director=li.xpath("@data-director")[0]
    actors=li.xpath("@data-actors")[0]
    picture=li.xpath(".//img/@src")
    movie={"title":title,
           "year":year,
           "director":director,
           "actors":actors,
           "picture":picture}
    movies.append(movie)
print(movies)

The results are as follows:
C:\python38\python.exe "C:/python38/new new Project/mydi/day4.py"
[{'title': 'Six - The Six Chinese survivors of the Titanic', 'year': '2020', 'director': 'Luo Fei', 'actors': 'Schwank', 'picture': ['https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2581067467.jpg']},
 {'title': 'train to Spring', 'year': '2019', 'director': 'Lee Ji', 'actors': 'any element xi / Lee Min City / Chen Yu Star', 'picture': ['https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2574813382.jpg']},
 {'title': 'Buqi not met', 'year': '2020', 'director': 'Jinguang Li', 'actors': 'Xiao Xu / Zhang Xueheng / Zhangwei Wei', 'picture': ['https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2572668239.jpg']},
 {'title': 'legitimate partner', 'year': '2019', 'director': 'Huang Lei', 'actors': 'Aarif / Sandrine Pinna / white passenger', 'picture': ['https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2581586285.jpg']},
 {'title': 'Halfmoon Alice', 'year': '2020', 'director': 'Zhanglin Zi', 'actors': 'off Xiaotong / Huangjing Yu / officer hung', 'picture': ['https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2580211645.jpg']},
 {'title': 'gold Chan fell monsters', 'year': '2020', 'director': 'Pang', 'actors': 'Shih Hsiao / Hu Jun / Yaoxing Tong', 'picture': ['https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2564190636.jpg']},
 {'title': 'big red', 'year': '2020', 'director': 'Like Long', 'actors': 'Bell bag / Clara Lee / ice Jia', 'picture': ['https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2581346773.jpg']},
 {'title': '- shaking Yao', 'year': '2020', 'director': 'Ma Yong', 'actors': 'have formatting / Jiangyong Bo / oct Zuoyu', 'picture': ['https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2568504230.jpg']},
 {'title': 'colorful', 'year': '2020', 'director': 'Juan', 'actors': 'Zhu Zhu / Amy Irving / Li Masao', 'picture': ['https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2580890062.jpg']},
 {'title': 'magic Kingdom enchanted', 'year': '2020', 'director': 'display', 'actors': 'Lu Yao / Zhang Yang / Chen Yue', 'picture': ['https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2577837112.jpg']}]

Process finished with exit code 0
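
The extraction loop above can also be tried without network access. The sketch below is not the original script: the function name parse_movies and the HTML snippet are hypothetical, and the snippet only imitates the list structure used above rather than real Douban markup. It also returns an empty list when the expected ul is missing, instead of failing on [0].

```python
from lxml import etree

# Hypothetical snippet imitating the structure queried above; not real Douban markup.
SNIPPET = """
<ul class="ui-slide-content">
  <li data-title="Movie A" data-release="2020" data-director="Dir A" data-actors="X / Y">
    <img src="https://example.com/a.jpg">
  </li>
  <li data-title="Movie B" data-release="2019" data-director="Dir B" data-actors="Z">
    <img src="https://example.com/b.jpg">
  </li>
</ul>
"""

def parse_movies(text):
    """Extract one dict per <li> of the ui-slide-content list."""
    html = etree.HTML(text)
    uls = html.xpath("//ul[@class='ui-slide-content']")
    if not uls:  # the list is missing, e.g. the page layout changed
        return []
    movies = []
    for li in uls[0].xpath("./li"):
        movies.append({
            "title": li.get("data-title"),
            "year": li.get("data-release"),
            "director": li.get("data-director"),
            "actors": li.get("data-actors"),
            "picture": li.xpath(".//img/@src"),  # a list, as in the output above
        })
    return movies

movies = parse_movies(SNIPPET)
print(movies)
```

Keeping the parsing in a function that takes a string means the same code can be fed response.text from requests or a saved copy of the page.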

Crawling is a lot easier if you have some related front-end knowledge.
Summary:
After four days of study I can already write lightweight crawlers. I will keep updating these study notes, and fellow beginners are welcome to work hard together with me.

Origin blog.csdn.net/dinnersize/article/details/104320276