A Python crawler case study: parsing data with XPath

Article directory

  • Basic concepts of XPath

  • XPath parsing principle

  • Environment installation

  • How to instantiate an etree object

  • xpath('xpath expression')

  • XPath example: crawling 58.com second-hand housing listings

  • Target URL

  • Complete code

  • Results

  • XPath example: parsing and downloading images

  • Target URL

  • Complete code

  • Results

  • XPath example: crawling city names nationwide

  • Target URL

  • Complete code

  • Results

  • XPath example: crawling resume templates

  • Target URL

  • Complete code

  • Results

Basic concepts of XPath

XPath parsing is the most commonly used data-parsing method: convenient, efficient, and highly versatile.

XPath parsing principle

1. Instantiate an etree object and load the page source to be parsed into it.

2. Call the object's xpath method with an XPath expression to locate tags and capture content.

Environment installation

pip install lxml

How to instantiate an etree object:

from lxml import etree

1. Load the source data of a local HTML file into the etree object:

etree.parse(filePath)

2. Load page source obtained from the Internet into the object:

etree.HTML(page_text)
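The two instantiation routes can be sketched as follows; the inline HTML string and the file path are made up for illustration:

```python
from lxml import etree

# Route 2: build a tree from HTML source already in memory
page_text = '<html><body><div class="song"><p>hello</p></div></body></html>'
tree = etree.HTML(page_text)
print(tree.xpath('//p/text()')[0])  # hello

# Route 1: build a tree from a local HTML file (hypothetical path)
# tree = etree.parse('./page.html')
```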

xpath('xpath expression')

  • / : positions from the root node; each / represents one level

  • // : represents multiple levels; positioning can start from any position in the document

  • Attribute positioning: //div[@class='song'] — the general form is tag[@attrName='attrValue']

  • Index positioning: //div[@class='song']/p[3] — indexing starts from 1

  • Get text:

    • /text() gets the direct text content of a tag

    • //text() gets all text content inside a tag, including text nested in child tags

  • Get attributes: /@attrName, e.g. //img/@src
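A minimal sketch exercising the expressions above on an inline HTML snippet (the snippet itself is made up to mirror the //div[@class="song"] examples):

```python
from lxml import etree

# Hypothetical page with the div/p/a/img structure used in the examples
html = '''
<html><body>
  <div class="song">
    <p>first</p><p>second</p><p>third</p>
    <a href="/detail"><span>nested</span> direct</a>
    <img src="/img/cover.jpg" alt="cover"/>
  </div>
</body></html>'''
tree = etree.HTML(html)

print(tree.xpath('//div[@class="song"]/p[3]/text()'))  # ['third'] -- index starts at 1
print(tree.xpath('//div[@class="song"]/a/text()'))     # direct text of <a> only
print(tree.xpath('//div[@class="song"]/a//text()'))    # all text, including the nested <span>
print(tree.xpath('//div[@class="song"]/img/@src'))     # ['/img/cover.jpg']
```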

XPath example: crawling 58.com second-hand housing listings

Target URL

https://xa.58.com/ershoufang/

Complete code

from lxml import etree
import requests

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    url = 'https://xa.58.com/ershoufang/'
    # Fetch the listing page and build an etree object from its source
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Each listing is a div under the section whose class is "list"
    div_list = tree.xpath('//section[@class="list"]/div')
    fp = open('./58同城二手房.txt', 'w', encoding='utf-8')
    for div in div_list:
        # The title is the direct text of the h3 inside the title div
        title = div.xpath('.//div[@class="property-content-title"]/h3/text()')[0]
        print(title)
        fp.write(title + '\n\n')
    fp.close()

XPath example: parsing and downloading images

Target URL

https://pic.netbian.com/4kmeinv/

Complete code

import requests, os
from lxml import etree

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    url = 'https://pic.netbian.com/4kmeinv/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Each thumbnail is an <a> inside the list items of the "slist" div
    li_list = tree.xpath('//div[@class="slist"]/ul/li/a')
    if not os.path.exists('./piclibs'):
        os.mkdir('./piclibs')
    for li in li_list:
        detail_url = 'https://pic.netbian.com' + li.xpath('./img/@src')[0]
        detail_name = li.xpath('./img/@alt')[0] + '.jpg'
        # The site serves GBK-encoded text; re-decode the alt text to avoid mojibake
        detail_name = detail_name.encode('iso-8859-1').decode('GBK')
        detail_path = './piclibs/' + detail_name
        # Fetch the image bytes and write them out
        detail_data = requests.get(url=detail_url, headers=headers).content
        with open(detail_path, 'wb') as fp:
            fp.write(detail_data)
            print(detail_name, 'success!!')

XPath example: crawling city names nationwide

Target URL

https://www.aqistudy.cn/historydata/

Complete code

import requests
from lxml import etree

if __name__ == '__main__':
    url = 'https://www.aqistudy.cn/historydata/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    }
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    tree = etree.HTML(page_text)
    # Hot cities: //div[@class="bottom"]/ul/li
    # All cities: //div[@class="bottom"]/ul/div[2]/li
    # The | operator unions both expressions in a single query
    a_list = tree.xpath('//div[@class="bottom"]/ul/li | //div[@class="bottom"]/ul/div[2]/li')
    fp = open('./citys.txt', 'w', encoding='utf-8')
    i = 0
    for a in a_list:
        city_name = a.xpath('.//a/text()')[0]
        fp.write(city_name + '\t')
        # Start a new line after every six city names
        i = i + 1
        if i == 6:
            i = 0
            fp.write('\n')
    fp.close()
    print('Crawl finished')

 

XPath example: crawling resume templates

Target URL

https://sc.chinaz.com/jianli/free.html

Complete code

import requests, os
from lxml import etree

if __name__ == '__main__':
    url = 'https://sc.chinaz.com/jianli/free.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    }
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    tree = etree.HTML(page_text)
    # Each template card is an <a> inside a "box col3 ws_block" div
    a_list = tree.xpath('//div[@class="box col3 ws_block"]/a')
    if not os.path.exists('./简历模板'):
        os.mkdir('./简历模板')
    for a in a_list:
        # Follow each card to its detail page
        detail_url = 'https:' + a.xpath('./@href')[0]
        detail_page_text = requests.get(url=detail_url, headers=headers).content.decode('utf-8')
        detail_tree = etree.HTML(detail_page_text)
        # Take the first download mirror from the download list
        detail_a_list = detail_tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li[1]/a')
        for download_a in detail_a_list:
            download_name = detail_tree.xpath('//div[@class="ppt_tit clearfix"]/h1/text()')[0]
            download_url = download_a.xpath('./@href')[0]
            download_data = requests.get(url=download_url, headers=headers).content
            download_path = './简历模板/' + download_name + '.rar'
            with open(download_path, 'wb') as fp:
                fp.write(download_data)
                print(download_name, 'success!!')

Origin blog.csdn.net/qiqi1220/article/details/128669418