A Python crawler case study to help you master XPath data parsing!

Article Directory

  • Basic concepts of XPath
  • Principle of XPath parsing
  • Environment installation
  • How to instantiate an etree object:
  • xpath('xpath expression')
  • XPath example: crawling 58.com second-hand housing listings
  • Crawl URL
  • Complete code
  • Effect picture
  • XPath example: parsing and downloading images
  • Crawl URL
  • Complete code
  • Effect picture
  • XPath example: crawling national city names
  • Crawl URL
  • Complete code
  • Effect picture
  • XPath example: crawling resume templates
  • Crawl URL
  • Complete code
  • Effect picture

Basic concepts of XPath
XPath parsing: the most commonly used, most convenient, and most efficient way to parse a page, with strong generality.

Principle of XPath parsing
1. Instantiate an etree object and load the page source data to be parsed into that object.
2. Call the xpath method of the etree object together with an XPath expression to locate tags and capture their content.
Environment installation

pip install lxml



How to instantiate an etree object:

from lxml import etree

1. Load the source data of a local HTML file into the etree object:

etree.parse(filePath)



2. Load page source data obtained from the Internet into the object:

etree.HTML(page_text)
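
Putting the two steps of the parsing principle together, here is a minimal sketch; the local file name ./test.html and the URL https://example.com are placeholders for illustration only:

import requests
from lxml import etree

# Step 1a: instantiate an etree object from a local HTML file
# (./test.html is a placeholder file name)
local_tree = etree.parse('./test.html', etree.HTMLParser())

# Step 1b: instantiate an etree object from page source fetched over the network
page_text = requests.get('https://example.com').text
tree = etree.HTML(page_text)

# Step 2: call xpath() with an XPath expression to locate tags and capture content
print(tree.xpath('//title/text()'))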


xpath('xpath expression')
- / : locate from the root node; represents a single level
- // : represents multiple levels; can also mean start locating from any position
- Attribute positioning: //div[@class='song'], i.e. tag[@attrName='attrValue']
- Index positioning: //div[@class='song']/p[3] (the index starts from 1)
- Take the text:
    - /text() gets the direct text content inside the tag
    - //text() gets the non-direct text content inside the tag (all text content)
- Take the attribute:
    - /@attrName, e.g. //img/@src

Each of these forms is demonstrated in the short sketch below.
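
A minimal sketch of these expression forms, run against a small made-up HTML snippet (the snippet and variable names are only for illustration):

from lxml import etree

# A small, made-up HTML snippet used only to demonstrate the expression forms
html = '''
<div class="song">
    <p>first</p>
    <p>second</p>
    <p>third</p>
    <img src="/img/cover.jpg" alt="cover"/>
</div>
'''
tree = etree.HTML(html)  # the parser wraps the snippet in <html><body> automatically

print(tree.xpath('/html/body/div'))                    # / : one level at a time from the root
print(tree.xpath('//div[@class="song"]'))              # // plus attribute positioning
print(tree.xpath('//div[@class="song"]/p[3]/text()'))  # index positioning (starts at 1) + direct text
print(tree.xpath('//div[@class="song"]//text()'))      # all text content under the div
print(tree.xpath('//div[@class="song"]/img/@src'))     # take an attribute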

XPath example: crawling 58.com second-hand housing listings
Crawl URL
https://xa.58.com/ershoufang/
Complete code

from lxml import etree
import requests

if __name__ == '__main__':
    # UA camouflage: pretend to be a normal browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    url = 'https://xa.58.com/ershoufang/'
    # Fetch the page source
    page_text = requests.get(url=url, headers=headers).text
    # Instantiate an etree object from the page source
    tree = etree.HTML(page_text)
    # Locate every listing entry
    div_list = tree.xpath('//section[@class="list"]/div')
    fp = open('./58同城二手房.txt', 'w', encoding='utf-8')
    for div in div_list:
        # Extract the listing title from each entry
        title = div.xpath('.//div[@class="property-content-title"]/h3/text()')[0]
        print(title)
        fp.write(title + '\n' + '\n')
    fp.close()

Effect picture (screenshot omitted)


XPath example: parsing and downloading images
Crawl URL
https://pic.netbian.com/4kmeinv/
Complete code
 

import requests, os
from lxml import etree

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    url = 'https://pic.netbian.com/4kmeinv/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Locate the <a> element of every picture in the list
    li_list = tree.xpath('//div[@class="slist"]/ul/li/a')
    # Create the output directory if it does not exist yet
    if not os.path.exists('./piclibs'):
        os.mkdir('./piclibs')
    for li in li_list:
        detail_url = 'https://pic.netbian.com' + li.xpath('./img/@src')[0]
        detail_name = li.xpath('./img/@alt')[0] + '.jpg'
        # Fix the mojibake in the Chinese file name (the page is GBK-encoded)
        detail_name = detail_name.encode('iso-8859-1').decode('GBK')
        detail_path = './piclibs/' + detail_name
        # Download the binary image data and save it
        detail_data = requests.get(url=detail_url, headers=headers).content
        with open(detail_path, 'wb') as fp:
            fp.write(detail_data)
            print(detail_name, 'success!!')

Effect picture (screenshot omitted)
XPath example: crawling national city names
Crawl URL
https://www.aqistudy.cn/historydata/
Complete code

import requests
from lxml import etree

if __name__ == '__main__':
    url = 'https://www.aqistudy.cn/historydata/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    }
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    tree = etree.HTML(page_text)
    # Hot cities:  //div[@class="bottom"]/ul/li
    # All cities:  //div[@class="bottom"]/ul/div[2]/li
    # The | operator unions both expressions so one call grabs every city
    a_list = tree.xpath('//div[@class="bottom"]/ul/li | //div[@class="bottom"]/ul/div[2]/li')
    fp = open('./citys.txt', 'w', encoding='utf-8')
    i = 0
    for a in a_list:
        city_name = a.xpath('.//a/text()')[0]
        fp.write(city_name + '\t')
        # Start a new line after every six city names
        i = i + 1
        if i == 6:
            i = 0
            fp.write('\n')
    fp.close()
    print('Crawl finished')

Effect picture (screenshot omitted)


XPath example: crawling resume templates
Crawl URL
https://sc.chinaz.com/jianli/free.html
Complete code

import requests, os
from lxml import etree

if __name__ == '__main__':
    url = 'https://sc.chinaz.com/jianli/free.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    }
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    tree = etree.HTML(page_text)
    # Locate the <a> element of every template on the list page
    a_list = tree.xpath('//div[@class="box col3 ws_block"]/a')
    # Create the output directory if it does not exist yet
    if not os.path.exists('./简历模板'):
        os.mkdir('./简历模板')
    for a in a_list:
        # Follow the link to each template's detail page
        detail_url = 'https:' + a.xpath('./@href')[0]
        detail_page_text = requests.get(url=detail_url, headers=headers).content.decode('utf-8')
        detail_tree = etree.HTML(detail_page_text)
        # The first <li> in the download list holds the primary download link
        detail_a_list = detail_tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li[1]/a')
        for detail_a in detail_a_list:
            download_name = detail_tree.xpath('//div[@class="ppt_tit clearfix"]/h1/text()')[0]
            download_url = detail_a.xpath('./@href')[0]
            # Download the archive and save it under the template's title
            download_data = requests.get(url=download_url, headers=headers).content
            download_path = './简历模板/' + download_name + '.rar'
            with open(download_path, 'wb') as fp:
                fp.write(download_data)
                print(download_name, 'success!!')

Effect picture (screenshot omitted)

Origin: blog.csdn.net/Python_kele/article/details/115218850