[Network Security: 100 Web Crawler Exercises] Exercise 11: Quickly locating and extracting data with XPath

Table of contents

1. Goal 1: Use etree to parse data

2. Goal 2: Use xpath to crawl specified data

3. Goal 3: Extract specified data

4. Small circle of network security


1. Goal 1: Use etree to parse data

The request code needs little introduction, since the previous exercises already walked through it.

import requests
from lxml import etree

def get_page():
    url = 'https://www.chinaz.com/'
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0",
    }

    # Fetch the page, decode the raw bytes as UTF-8, then parse the HTML
    res1 = requests.get(url, headers=headers, timeout=10)
    res = res1.content.decode('utf-8')
    tree = etree.HTML(res)

The line that parses the data is:

    tree = etree.HTML(res)
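
To see what etree.HTML hands back without touching the network, here is a minimal offline sketch. The markup and the name sample_html are made up for illustration; only etree.HTML and xpath come from the code above.

```python
from lxml import etree

# Hypothetical markup standing in for a downloaded page.
sample_html = "<ul><li>first</li><li>second</li></ul>"

# etree.HTML parses the string and returns the root element of the tree.
# The HTML parser adds the missing <html> and <body> wrappers itself.
tree = etree.HTML(sample_html)

print(tree.tag)                    # name of the root element
print(tree.xpath("//li/text()"))   # text of every <li> in the document
```

The returned object is an element, so xpath can be called on it directly, which is what the following steps rely on.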

Decode the returned bytes as UTF-8 first; otherwise the Chinese text comes out garbled:

    res = res1.content.decode('utf-8')
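
The garbling is easy to reproduce offline. The sketch below uses a hard-coded byte string in place of res1.content (which is also a bytes object); the sample text and variable names are assumptions made for the example.

```python
# Hypothetical response body: the bytes a server would send for UTF-8 text.
original = "站长之家"
raw = original.encode("utf-8")

decoded_ok = raw.decode("utf-8")      # matches the encoding the bytes were written in
decoded_bad = raw.decode("latin-1")   # wrong charset: each byte becomes its own character

print(decoded_ok == original)             # True when the charset matches
print(len(decoded_ok), len(decoded_bad))  # 4 characters versus 12 mis-decoded ones
```

Decoding with the wrong charset does not raise an error here, it silently produces the wrong characters, which is why the charset is stated explicitly.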



2. Goal 2: Use xpath to crawl specified data

Let's crawl the headlines on the home page.

Go down one level in the page structure.

You can see that each headline sits under its own li tag.

So their parent tag, ul, is effectively the list we want to collect.

Locate the XPath of one li.

Because we want every li under the ul, the path ends at li without an index.

The XPath is as follows:

    ul_list = tree.xpath('//*[@id="cz"]/div[2]/div[3]/div/div[1]/div[1]/div/div[2]/div[2]/div/ul/li')

Print the result to check it.
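
Printing a list of matched elements only shows object addresses, so it helps to serialize them as well. The following is a rough sketch against a made-up, heavily simplified page; the shortened path '//*[@id="cz"]//ul/li' is an assumption that fits this sample, not the path used on the real site.

```python
from lxml import etree

# Hypothetical, heavily simplified stand-in for the page structure.
sample_html = (
    '<div id="cz"><ul>'
    '<li><a href="/a">One</a></li>'
    '<li><a href="/b">Two</a></li>'
    '</ul></div>'
)
tree = etree.HTML(sample_html)

# Every <li> that is a direct child of a <ul> somewhere under id="cz".
ul_list = tree.xpath('//*[@id="cz"]//ul/li')

print(len(ul_list))   # number of matched <li> elements
print(ul_list)        # Element objects, shown with their memory addresses

# Serialize each element to see the markup it actually holds.
for li in ul_list:
    print(etree.tostring(li, encoding="unicode", with_tail=False))
```

If the printed list is empty, the path did not match, and it is usually worth checking it step by step from the id downwards.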

 



3. Goal 3: Extract specified data

Locate the XPath of the headline.

From each li, three more tags (div, div, a) lead down to the h2 tag.

Loop over each li and use text() to extract the headline text:

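    # Write each headline to the file 'test', one per line
    with open('test', 'w', encoding='utf-8') as f:
        for li in ul_list:
            # The leading './' makes the path relative to the current li
            titles = li.xpath('./div/div[1]/a/h2/text()')
            if not titles:
                continue  # skip items that have no h2 headline
            desc = titles[0]
            print(desc + '\n')
            f.write(str(desc) + '\n')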
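
Indexing the result of xpath with [0] raises IndexError whenever an li has no h2 under that path. One way to guard against this, sketched on made-up markup (sample_html and the headlines in it are hypothetical), is to check the returned list before using it:

```python
from lxml import etree

# Hypothetical list in which the second item has no <h2> headline.
sample_html = (
    '<ul>'
    '<li><div><div><a href="/1"><h2>First headline</h2></a></div></div></li>'
    '<li><div><div><a href="/2"></a></div></div></li>'
    '<li><div><div><a href="/3"><h2>Third headline</h2></a></div></div></li>'
    '</ul>'
)

titles = []
for li in etree.HTML(sample_html).xpath("//ul/li"):
    # The leading "." makes the path relative to this <li>.
    found = li.xpath("./div/div[1]/a/h2/text()")
    if not found:
        continue  # skip items without a headline instead of raising IndexError
    titles.append(found[0].strip())

print(titles)
```

Because xpath with text() always returns a list, an empty list is the signal that nothing matched for that item.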

Result

Full code

import requests
from lxml import etree


def get_page():
    url = 'https://www.chinaz.com/'
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0",
    }
    # Fetch the page and decode the raw bytes as UTF-8 to avoid garbled text
    res1 = requests.get(url, headers=headers, timeout=10)
    res = res1.content.decode('utf-8')
    # Parse the HTML and select every li under the headline ul
    tree = etree.HTML(res)
    ul_list = tree.xpath('//*[@id="cz"]/div[2]/div[3]/div/div[1]/div[1]/div/div[2]/div[2]/div/ul/li')
    # Write each headline to the file 'test', one per line
    with open('test', 'w', encoding='utf-8') as f:
        for li in ul_list:
            titles = li.xpath('./div/div[1]/a/h2/text()')
            if not titles:
                continue  # skip items that have no h2 headline
            desc = titles[0]
            print(desc + '\n')
            f.write(str(desc) + '\n')


if __name__ == '__main__':
    get_page()



4. Small circle of network security

README.md, shubansheng/Network Security Knowledge System, Practice Center (gitee.com): https://gitee.com/shubansheng/Treasure_knowledge/blob/master/README.md

GitHub, BLACKxZONE/Treasure_knowledge: https://github.com/BLACKxZONE/Treasure_knowledge


Origin blog.csdn.net/qq_53079406/article/details/131616857