Table of contents
1. Goal 1: Use etree to parse data
2. Goal 2: Use xpath to crawl specified data
3. Goal 3: Extract specified data
4. Small circle of network security
1. Goal 1: Use etree to parse data
The rest don’t need to be introduced too much, the previous exercises have been gone through for everyone
def get_page():
url = 'https://www.chinaz.com/'
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0",
}
res1 = requests.get(url, headers=headers, timeout=10)
res = res1.content.decode('utf-8')
tree = etree.HTML(res)
The data parsing code is as follows
tree = etree.HTML(res)
Perform UTF-8 decoding on the returned content, otherwise there will be garbled characters
res = res1.content.decode('utf-8')
2. Goal 2: Use xpath to crawl specified data
Let's crawl through these headlines
Find the next level
You can see that they are all under different li tags
So their upper level label ul is equivalent to our list collection
Locate the xpath path
Locate the xpath path of li
Because we want to get all li lists under ul
The xpath path is as follows
list = tree.xpath('//*[@id="cz"]/div[2]/div[3]/div/div[1]/div[1]/div/div[2]/div[2]/div/ul/li')
print it out to see
3. Goal 3: Extract specified data
location xpath
Then there are 3 more tags to reach the h2 tag
Traverse each target tag and convert to text() format
f = open('test', 'w', encoding ='utf-8')
for l in ul_list:
desc = l.xpath('./div/div[1]/a/h2/text()')[0]
print(desc + '\n')
f.write(str(desc) + '\n')
f.close()
operation result
full code
import requests
from lxml import etree
def get_page():
url = 'https://www.chinaz.com/'
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0",
}
res1 = requests.get(url, headers=headers, timeout=10)
res = res1.content.decode('utf-8')
tree = etree.HTML(res)
ul_list = tree.xpath('//*[@id="cz"]/div[2]/div[3]/div/div[1]/div[1]/div/div[2]/div[2]/div/ul/li')
f = open('test', 'w', encoding ='utf-8')
for l in ul_list:
desc = l.xpath('./div/div[1]/a/h2/text()')[0]
print(desc + '\n')
f.write(str(desc) + '\n')
f.close()
if __name__ == '__main__':
get_page()
4. Small circle of network security
GitHub - BLACKxZONE/Treasure_knowledgehttps://github.com/BLACKxZONE/Treasure_knowledge