Fetching Zhihu hot-list data with Python and saving it to Excel

1. Target and preparation

1. Target: the Zhihu hot list (https://www.zhihu.com/hot), fetched through the JSON interface that the page itself calls.

2. Preparation

  • Python 3.x environment
  • requests
  • pandas

    requests and pandas are the two libraries this tutorial needs: requests sends the HTTP request, and pandas processes the data and saves the result as an Excel file. Writing .xlsx files with pandas also requires an Excel engine such as openpyxl.

  • Open the hot-list page in Chrome and press F12 to open the developer tools. Select the Network tab, filter by XHR, and reload the page. The XHR request that returns the hot-list entries is the data interface used below.

2. Start coding

  1. Import dependent libraries
import requests
import pandas as pd
  2. Build the request headers (replace the Cookie value with the one from your own browser session):
browse_header = {
    "Accept": "application/json, text/plain, */*",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    "Host": "www.zhihu.com",
    "Referer": "https://www.zhihu.com/hot",
    "Cookie": "_xsrf=Pd0NpG6J8kZdHtzBVnNyQP1g0rO7NKeg; _zap=d7f27b9f-4fe3-4ef4-9376-df278af16940;"
}
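Hard-coding a cookie in the script makes it easy to leak or to forget when it expires. As a minimal sketch, the headers could instead be built by a small helper that reads the cookie from an environment variable. The helper name build_headers and the variable name ZHIHU_COOKIE are assumptions made up for this example and are not part of the original code. The Host header is left out because requests fills it in from the URL.

```python
import os


def build_headers(cookie=None):
    """Return the request headers, adding a Cookie only when one is available."""
    headers = {
        "Accept": "application/json, text/plain, */*",
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
        ),
        "Referer": "https://www.zhihu.com/hot",
    }
    # ZHIHU_COOKIE is a hypothetical environment variable name for this sketch.
    cookie = cookie or os.environ.get("ZHIHU_COOKIE")
    if cookie:
        headers["Cookie"] = cookie
    return headers
```

With this in place, browse_header = build_headers() can stand in for the literal dictionary.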
  3. Define the request URL, that is, the address of the data interface:
url = "https://www.zhihu.com/api/v3/feed/topstory/hot-lists/total?limit=50&desktop=true"
  4. Send the request. The interface returns JSON, so the response body is decoded into a Python dictionary in the same step:
res = requests.get(url, headers=browse_header).json()
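The one-liner above assumes the request always succeeds. A slightly more defensive variant, sketched here with a hypothetical helper named fetch_json, sets a timeout and raises on HTTP error codes before decoding the body. The optional session parameter is an addition for this sketch that makes the helper easy to exercise without touching the network.

```python
import requests


def fetch_json(url, headers, timeout=10, session=None):
    """GET `url` and return the decoded JSON body, failing loudly on errors."""
    http = session if session is not None else requests.Session()
    response = http.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.json()       # raises ValueError if the body is not JSON
```

In the tutorial's script this would be called as res = fetch_json(url, browse_header).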
  5. Extract the list of hot-list entries:
# Hot-list entries
content_list = res['data']
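  6. Walk through the entries, pull out the fields of each one (title, rank, heat, description, link), and assemble them into a DataFrame:
title_list, order_list, score_list, desc_list, url_list = [], [], [], [], []
for order, content in enumerate(content_list, start=1):
    target = content['target']
    order_list.append(order)  # rank is the 1-based position in the list
    title_list.append(target['title'])
    score_list.append(content['detail_text'])
    desc_list.append(target['excerpt'])
    url_list.append(target['url'].replace('api', 'www').replace('questions', 'question'))

df = pd.DataFrame(  # assemble the collected fields into a DataFrame
    {
        '热搜标题': title_list,  # title
        '热搜排名': order_list,  # rank
        '热搜热度': score_list,  # heat
        '描述': desc_list,  # description
        '链接地址': url_list  # link
    }
)
df.to_excel('知乎热搜榜.xlsx', index=False)  # save the result as an Excel file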
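Five parallel lists work, but they are easy to get out of step if one field is missing from an entry. One possible alternative is to build a single list of dictionaries, one per row, and fall back to an empty string when a key is absent. The function name build_rows and the sample entries below are invented for illustration and are not real Zhihu data. The sample only mirrors the keys the tutorial reads.

```python
def build_rows(content_list):
    """Turn the raw entries into one dict per spreadsheet row."""
    rows = []
    for order, content in enumerate(content_list, start=1):
        target = content.get("target", {})
        rows.append({
            "热搜标题": target.get("title", ""),          # title
            "热搜排名": order,                            # rank (1-based)
            "热搜热度": content.get("detail_text", ""),   # heat
            "描述": target.get("excerpt", ""),            # description
            "链接地址": target.get("url", ""),            # link
        })
    return rows


# Made-up sample shaped like the fields the tutorial reads.
sample = [
    {
        "detail_text": "100 万热度",
        "target": {
            "title": "Example question",
            "excerpt": "Short summary",
            "url": "https://api.zhihu.com/questions/1",
        },
    },
    {"target": {"title": "Entry without a heat value"}},
]
rows = build_rows(sample)
```

A list like rows can be passed directly to pd.DataFrame(rows), which uses the dictionary keys as column names.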

Note: The link returned by the interface is an API address rather than the address of the web page, so it is adjusted before being stored:
url_list.append(content['target']['url'].replace('api', 'www').replace('questions', 'question'))
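Because str.replace rewrites every occurrence of 'api' and 'questions' anywhere in the string, it can also alter a query string or a later path segment. A more targeted version, sketched below, touches only the host and the leading path segment. It assumes the returned links look like https://api.zhihu.com/questions/<id>, which is inferred from the two replacements above rather than confirmed, and the name to_web_url is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def to_web_url(api_url):
    """Convert an API-style question link into the web page link."""
    parts = urlsplit(api_url)
    # Assumption: API links use the host api.zhihu.com.
    host = "www.zhihu.com" if parts.netloc == "api.zhihu.com" else parts.netloc
    path = parts.path
    if path.startswith("/questions/"):
        path = "/question/" + path[len("/questions/"):]
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))
```

Inside the loop, url_list.append(to_web_url(content['target']['url'])) would then replace the chained replace calls.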

Full code:

import requests
import pandas as pd


browse_header = {
    "Accept": "application/json, text/plain, */*",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    "Host": "www.zhihu.com",
    "Referer": "https://www.zhihu.com/hot",
    "Cookie": "_xsrf=Pd0NpG6J8kZdHtzBVnNyQP1g0rO7NKeg; _zap=d7f27b9f-4fe3-4ef4-9376-df278af16940;"
}

url = "https://www.zhihu.com/api/v3/feed/topstory/hot-lists/total?limit=50&desktop=true"

res = requests.get(url, headers=browse_header).json()
# Hot-list entries
content_list = res['data']
title_list = []
order_list = []
score_list = []
desc_list = []
url_list = []
for order, content in enumerate(content_list, start=1):
    order_list.append(order)  # rank is the 1-based position in the list
    title_list.append(content['target']['title'])
    score_list.append(content['detail_text'])
    desc_list.append(content['target']['excerpt'])
    url_list.append(content['target']['url'].replace('api', 'www').replace('questions', 'question'))

df = pd.DataFrame({
    '热搜标题': title_list,
    '热搜排名': order_list,
    '热搜热度': score_list,
    '描述': desc_list,
    '链接地址': url_list
})
df.to_excel('知乎热搜榜.xlsx', index=False)  # save the result as an Excel file

Finally, open the generated Excel file to check the data. It contains 50 rows in total, matching limit=50 in the request URL.
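The heat column is stored as text, which makes it awkward to sort or chart in Excel. If the values look like "1234 万热度" (an assumption about the format of detail_text, not something the interface guarantees), they could be converted to plain integers with a small parser along these lines. The name parse_heat is hypothetical.

```python
import re

# Chinese magnitude units: 万 = ten thousand, 亿 = one hundred million.
_UNITS = {"万": 10_000, "亿": 100_000_000}


def parse_heat(detail_text):
    """Return the heat as an integer, or None if no number is found."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*([万亿]?)", detail_text)
    if match is None:
        return None
    number, unit = match.groups()
    return round(float(number) * _UNITS.get(unit, 1))
```

Applied to the list built in the loop, [parse_heat(text) for text in score_list] would give a numeric column that can be added to the DataFrame alongside the original text.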

3. Summary

That covers the whole process of fetching the hot-list data and saving it to Excel. If there is other data you would like to fetch with Python, please leave a message in the comment area.

Origin blog.csdn.net/qq_43762932/article/details/131249565