Using Python to fetch Baidu hot search data and save it to Excel

1. Goal and preparation

1. Target: this tutorial fetches the Baidu real-time hot search board.

2. Preparation

  • Environment: Python 3.x
  • requests
  • pandas

       requests and pandas are the two libraries this tutorial needs. requests sends the HTTP request, and pandas handles the data processing (saving the results as an Excel file). Writing .xlsx files with pandas also requires an Excel engine such as openpyxl to be installed.

  • Open the hot search page in the Chrome browser and press F12 to open the developer tools. Switch to the Network tab and filter by XHR. Find the XHR request that returns the hot search data; its URL is the data interface.
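Before coding, it can help to confirm that the libraries from the preparation list are importable. The following is a minimal sketch using only the standard library. The helper name missing_packages is made up for this example, and including openpyxl in the list is an assumption (it is one common engine pandas can use for .xlsx files, not something this tutorial names).

```python
import importlib.util

# Import names to check. "openpyxl" is an assumption: it is one common
# engine pandas can use to write .xlsx files.
REQUIRED = ["requests", "pandas", "openpyxl"]


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All required packages are available.")
```

If anything is reported as missing, installing it with pip before continuing should be enough.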

2. Start coding

  1. Import the required libraries
import requests
import pandas as pd
  2. Build the request headers:
browse_header = {
    "Accept": "application/json, text/plain, */*",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    "Host": "top.baidu.com",
    "Referer": "https://top.baidu.com/board",
}
  3. Define the request URL, that is, the address of the data interface
url = "https://top.baidu.com/api/board?platform=wise&tab=realtime"
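To see what the interface address is made of, the URL can be split into its host, path and query parameters. This is only an illustrative sketch using the standard library urllib.parse module. The helper build_url is a hypothetical name introduced here, and no parameter values other than the ones in the tutorial's URL are assumed to exist.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

url = "https://top.baidu.com/api/board?platform=wise&tab=realtime"

parts = urlsplit(url)
params = parse_qs(parts.query)  # each value is a list of strings

print(parts.netloc)  # host part of the URL
print(parts.path)    # path of the interface
print(params)        # query parameters as a dict


def build_url(base, query):
    """Rebuild a URL from a base URL (without query string) and a dict of parameters."""
    scheme, netloc, path, _, fragment = urlsplit(base)
    return urlunsplit((scheme, netloc, path, urlencode(query), fragment))


rebuilt = build_url(
    "https://top.baidu.com/api/board",
    {"platform": "wise", "tab": "realtime"},
)
print(rebuilt == url)
```

requests.get also accepts a params argument, so the query parameters could be passed as a dict instead of being written into the URL by hand.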
  4. Send the request. The interface returns JSON, so the response is parsed into a Python dict in the same step. A timeout keeps the call from hanging if the server does not answer.
json = requests.get(url, headers=browse_header, timeout=10).json()
  5. Note: Baidu has two kinds of hot searches, the pinned (top) hot search and the ordinary hot searches, so they have to be read separately.
# Pinned hot searches
top_content_list = json['data']['cards'][0]['topContent']
# Ordinary hot searches
content_list = json['data']['cards'][0]['content']
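The direct indexing above raises a KeyError or IndexError if the response ever lacks one of these keys. As a hedged sketch, the lookup could be wrapped in a small helper that falls back to empty lists. The function name extract_lists and the sample data are hypothetical; only the key layout (data, cards, topContent, content) comes from this tutorial.

```python
def extract_lists(payload):
    """Return (top_content_list, content_list) from a parsed response.

    Assumes the layout used in this tutorial: payload['data']['cards'][0]
    holds 'topContent' and 'content'. Missing pieces become empty lists.
    """
    cards = (payload.get("data") or {}).get("cards") or []
    if not cards:
        return [], []
    first_card = cards[0]
    return first_card.get("topContent") or [], first_card.get("content") or []


# Hypothetical sample with the same shape as the response, for illustration only.
sample = {
    "data": {
        "cards": [
            {
                "topContent": [{"word": "pinned item"}],
                "content": [{"word": "item 1"}, {"word": "item 2"}],
            }
        ]
    }
}

top_content_list, content_list = extract_lists(sample)
print(len(top_content_list), len(content_list))
```

With the real response, the call would be extract_lists(json), where json is the parsed result from the previous step.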
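  6. Parse the two lists separately and collect the fields (title, rank, hot search index, description, link address) into title_list, order_list, score_list, desc_list and url_list. The loops that fill these lists are shown in the complete code further down. Then assemble the DataFrame and save it.
df = pd.DataFrame(  # assemble the fetched data into a DataFrame
    {
        '热搜标题': title_list,  # hot search title
        '热搜排名': order_list,  # hot search rank
        '热搜指数': score_list,  # hot search index
        '描述': desc_list,  # description
        '链接地址': url_list,  # link address
    }
)
df.to_excel('百度热搜榜.xlsx', index=False)  # save the result as an Excel file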
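As an alternative to keeping five parallel lists, each hot search entry could be turned into one row dict. This is a sketch rather than the tutorial's own method: the function names to_row and build_rows and the sample entries are made up for illustration, while the field names (word, hotScore, desc, url) and the column names follow the tutorial's code.

```python
def to_row(item, rank):
    """Map one hot search entry to a row using the tutorial's column names."""
    return {
        "热搜标题": item.get("word"),      # hot search title
        "热搜排名": rank,                  # hot search rank
        "热搜指数": item.get("hotScore"),  # hot search index
        "描述": item.get("desc"),          # description
        "链接地址": item.get("url"),       # link address
    }


def build_rows(top_content_list, content_list):
    """Pinned entries get the rank "置顶" (pinned); ordinary entries are ranked from 1."""
    rows = [to_row(item, "置顶") for item in top_content_list]
    rows.extend(
        to_row(item, rank) for rank, item in enumerate(content_list, start=1)
    )
    return rows


# Hypothetical entries, for illustration only.
sample_top = [{"word": "pinned item", "hotScore": "100", "desc": "d", "url": "u"}]
sample_content = [{"word": "item 1"}, {"word": "item 2"}]

rows = build_rows(sample_top, sample_content)
for row in rows:
    print(row["热搜排名"], row["热搜标题"])
```

A list of dicts like this can be passed straight to pd.DataFrame(rows), which should produce the same columns as the DataFrame built from the separate lists.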

Complete code:

import requests
import pandas as pd

browse_header = {
    "Accept": "application/json, text/plain, */*",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    "Host": "top.baidu.com",
    "Referer": "https://top.baidu.com/board",
}
url = "https://top.baidu.com/api/board?platform=wise&tab=realtime"
json = requests.get(url, headers=browse_header, timeout=10).json()  # timeout avoids hanging forever

# Pinned hot searches
top_content_list = json['data']['cards'][0]['topContent']
print(top_content_list)
# Ordinary hot searches
content_list = json['data']['cards'][0]['content']
print(content_list)
title_list = []
order_list = []
score_list = []
desc_list = []
url_list = []
for top_item in top_content_list:
    title_list.append(top_item.get('word'))
    order_list.append("置顶")  # "置顶" means "pinned"; pinned entries have no numeric rank
    score_list.append(top_item.get("hotScore"))
    desc_list.append(top_item.get("desc"))
    url_list.append(top_item.get("url"))
for index, content in enumerate(content_list, start=1):  # ranks start at 1
    title_list.append(content.get('word'))
    order_list.append(index)
    score_list.append(content.get("hotScore"))
    desc_list.append(content.get("desc"))
    url_list.append(content.get("url"))
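df = pd.DataFrame({
    '热搜标题': title_list,  # hot search title
    '热搜排名': order_list,  # hot search rank
    '热搜指数': score_list,  # hot search index
    '描述': desc_list,  # description
    '链接地址': url_list,  # link address
})
df.to_excel('百度热搜榜.xlsx', index=False)  # save the result as an Excel file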

Finally, check the saved data: there are 31 rows in total (1 pinned hot search + 30 ordinary hot searches).

3. Summary

That covers the whole process of fetching the data. If there is other data you would like to fetch with Python, please leave a message in the comment area.

Origin blog.csdn.net/qq_43762932/article/details/131213248