Python crawls Jingdong commodity reviews

Find the real interface of the data

Open the Jingdong product website to view product reviews. We clicked on the comment to turn the page, and found that the URL did not change, indicating that the webpage is a dynamic webpage.

API name: item_review - get JD product reviews

public parameters

Get API test key&secret

name	type	must	describe
key	String	yes	Call key (must be spliced in the URL in GET mode)
secret	String	yes	call key
api_name	String	yes	API interface name (included in the request address) [item_search, item_get, item_search_shop, etc.]
cache	String	no	[yes, no] The default is yes, the cached data will be called, and the speed is relatively fast
result_type	String	no	[json,jsonu,xml,serialize,var_export] returns the data format, the default is json, and the content output by jsonu can be read directly in Chinese
lang	String	no	[cn,en,ru] translation language, default cn Simplified Chinese
version	String	no	API version

request parameters

Request parameter: num_iid=71619129750&page=1

Parameter description: item_id: product ID
page: number of pages

response parameters

Version: Date:

name	type	example value	describe
items	items[]		Get JD Product Reviews
rate_content	String	The style of this canvas shoe is quite good, it is also very versatile to wear, and the workmanship is very fine. !	comments
rate_date	Date	2020-07-16 17:04:45	comment date
pics	MIX	["//img30.360buyimg.com/n0/s128x96_jfs/t1/143538/26/2997/98915/5f10182dE075cf6f4/3893a6ebd54bf20b.jpg"]	comment pictures
display_user_nick	String	j***X	Buyer Nickname
auction_sku	String	Color: White (velvet); Size: 2XL	Review Product Attributes
add_feedback	String	The fabric of the clothes is very good and it is very comfortable to wear. The clothes fit well!	Review content

Through the loop, crawl the comment data of all pages

The key to page-turning crawling is to find the "page-turning" rule of the real address. We clicked on page 1, page 2, and page 3 respectively, and found that the different page numbers were the same except that the page parameters were inconsistent. The "page" of page 1 is 1, the "page" of page 2 is 2, the "page" of page 2 is 2, and so on. We nest a For loop and store the data through pandas. Run the code to automatically crawl the comment information of other pages and store it in the t.xlsx file. All codes are as follows:

import requests
import pandas as pd
items=[]
for i in range(1,20):
    header = {'User-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36 SLBrowser/8.0.1.4031 SLBChan/105'}
    url=f'https://api.m.jd.com/?appid=item-v3&functionId=pc_club_productPageComments&client=pc&clientVersion=1.0.0&t=1684832645932&loginType=3&uuid=122270672.2081861737.1683857907.1684829964.1684832583.3&productId=100009464799&score=0&sortType=5&page={i}&pageSize=10&isShadowSku=0&rid=0&fold=1&bbtf=1&shield='
    response= requests.get(url=url,headers=header)
    json=response.json()
    data=json['comments']
    for t in data:
        content =t['content']
        time    =t['creationTime']
        item=[content,time]
        items.append(item)
df = pd.DataFrame(items,columns=['评论内容','发布时间'])
df.to_excel(r'C:\Users\蓝胖子\Desktop\t.xlsx',encoding='utf_8_sig')

Finally, the crawled data results are as follows: