Foreword
Recently I came across a write-up that crawled and analyzed someone's Taobao shopping records, and I wanted to do a quick analysis of my own. Since I mostly shop on Pinduoduo ("Pinxixi"), I crawled my Pinduoduo order history and ran a brief analysis on it.
1. Shopping record crawling
import requests
import json
import csv
import time

header = {
    "AccessToken": "",
    "Content-Type": "application/json;charset=UTF-8",
    "Cookie": "",
    "Host": "omsproduction.yangkeduo.com",
    "Origin": "https://omsproduction.yangkeduo.com",
    "Referer": "https://omsproduction.yangkeduo.com/orders.html?type=0&comment_tab=1&combine_orders=1&main_orders=1&refer_page_name=personal&refer_page_id=10001_1681214651841_ptc6qkb8vb&refer_page_sn=10001",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36"
}

def get_api(offset):
    url = 'https://omsproduction.yangkeduo.com/proxy/api/api/aristotle/order_list_v4?pdduid=7955997481811'
    data = {
        "type": "all",
        "page": 1,
        "origin_host_name": "omsproduction.yangkeduo.com",
        # "page_from": p+1,
        # "anti_content": "0aqAfqnUgcKgjgEVARuwr1wgsALtDl_s5ze_pPYIw1cNL4nputy21tL4uQ-f1F-ehbb-eZUf2Iac6v7Invh4Hzh4UtcdBS2wmv7TEThLIkKzG4PcBf2fTc2p7qC_MU7VMluN8Nl4TgCk-onXYpAPXgjho4z2lQysWZe20vIgK3TEfhkCjXd8t9zb0as6FGJbwPsrFfQbsXtYs9DN9dswIUchg82Q7P7WI69foUf5cGCgv-0j8-aT6L0NY_b_7L6tY_EfkogeUnBZn8B52sv43VIhCjoFms7qJlbz9eDFCiacTaAztb_L8daDeU5J1p2wJIs84h0uWKedk89lMwGEIP4Dc0oZnUt-tGWgbOn8_gAvN9UododGK60O6HevEdeyiwYf8a8Hley9qWYCLkXnzcfycfeFXx5Lq2zDuVPPZnSYeZImOEOLm5n6IF5szF-a1TBJsY00mCk_oaIjKgIqu1WeL9dXVtLFgEfpvtwzLi5HgFtPa2bk7ZoW77WmFn7ObljmLf0s65zL70EzAgbfrCblRmdVgETIDD67R8O5dCFhz_mSZOlVW4aH4Z_5DAqpgcXj1tbUBQ4dEe5HAxffcEpg3qcpPAhtqZB2H1FyNErwtc0EsILhLtVb0sBUDlPLd_0zuYtA0XBv1BnEYUH8Y5-Sf__0QSJ_rdFHooQxozP1G6RKOD6GNwxPOaGPup7HNpOASm-Anb7XGHNMAQMnUcxQEIvk4tkddQRZJa-PQ_s7WkZObZBVrapJQiXM8AISRnQrxWDAGLcp7_Ll2W6YwrqeHX2ywLl6Qv8Z-G2KxuEJhVPhiJYx0yb15cA6jaRFek4wc5p-AKniC3I_KRV3FzQJ8uB9jsmaGFCTvvLo7Glr3s_mZI4vsRQG3t_swqOyMYLCwMtCazE4hbXqCM",
        "size": 10,
        "offset": offset
    }
    resp = requests.post(url, json=data, headers=header)
    time.sleep(1)  # throttle requests a little
    return resp

def deal_text(resp):
    goods = json.loads(resp.text)['orders']
    datas = []
    for i in goods:
        good = i['order_goods'][0]['goods_name']
        price = i['order_goods'][0]['goods_price'] / 100  # price comes back in cents
        order_time = i['order_time']
        order_status_prompt = i['order_status_prompt']
        if order_status_prompt == '交易成功':  # keep completed ("transaction successful") orders only
            datas.append([good, price, order_time])
    save_data(datas)
    offset = goods[-2]['offset']  # the second-to-last record carries the next page's offset
    return offset

def save_data(datas):
    with open('购物记录.csv', 'a', encoding='utf-8', newline='') as f:
        csv.writer(f).writerows(datas)

def main():
    offset = 'MO-01-230410-202674083412899'
    while True:
        try:
            resp = get_api(offset)
            print(f'{offset} done')
            offset = deal_text(resp)
        except (KeyError, IndexError):  # no more pages: response has no usable offset
            break

if __name__ == '__main__':
    main()
The cookie and AccessToken can be obtained by logging in to the web version and capturing the requests; the rest of the logic can be read straight from the code. As I understand it, `offset` identifies an order by date: each request returns at most 10 orders, and the `offset` of the second-to-last order in the response is used as the offset for the next request. You can dig into it yourself; the crawling part is not too difficult.
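The offset-chaining pattern described above can be sketched without touching the real API. This is a minimal mock, assuming the response shape from the code (`fetch_page`, `FAKE_PAGES`, and the item names are stand-ins I made up; the real call is `requests.post`):

```python
# Sketch of cursor/offset pagination: each response carries the offset that
# drives the next request, taken from the second-to-last order on the page.
FAKE_PAGES = {
    'off-0': {'orders': [
        {'goods_name': 'item1', 'offset': 'off-1'},
        {'goods_name': 'item2', 'offset': 'off-1'},
        {'goods_name': 'item3', 'offset': 'off-1'},
    ]},
    'off-1': {'orders': [
        {'goods_name': 'item4', 'offset': 'off-2'},
        {'goods_name': 'item5', 'offset': 'off-2'},
    ]},
}

def fetch_page(offset):
    # stand-in for requests.post(...).json(); returns a canned page
    return FAKE_PAGES[offset]

def crawl(start_offset):
    seen, offset = [], start_offset
    while True:
        try:
            orders = fetch_page(offset)['orders']
        except KeyError:  # no such page: we ran off the end
            break
        seen += [o['goods_name'] for o in orders]
        offset = orders[-2]['offset']  # second-to-last order carries the next cursor
    return seen
```

Running `crawl('off-0')` walks both pages and stops when the next offset resolves to nothing, which mirrors how the crawler's `while`/`except` loop terminates.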
2. Data Analysis
- Data loading
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
#%%
# the crawler wrote the CSV without a header row, so name the columns here
df = pd.read_csv('购物记录.csv', names=['good', 'price', 'date'])
1. Data cleaning
- Since the returned date is a timestamp, we convert the timestamp to a date format
import time
def to_date(t):
    t = time.localtime(t)
    t = time.strftime('%Y-%m-%d', t)
    return pd.to_datetime(t)
df['date'] = df.date.apply(to_date)
df
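As an aside, pandas can do this conversion vectorized, without a Python-level `apply`: `pd.to_datetime(..., unit='s')` interprets the integers as Unix epoch seconds (assuming that is what the API returns), and `.dt.normalize()` drops the time of day. Note it interprets timestamps as UTC, whereas `time.localtime` uses the local timezone, so orders placed near midnight may land on a different date:

```python
import pandas as pd

# two sample Unix timestamps, exactly one day apart
ts = pd.Series([1681214651, 1681301051])
dates = pd.to_datetime(ts, unit='s').dt.normalize()  # midnight of each UTC date
```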
2. Deduplication
- Handle duplicate data. Out of shopping habit, some orders repeat (the same item bought again at the same price), so rows that duplicate both name and price are dropped, keeping only the first occurrence.
df.drop_duplicates(subset=['good','price'],keep='first',inplace=True)
df
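One thing to be aware of: `drop_duplicates` on `['good', 'price']` removes all later repeats of a name/price pair, not just adjacent ones, so an item genuinely re-bought months later at the same price is dropped too. A toy example (with made-up rows):

```python
import pandas as pd

df = pd.DataFrame({
    'good':  ['rice', 'rice', 'soap', 'rice'],
    'price': [25.0,   25.0,   3.5,    25.0],
    'date':  ['2023-01-01', '2023-01-01', '2023-02-01', '2023-06-01'],
})
# keep='first' keeps only the January rice row; the June repurchase is gone too
deduped = df.drop_duplicates(subset=['good', 'price'], keep='first')
```

If you wanted to drop only back-to-back duplicates, you would instead compare each row with `shift()` and filter.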
3. Data situation
df.describe()
Unexpectedly, there is even a 0.01-yuan order (typical of me). 75% of orders are under 40 yuan; the mean is over 90, but the standard deviation is large because a few extreme values pull it up.
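The mean-vs-quantile gap described above is easy to reproduce: a single large outlier drags the mean far above the 75th percentile and inflates the standard deviation. A toy illustration with made-up prices:

```python
import pandas as pd

# mostly small orders plus one 2000-yuan outlier
prices = pd.Series([0.01, 5, 12, 20, 35, 38, 40, 2000])
stats = prices.describe()
mean = stats['mean']   # pulled way up by the outlier
q75 = stats['75%']     # stays down with the bulk of the data
```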
4. Visual display
- Shopping Range Pie Chart
import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['Microsoft YaHei']
matplotlib.rcParams['axes.unicode_minus'] = False
expense_bins = [0, 30, 100, 500, 1000, 5000, 10000]
expense_count = pd.cut(df['price'], expense_bins).value_counts()
expense_count
plt.pie(expense_count, labels=expense_count.keys())
plt.legend()
plt.title('购物区间')
plt.show()
The price bins are chosen according to my consumption habits, and the number of orders in each bin is counted. Large-amount purchases are rare; most of my shopping is small gadgets and ordinary daily necessities, and combined with the word cloud below, a good share of it is food.
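A small standalone example of how `pd.cut` assigns prices to these bins: intervals are open on the left and closed on the right, so a 30-yuan order falls in (0, 30], not (30, 100] (the prices here are made up):

```python
import pandas as pd

bins = [0, 30, 100, 500, 1000, 5000, 10000]
prices = pd.Series([0.01, 15, 30, 99, 450, 4999])
# value_counts gives the number of orders per interval
counts = pd.cut(prices, bins).value_counts()
# (0, 30] holds three orders: 0.01, 15 and the boundary value 30
```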
- Word cloud
Word segmentation:
import jieba
from collections import Counter
from wordcloud import WordCloud
import numpy as np
import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['Microsoft YaHei']
matplotlib.rcParams['axes.unicode_minus'] = False
df['good'] = df['good'].str.replace(r'[^\w\s]+', '', regex=True)  # strip punctuation and special characters
words = []
words_list = [jieba.lcut(i.strip()) for i in df['good'].tolist()]
for i in words_list:
    words += i
count = Counter(words)
count = dict(count.most_common(n=100))  # keep the 100 most frequent words
Special characters are stripped first, then jieba segments the product names and collections.Counter tallies the word frequencies; the top 100 words are then visualized:
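The counting step itself can be checked in isolation with plain stdlib. Given an already-segmented token list (hand-written here, since jieba's real output depends on its dictionary), `Counter.most_common(n)` returns the top-n (word, count) pairs:

```python
from collections import Counter

# pretend jieba already split the product titles into these tokens
words = ['大米', '大米', '学生', '大米', '男', '学生']
count = dict(Counter(words).most_common(2))
# count == {'大米': 3, '学生': 2}; '男' is cut off by the top-2 limit
```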
word = WordCloud(font_path='simhei.ttf')  # a font that supports Chinese characters
word.generate_from_frequencies(count)
plt.imshow(word,interpolation='bilinear')
plt.axis('off')
plt.title('购物消费类别词云图')
plt.show()
"Rice", "student" and "men's" show up most often; apparently I am still a student who loves cooking.
- Bar chart
# group by month and sum each month's spend
monthly_expense = df.groupby(df['date'].dt.strftime('%Y-%m'))['price'].sum()
monthly_expense = pd.DataFrame(monthly_expense)
monthly_expense['price'] = monthly_expense['price'].astype('int')
plt.figure(figsize=(10, 6))
plt.xticks(fontsize=6)
plt.plot(monthly_expense.index, monthly_expense.price, 'r--')
plt.bar(monthly_expense.index, monthly_expense.price)
# add a data label above each bar
for i, a in enumerate(monthly_expense.index):
    plt.text(a, monthly_expense.price.iloc[i],
             '{}'.format(monthly_expense.price.iloc[i]),
             ha='center',
             va='bottom',
             )
plt.show()
There was no spending in April, and this year's shopping totals are fairly low. Overall I don't shop much, with the occasional large purchase.
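The monthly grouping can be reproduced on a toy frame: grouping on `dt.strftime('%Y-%m')` turns each date into its "YYYY-MM" label before summing, which is why the x-axis shows month strings (the rows below are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'date':  pd.to_datetime(['2023-01-05', '2023-01-20', '2023-02-03']),
    'price': [10.0, 15.0, 7.5],
})
# both January orders collapse into a single '2023-01' bucket
monthly = df.groupby(df['date'].dt.strftime('%Y-%m'))['price'].sum()
```

One caveat of this approach: months with no orders simply don't appear; resampling with `df.set_index('date')['price'].resample('ME').sum()` would keep them as zeros instead.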
Summary
Since the returned data only carries price, date, and product name, I haven't over-analyzed it; take it as far as you like.
When I have time I will also share a visual analysis of my Taobao shopping records.
I hope you find this useful; I will share more interesting things in the future.