【Data Analysis】Personal Shopping Record Collection and Visual Analysis


Foreword

Recently I came across some write-ups crawling and analyzing Taobao shopping records, and I wanted to do a quick analysis of my own. Since I mostly shop on Pinxixi (Pinduoduo), I crawled my shopping records there and ran a brief analysis.

1. Shopping record crawling

import json
import csv
import time

import requests

# Request headers: fill in AccessToken and Cookie captured from the web version
header = {
    "AccessToken": "",
    "Content-Type": "application/json;charset=UTF-8",
    "Cookie": "",
    "Host": "omsproduction.yangkeduo.com",
    "Origin": "https://omsproduction.yangkeduo.com",
    "Referer": "https://omsproduction.yangkeduo.com/orders.html?type=0&comment_tab=1&combine_orders=1&main_orders=1&refer_page_name=personal&refer_page_id=10001_1681214651841_ptc6qkb8vb&refer_page_sn=10001",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36"
}




def get_api(offset):
    """Request one page (10 orders) of the order list, starting from offset."""
    url = 'https://omsproduction.yangkeduo.com/proxy/api/api/aristotle/order_list_v4?pdduid=7955997481811'
    data = {
  "type": "all",
  "page": 1,
  "origin_host_name": "omsproduction.yangkeduo.com",
  # "page_from": p+1,
  # "anti_content": "0aqAfqnUgcKgjgEVARuwr1wgsALtDl_s5ze_pPYIw1cNL4nputy21tL4uQ-f1F-ehbb-eZUf2Iac6v7Invh4Hzh4UtcdBS2wmv7TEThLIkKzG4PcBf2fTc2p7qC_MU7VMluN8Nl4TgCk-onXYpAPXgjho4z2lQysWZe20vIgK3TEfhkCjXd8t9zb0as6FGJbwPsrFfQbsXtYs9DN9dswIUchg82Q7P7WI69foUf5cGCgv-0j8-aT6L0NY_b_7L6tY_EfkogeUnBZn8B52sv43VIhCjoFms7qJlbz9eDFCiacTaAztb_L8daDeU5J1p2wJIs84h0uWKedk89lMwGEIP4Dc0oZnUt-tGWgbOn8_gAvN9UododGK60O6HevEdeyiwYf8a8Hley9qWYCLkXnzcfycfeFXx5Lq2zDuVPPZnSYeZImOEOLm5n6IF5szF-a1TBJsY00mCk_oaIjKgIqu1WeL9dXVtLFgEfpvtwzLi5HgFtPa2bk7ZoW77WmFn7ObljmLf0s65zL70EzAgbfrCblRmdVgETIDD67R8O5dCFhz_mSZOlVW4aH4Z_5DAqpgcXj1tbUBQ4dEe5HAxffcEpg3qcpPAhtqZB2H1FyNErwtc0EsILhLtVb0sBUDlPLd_0zuYtA0XBv1BnEYUH8Y5-Sf__0QSJ_rdFHooQxozP1G6RKOD6GNwxPOaGPup7HNpOASm-Anb7XGHNMAQMnUcxQEIvk4tkddQRZJa-PQ_s7WkZObZBVrapJQiXM8AISRnQrxWDAGLcp7_Ll2W6YwrqeHX2ywLl6Qv8Z-G2KxuEJhVPhiJYx0yb15cA6jaRFek4wc5p-AKniC3I_KRV3FzQJ8uB9jsmaGFCTvvLo7Glr3s_mZI4vsRQG3t_swqOyMYLCwMtCazE4hbXqCM",
  "size": 10,
  "offset":offset
    }



    resp = requests.post(url, json=data, headers=header)
    time.sleep(1)  # be polite: pause between requests

    return resp

def deal_text(resp):
    """Parse one page of orders, save the completed ones, and return the next offset."""
    goods = json.loads(resp.text)['orders']
    datas = []
    for i in goods:
        good = i['order_goods'][0]['goods_name']
        price = i['order_goods'][0]['goods_price'] / 100  # the API returns prices in cents
        order_time = i['order_time']
        order_status_prompt = i['order_status_prompt']
        if order_status_prompt == '交易成功':  # keep only completed ("transaction successful") orders
            datas.append([good, price, order_time])

    get_save(datas)
    # The penultimate order's offset becomes the next request's offset
    offset = goods[-2]['offset']
    return offset


def get_save(datas):
    with open('购物记录.csv', 'a', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(datas)


def main():
    offset = 'MO-01-230410-202674083412899'  # offset of the order to start from
    while 1:
        try:
            resp = get_api(offset)
            print(f'{offset} done')
            offset = deal_text(resp)
        except Exception:
            # Stop once there are no more orders or a request fails
            break


if __name__ == '__main__':
    main()

For the cookie and token, just log in to the web version and capture the request packets. The rest of the logic is in the code. As for `offset`: as far as I can tell, it identifies each order by date. Each request returns at most 10 orders, and the `offset` of the penultimate order on the page is used as the offset for the next request. You can dig into it yourself; the crawling is not too difficult.
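To make the pagination concrete, here is a sketch of the response shape implied by the parsing code above. This is an inference, and every value is made up for illustration:

# Inferred response structure (assumption, based on deal_text); values are illustrative
example_response = {
    "orders": [
        {
            "order_goods": [{"goods_name": "某商品", "goods_price": 1990}],  # price in cents
            "order_time": 1681214651,           # Unix timestamp in seconds
            "order_status_prompt": "交易成功",
            "offset": "MO-01-230410-...",       # chained into the next request's offset
        },
        # ... up to 10 orders per page
    ]
}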

2. Data Analysis

  • Data reading
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
#%%
# The crawler wrote the CSV without a header row, so name the columns here
df = pd.read_csv('购物记录.csv', names=['good', 'price', 'date'])

1. Data cleaning

  • Since the returned date is a Unix timestamp, we convert it to a date format:
import time

def to_date(t):
    t=time.localtime(t)
    t=time.strftime('%Y-%m-%d',t)
    t=pd.to_datetime(t)
    return t

df['date']=df.date.apply(to_date)
df
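As an aside, pandas can do this conversion in one vectorized call; a sketch, assuming the timestamps are in seconds (note this yields UTC-based dates, while time.localtime uses local time):

# Vectorized alternative (assumption: timestamps are in seconds)
df['date'] = pd.to_datetime(df['date'], unit='s').dt.normalize()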

2. Deduplication

  • Handle duplicate data. Because of my shopping habits, some orders are duplicated. Note that drop_duplicates removes every later row with the same name and price, not only adjacent repeats; a sketch that drops only adjacent duplicates follows the code below.
df.drop_duplicates(subset=['good','price'],keep='first',inplace=True)
df
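If you really only want to drop a row when it repeats the immediately preceding order, a minimal sketch using shift() (my addition, not in the original):

# Drop a row only when both name and price match the row directly above it
adjacent_dup = df['good'].eq(df['good'].shift()) & df['price'].eq(df['price'].shift())
df = df[~adjacent_dup]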

3. Data overview

df.describe()

[Image: df.describe() summary statistics]

Unexpectedly there is even a 0.01 yuan order, which is just like me. 75% of orders are under 40 yuan; the mean is over 90 yuan, but the std is large because it is pulled up by a few extreme values.
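To see which orders are dragging the mean up, you can peek at the most expensive ones; a quick sketch:

# The priciest orders, which inflate the mean relative to the median
print(df.nlargest(5, 'price'))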

4. Visualization

  • Price-range pie chart
matplotlib.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # render Chinese labels
matplotlib.rcParams['axes.unicode_minus'] = False

# Bucket prices into intervals matching my spending habits, then count orders per bucket
expense = [0, 30, 100, 500, 1000, 5000, 10000]
expense_count = pd.cut(df['price'], expense).value_counts()
expense_count
plt.pie(expense_count, labels=expense_count.index)
plt.legend()
plt.title('购物区间')
plt.show()

The prices are bucketed into these intervals based on my consumption habits, and the number of orders in each interval is counted.
[Image: pie chart of order counts per price range]
There are relatively few high-priced orders; most purchases are gadgets and ordinary daily necessities. Combined with the word cloud below, a good share of the shopping goes to food.
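The pie chart counts orders; to also see the amount spent per price range, a short sketch reusing the same intervals (my addition):

# Total yuan spent in each price bucket
expense_sum = df.groupby(pd.cut(df['price'], expense))['price'].sum()
print(expense_sum)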

  • Word cloud

Word segmentation:

import jieba
from collections import Counter
from wordcloud import WordCloud

# Strip punctuation and other special characters from the product names
df['good'] = df['good'].str.replace(r'[^\w\s]+', '', regex=True)

# Segment every product name into words with jieba
words = []
words_list = [jieba.lcut(i.strip()) for i in df['good'].tolist()]
for i in words_list:
    words += i

# Count word frequencies and keep the 100 most common
count = dict(Counter(words).most_common(100))

Strip out the special characters, segment the product names with jieba, and count word frequencies with the collections module. Before visualizing, it can help to drop noise tokens, as sketched below.
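A minimal cleanup sketch, assuming single-character tokens are mostly noise (this filter is my addition, not part of the original):

# Assumption: single-character tokens are mostly particles or units and add
# little to the word cloud, so drop them before recounting
words = [w for w in words if len(w) > 1]
count = dict(Counter(words).most_common(100))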

word = WordCloud(font_path='simhei.ttf')  # a font file that can render Chinese
word.generate_from_frequencies(count)
plt.imshow(word, interpolation='bilinear')
plt.axis('off')
plt.title('购物消费类别词云图')
plt.show()

"Rice", "student", and "male" appear most often; it seems I am still a student who loves cooking.

  • Bar chart
# Group by month and sum each month's spending
monthly_expense = df.groupby(df['date'].dt.strftime('%Y-%m'))['price'].sum()
monthly_expense = pd.DataFrame(monthly_expense)
monthly_expense['price'] = monthly_expense['price'].astype('int')

plt.figure(figsize=(10, 6))
plt.xticks(fontsize=6)
plt.plot(monthly_expense.index, monthly_expense.price, 'r--')
plt.bar(monthly_expense.index, monthly_expense.price)

# Put a data label above each bar
for i, a in enumerate(monthly_expense.index):
    plt.text(a, monthly_expense.price.iloc[i],
             '{}'.format(monthly_expense.price.iloc[i]),
             ha='center',
             va='bottom',
             )

plt.show()
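On matplotlib 3.4 or newer, the labelling loop can be replaced with bar_label; a sketch:

# Equivalent labelling with matplotlib 3.4+ (replaces the plt.text loop above)
bars = plt.bar(monthly_expense.index, monthly_expense.price)
plt.bar_label(bars)
plt.show()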

[Image: bar chart of monthly spending with a trend line]

There was no spending in April, and this year's shopping spending is relatively low. Overall I don't shop much, with the occasional big purchase.

Summary

Since the returned data only carries the price, date, and name fields, I didn't over-analyze it; you can take it further yourself.
When I have time, I will share a visual analysis of my Taobao shopping records.
I hope you will keep supporting me; I'll share more interesting things in the future.
