Don't know which mooncakes to buy for the Mid-Autumn Festival? Python crawls 4,000 mooncake listings to tell you the answer!

Foreword

With the Mid-Autumn Festival approaching, mooncake sales keep climbing: mooncakes in all kinds of flavors are on the shelves, and their prices vary widely. So which mooncakes should we buy? Today, let's use Python to crawl mooncake listings from Taobao and run a visual analysis on the crawled data, to see which mooncakes are the most popular and which stores sell the most~
Skip to the end of the article to get exclusive benefits for fans.

1. Core function design

In general, we first need to crawl the mooncake data from the website, clean the data, and finally visualize it for analysis and display. Breaking the requirements down, we roughly need the following steps:

  1. Obtain the mooncake data through a crawler, including mooncake name, sales volume, store, province and city, etc.
  2. Preprocess and clean the acquired data: unify the sales units and the province/city fields, and count the number of mooncake records left after cleaning.
  3. Visualize the cleaned data: the top 10 best-selling mooncakes, the top 10 best-selling mooncake stores, the distribution of mooncake price ranges, and the distribution of mooncake sales across provinces.

2. Implementation steps

1. Crawling data

In this article, we use the selenium module to crawl the mooncake data. To make sure the crawl runs smoothly, we first need to install selenium, which can be done with pip:

pip install selenium

If this is your first time installing and using selenium, you may well run into the following exception. How do we solve it?
The root cause is that selenium drives a real browser, and the installed driver version does not match the browser version. So we first need to check the current browser version: open Chrome and enter chrome://version/ in the address bar to see the version number.
With that version number, we can go to the ChromeDriver download page, find the matching version, and download the driver. After downloading and unzipping, you will see chromedriver.exe.
We need to copy it into the Python installation directory; selenium can then drive the browser normally.
Next, we analyze the web page to locate the mooncake name, store name, sales volume, price, and region. Inspecting the page, we find that:

  • Each mooncake's information is in an item node with class="item J_MouserOnverReq"
  • The mooncake name is in class="row row-2 title"
  • The price is in the strong tag
  • The number of buyers is in class="deal-cnt"
  • The store information is in class="shop"
  • The shipping area is in class="location"

Having analyzed the page structure above, we can start crawling the data we need and save everything we retrieve.

Get the product listing:

# author: Dragon少年
# Get the product listing and the total page count
def get_product(key_word):
    # Locate the search box and type the keyword
    browser.find_element_by_id("q").send_keys(key_word)
    # Locate the search button and click it
    browser.find_element_by_class_name('btn-search').click()
    browser.maximize_window()
    # Wait 20 seconds to allow a manual login
    time.sleep(20)
    # Locate the pager and read the "共100页" (100 pages in total) text
    page_info = browser.find_element_by_xpath('//div[@class="total"]').text
    # findall() returns a list; the first group is the page count
    page = re.findall(r"(\d+)", page_info)[0]
    return page
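As a quick standalone sanity check of the regex used above (the sample pager text is invented for illustration):

```python
import re

# The pager text on the search page reads like "共 100 页,"; the regex in
# get_product() grabs the first run of digits as the page count.
page_info = "共 100 页,"
page = re.findall(r"(\d+)", page_info)[0]
print(page)
```

Note that the result is a string, so convert it with int() before comparing it to a page counter.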

Retrieve the data:

# author: Dragon少年
# Retrieve the data
def get_data():
    # All item information sits under the items node
    items = browser.find_elements_by_xpath('//div[@class="items"]/div[@class="item J_MouserOnverReq  "]')
    for item in items:
        # Product name
        pro_desc = item.find_element_by_xpath('.//div[@class="row row-2 title"]/a').text
        # Price
        pro_price = item.find_element_by_xpath('.//strong').text
        # Number of buyers
        buy_num = item.find_element_by_xpath('.//div[@class="deal-cnt"]').text
        # Store
        shop = item.find_element_by_xpath('.//div[@class="shop"]/a').text
        # Shipping origin
        address = item.find_element_by_xpath('.//div[@class="location"]').text
        with open('{}.csv'.format(key_word), mode='a', newline='', encoding='utf-8-sig') as f:
            csv_writer = csv.writer(f, delimiter=',')
            csv_writer.writerow([pro_desc, pro_price, buy_num, shop, address])

Driving the crawl with selenium:

# author: Dragon少年
# Imports needed by all three snippets
import csv
import re
import time

from selenium import webdriver

key_word = input("请输入您要搜索的商品:")  # prompt: enter the product to search for
browser = webdriver.Chrome()
# Hide the navigator.webdriver flag so the page does not detect automation
browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
    Object.defineProperty(navigator, 'webdriver', {
      get: () => undefined
    })
    """
})
browser.get('https://www.taobao.com/')
page = get_product(key_word)
print(page)
get_data()
page_num = 1
while int(page) != page_num:
    print("=" * 100)
    print("正在爬取第{}页".format(page_num + 1))  # crawling page N
    # Each results page holds 44 items, so the s parameter advances by 44
    browser.get('https://s.taobao.com/search?q={}&s={}'.format(key_word, page_num * 44))
    browser.implicitly_wait(15)
    get_data()
    page_num += 1
print("爬取结束!")  # crawl finished

At this point, we have crawled and saved the mooncake data.

2. Data cleaning

Next, we need to clean the crawled data. First we remove duplicate rows and normalize records with no recorded buyers. The core code is as follows:

# author: Dragon少年
import numpy as np
import pandas as pd

# Read the crawled data
df = pd.read_csv("月饼.csv", encoding='utf-8-sig', header=None)
df.columns = ["商品名", "价格", "购买人数", "店铺", "地址"]  # name, price, buyers, store, address
# Drop duplicate rows
df.drop_duplicates(inplace=True)
print(df.shape)
# Treat records with no purchase count as "0人付款" (0 buyers)
df['购买人数'] = df['购买人数'].replace(np.nan, '0人付款')
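A minimal sketch of what these two cleaning steps do, on toy rows that mirror the crawler's CSV columns (all values invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy rows mirroring the crawler's CSV columns (values invented).
df = pd.DataFrame({
    "商品名": ["月饼A", "月饼A", "月饼B"],
    "价格": [59.0, 59.0, 128.0],
    "购买人数": ["1000人付款", "1000人付款", np.nan],
})
df.drop_duplicates(inplace=True)                        # the duplicate 月饼A row is dropped
df["购买人数"] = df["购买人数"].replace(np.nan, "0人付款")  # missing buyer counts become "0人付款"
print(df.shape)                # (2, 3)
print(df["购买人数"].tolist())  # ['1000人付款', '0人付款']
```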

Some purchase counts are displayed in units of 万 (10,000), so we need to normalize the sales figures, extract the province from the shipping area, and finally save the cleaned data. The core code is as follows:

df['num'] = [re.findall(r'(\d+\.{0,1}\d*)', i)[0] for i in df['购买人数']]  # extract the numeric part
df['num'] = df['num'].astype('float')  # convert to float
# Extract the unit (万 = 10,000)
df['unit'] = [''.join(re.findall(r'(万)', i)) for i in df['购买人数']]
df['unit'] = df['unit'].apply(lambda x: 10000 if x == '万' else 1)
# Compute the sales volume
df['销量'] = df['num'] * df['unit']

# Drop rows with no shipping address, then extract the province
df = df[df['地址'].notna()]
df['省份'] = df['地址'].str.split(' ').apply(lambda x: x[0])
# Drop the helper columns
df.drop(['购买人数', '地址', 'num', 'unit'], axis=1, inplace=True)
# Reset the index
df = df.reset_index(drop=True)
df.to_csv('月饼清洗数据.csv')
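A quick standalone check of the 万 conversion and the province split (values invented; this sketch uses pandas' str.extract for brevity instead of the list comprehension above):

```python
import pandas as pd

# "2.5万人付款" means 25,000 buyers (万 = 10,000); "800人付款" means 800.
buyers = pd.Series(["2.5万人付款", "800人付款"])
num = buyers.str.extract(r"(\d+\.?\d*)")[0].astype(float)   # numeric part
unit = buyers.apply(lambda s: 10000 if "万" in s else 1)     # 万 multiplier
sales = num * unit
print(sales.tolist())  # [25000.0, 800.0]

# The province is the first space-separated token of the shipping address.
province = "浙江 杭州".split(" ")[0]
print(province)  # 浙江
```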

At this point, the crawled mooncake data has been cleaned and organized.

3. Visual Analysis

Next, we need to visualize the data. Here we use pyecharts, a library for generating Echarts charts, which makes it easy to produce visual charts from data in Python.

The blogger has previously written a detailed article on pyecharts, covering the usage of its various chart types with examples. If you are new to it, you can learn pyecharts first. [Learn pyecharts, the cool charting tool, in one article]

Top 10 stores by sales:

Next, we read the cleaned data, group by store to get each store's mooncake sales, take the top ten stores by sales, and display them as a bar chart. The core code is as follows:

from pyecharts import options as opts
from pyecharts.charts import Bar

# Top 10 stores by total mooncake sales
shop_top10 = df.groupby('店铺')['销量'].sum().sort_values(ascending=False).head(10)
# Draw the bar chart
bar1 = Bar(init_opts=opts.InitOpts(width='600px', height='450px'))
bar1.add_xaxis(shop_top10.index.tolist())
bar1.add_yaxis('销量', shop_top10.values.tolist())
bar1.set_global_opts(title_opts=opts.TitleOpts(title='销量Top10店铺-Dragon少年'),
                     xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-30)))
bar1.render("销量Top10店铺-Dragon少年.html")
bar1.render_notebook()
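The groupby-and-rank step can be seen in isolation on toy data (store names and figures invented; the real code runs on the cleaned CSV):

```python
import pandas as pd

# Toy sales records (values invented for illustration).
df = pd.DataFrame({
    "店铺": ["店A", "店B", "店A", "店C"],
    "销量": [100.0, 50.0, 200.0, 80.0],
})
# Sum sales per store, sort descending, keep the top 2
shop_top = df.groupby("店铺")["销量"].sum().sort_values(ascending=False).head(2)
print(shop_top.index.tolist())   # ['店A', '店C']
print(shop_top.values.tolist())  # [300.0, 80.0]
```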

We can see that Daoxiangcun Food Store, Tmall Supermarket, Huamei Food, Zhenwei Food, etc. sell the most. Daoxiangcun has the best sales, leading the field by a wide margin.

Top 10 mooncakes by sales:

We can likewise group by product name to find the best-selling mooncakes, take the top ten by sales, and display them as a bar chart. The core code is as follows:

# Top 10 mooncakes by sales
shop_top10 = df.groupby('商品名')['销量'].sum().sort_values(ascending=False).head(10)

# Draw the bar chart
bar0 = Bar(init_opts=opts.InitOpts(width='750px', height='450px'))
bar0.add_xaxis(shop_top10.index.tolist())
bar0.add_yaxis('销量', shop_top10.values.tolist())
bar0.set_global_opts(title_opts=opts.TitleOpts(title='销量Top10月饼-Dragon少年'),
                     xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-30)))
bar0.render("销量Top10月饼-Dragon少年.html")
bar0.render_notebook()

We can see the 10 most popular mooncakes. Among them, the Daoxiangcun mooncake gift box is the most popular, with sales reaching 500,000.

Sales share of mooncakes by price range:

Next, we bucket mooncakes by price range: under 50 yuan, 50-150 yuan, 150-500 yuan, and over 500 yuan, count the sales in each bucket, and visualize the result as a pie chart. The core code is as follows:

from pyecharts.charts import Pie

def price_range(x):  # price buckets, following Taobao's recommended ranges
    if x <= 50:
        return '50元以下'
    elif x <= 150:
        return '50-150元'
    elif x <= 500:
        return '150-500元'
    else:
        return '500元以上'

df['price_range'] = df['价格'].apply(lambda x: price_range(x))
price_cut_num = df.groupby('price_range')['销量'].sum()
data_pair = [list(z) for z in zip(price_cut_num.index, price_cut_num.values)]
# Pie chart
pie1 = Pie(init_opts=opts.InitOpts(width='750px', height='350px'))
# Built-in rich-text labels
pie1.add(
        series_name="销量",
        radius=["35%", "55%"],
        data_pair=data_pair,
        label_opts=opts.LabelOpts(formatter='{b}—占比{d}%'),
)
pie1.set_global_opts(legend_opts=opts.LegendOpts(pos_left="left", pos_top='30%', orient="vertical"),
                     title_opts=opts.TitleOpts(title='不同价格月饼销量占比-Dragon少年'))
pie1.render("不同价格月饼销量占比-Dragon少年.html")
pie1.render_notebook()
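A quick standalone check of the bucketing boundaries (the sample prices are invented; note that the boundary values 50, 150, and 500 fall into the lower bucket because the comparisons use <=):

```python
# Same bucketing function as in the article
def price_range(x):
    if x <= 50:
        return '50元以下'
    elif x <= 150:
        return '50-150元'
    elif x <= 500:
        return '150-500元'
    else:
        return '500元以上'

labels = [price_range(p) for p in (39.9, 150, 500.5)]
print(labels)  # ['50元以下', '50-150元', '500元以上']
```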

We can see that mooncake sales are concentrated at 150 yuan and below, which accounts for about 84% of sales; few people buy mooncakes priced above 500 yuan.

Distribution of mooncake sales by province:

Finally, we aggregate mooncake sales by the shipping province and display the distribution with a map visualization. The core code is as follows:

from pyecharts.charts import Map

# Total sales per province
province_num = df.groupby('省份')['销量'].sum().sort_values(ascending=False)
# Draw the map
map1 = Map(init_opts=opts.InitOpts(width='750px', height='350px'))
map1.add("", [list(z) for z in zip(province_num.index.tolist(), province_num.values.tolist())],
         maptype='china'
        )
map1.set_global_opts(title_opts=opts.TitleOpts(title='各省月饼销量分布-Dragon少年'),
                     visualmap_opts=opts.VisualMapOpts(max_=300000)
                    )
map1.render("各省月饼销量分布-Dragon少年.html")
map1.render_notebook()

We can see that Guangdong and Zhejiang have the highest mooncake sales, making them the hottest provinces for mooncakes. At this point, the mooncake crawler and visual analysis are complete~ Now that you have read the analysis, which mooncakes do you want to buy?

The source code and data have been uploaded. Follow the official account at the end of the article and reply [moon cake source code] to get the complete source code.


Origin blog.csdn.net/hhladminhhl/article/details/120249968