Python in practice | Dragon Boat Festival gifts for relatives and elders: a zongzi visualization dashboard to help!

Official account: "杰哥的IT之旅". Reply "粽子" to the account to get the complete data for this article.

Introduction to this article

This year, I used Python to crawl the zongzi data on JD.com and analyze it. Let's see what I found!

This article covers three aspects: data crawling, data cleaning, and data visualization. Together they make up a small but complete data analysis project, so that you can apply what you know end to end.

The whole idea is as follows:

  • Crawled site: https://www.jd.com/
  • Crawl scope: we search JD.com for "zongzi" (粽子), which returns about 100 pages of results. The crawled fields come from both the first-level (search result) pages and the second-level (product detail) pages;
  • Crawling approach: first parse the first-level page for one page of results, then parse the second-level pages, and finally add page turning;
  • Crawled fields: the name (title), price, brand (shop), and category (flavor) of each zongzi product;
  • Tools: requests + lxml + chardet + pandas + time + re + pyecharts
  • Page parsing method: XPath

The final effect is as follows:

picture

Data scraping

JD.com loads its search results dynamically: a plain request only returns the first 30 items of each page (a page has 60 items in total).

In this article I only use the most basic method and crawl the first 30 items on each page (if you are interested, you can try crawling all the data yourself).

So which fields does this article crawl? The screenshot below shows them. If you are interested, you can crawl more fields for a more detailed analysis.

picture
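
The first-level fields can be pulled out with XPath. As a minimal sketch on a hand-written HTML fragment (the fragment, the example.com URL and the helper name parse_first_level are invented for illustration and only imitate the class names that the crawler's XPath expressions look for), note how string(.) collects the full product name even when part of it sits inside a nested tag:

```python
from lxml import etree

# Hand-written fragment for illustration only; it is not real JD.com markup.
SAMPLE_HTML = """
<div class="gl-i-wrap">
  <div class="p-price"><strong><em>¥</em><i>39.90</i></strong></div>
  <div class="p-name p-name-type-2">
    <a href="//item.example.com/1.html"><em>Brand A <font>zongzi</font> gift box</em></a>
  </div>
</div>
"""

def parse_first_level(html_text):
    """Return (name, price, detail_url) tuples found in one result page."""
    html = etree.HTML(html_text)
    prices = html.xpath('//div/div[@class="p-price"]/strong/i/text()')
    name_nodes = html.xpath('//div/div[@class="p-name p-name-type-2"]/a/em')
    # string(.) joins the text of the <em> node and all of its descendants
    names = [str(node.xpath('string(.)')) for node in name_nodes]
    hrefs = html.xpath('//div/div[@class="p-name p-name-type-2"]/a/@href')
    urls = ["http:" + href for href in hrefs]  # the hrefs carry no scheme
    return list(zip(names, prices, urls))

rows = parse_first_level(SAMPLE_HTML)
print(rows)
```

Using text() instead of string(.) on the name node would return the name in separate pieces, which is why the crawler takes the string value of the whole node.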

The crawler code is shown below:

import pandas as pd
import requests
from lxml import etree
import chardet
import time
import re

def get_CI(url):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'}
    rqg = requests.get(url, headers=headers)
    # Let chardet guess the encoding from the raw bytes before decoding the text
    rqg.encoding = chardet.detect(rqg.content)['encoding']
    html = etree.HTML(rqg.text)

    # Price
    p_price = html.xpath('//div/div[@class="p-price"]/strong/i/text()')

    # Name: string(.) joins the text of the <em> node and its child tags
    p_name = html.xpath('//div/div[@class="p-name p-name-type-2"]/a/em')
    p_name = [str(node.xpath('string(.)')) for node in p_name]

    # Second-level (product detail) URLs; the scheme is prepended to each href
    deep_ur1 = html.xpath('//div/div[@class="p-name p-name-type-2"]/a/@href')
    deep_url = ["http:" + i for i in deep_ur1]

    # From here on, we fetch the information on the second-level pages
    brands_list = []
    kinds_list = []
    for i in deep_url:
        rqg = requests.get(i, headers=headers)
        rqg.encoding = chardet.detect(rqg.content)['encoding']
        html = etree.HTML(rqg.text)

        # Brand
        brands = html.xpath('//div/div[@class="ETab"]//ul[@id="parameter-brand"]/li/@title')
        brands_list.append(brands)

        # Category (flavor), taken from the "类别:" entry in the page source
        kinds = re.findall('>类别:(.*?)</li>', rqg.text)
        kinds_list.append(kinds)

    data = pd.DataFrame({'名称':p_name,'价格':p_price,'品牌':brands_list,'类别':kinds_list})
    return data

                           
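# Search URL for the keyword 粽子 (percent-encoded); the page number is appended below
x = "https://search.jd.com/Search?keyword=%E7%B2%BD%E5%AD%90&qrst=1&wq=%E7%B2%BD%E5%AD%90&stock=1&page="
# Odd page numbers 1, 3, ..., 199 give 100 result pages
url_list = [x + str(i) for i in range(1,200,2)]
res = pd.DataFrame(columns=['名称','价格','品牌','类别'])

# Page turning: crawl each page and append its rows
for url in url_list:
    res0 = get_CI(url)
    res = pd.concat([res,res0])
    time.sleep(3)  # pause between pages

# Save the data
res.to_csv('aliang.csv',encoding='utf_8_sig')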

The final crawled data:

picture
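
The page-turning URLs in the crawler are built by string concatenation around a percent-encoded keyword. As a minimal sketch (the helper name build_search_urls is hypothetical, and the query parameters are simply copied from the URL used in the crawler), the same list can be produced with the standard library, which also shows that %E7%B2%BD%E5%AD%90 is just the UTF-8 percent-encoding of 粽子:

```python
from urllib.parse import quote, urlencode

BASE = "https://search.jd.com/Search"

def build_search_urls(keyword, pages):
    """Return one search URL per result page, using odd page numbers 1, 3, 5, ..."""
    urls = []
    for n in range(pages):
        params = {
            "keyword": keyword,
            "qrst": 1,
            "wq": keyword,
            "stock": 1,
            "page": 2 * n + 1,
        }
        # urlencode percent-encodes the keyword as UTF-8
        urls.append(BASE + "?" + urlencode(params))
    return urls

urls = build_search_urls("粽子", 100)
print(quote("粽子"))
print(urls[0])
```

Building the query from a dictionary makes it easy to swap in another keyword without encoding it by hand.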

Data cleaning

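As the figure above shows, the data is quite tidy and not particularly messy, so only a few simple operations are needed.

First, read the data with pandas. (The crawler above saved aliang.csv, while the code below reads 粽子.xlsx, so the file was presumably renamed and converted to Excel in between.)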

import pandas as pd

df = pd.read_excel("粽子.xlsx",index_col=False)
df.head()

The result is as follows:

picture

Next, we remove the square brackets from the "brand" (品牌) and "category" (类别) fields.

df["品牌"] = df["品牌"].apply(lambda x: x[1:-1])
df["类别"] = df["类别"].apply(lambda x: x[1:-1])
df.head()

The result is as follows:

picture
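
Because the crawler stored each brand and category as a Python list, a cell on disk holds the text of that list (for example "['五芳斋']"), and slicing off the first and last character leaves the inner quotes in place. A hedged alternative, assuming every cell really is a list literal holding zero or one string, is to parse it with ast.literal_eval; the helper name unwrap_cell is hypothetical:

```python
import ast

def unwrap_cell(cell):
    """Turn the text "['五芳斋']" into '五芳斋'; an empty list "[]" becomes ''."""
    try:
        values = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return cell  # not a Python literal: leave the cell unchanged
    if isinstance(values, list):
        return str(values[0]).strip() if values else ""
    return cell

print(unwrap_cell("['五芳斋']"))
```

It could be applied in the same way as the slicing version, for example df["品牌"] = df["品牌"].apply(unwrap_cell).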

Top 10 zongzi brands (shops)

df["品牌"].value_counts()[:10]

The result is as follows:

picture

Top 5 Zongzi flavors

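def func1(x):
    # Group every category label that contains 甜 (sweet) under 甜粽子 (sweet zongzi)
    if "甜" in x:
        return "甜粽子"
    else:
        return x
df["类别"] = df["类别"].apply(func1)
# [1:6] skips the most frequent value and keeps the next five
df["类别"].value_counts()[1:6]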

The result is as follows:

picture

Zongzi price ranges

def price_range(x): # Price bands chosen according to my own shopping habits
    if x <= 50:
        return '<50元'
    elif x <= 100:
        return '50-100元'
    elif x <= 300:
        return '100-300元'
    elif x <= 500:
        return '300-500元'
    elif x <= 1000:
        return '500-1000元'
    else:
        return '>1000元'

df["价格区间"] = df["价格"].apply(price_range)
df["价格区间"].value_counts()

The result is as follows:

picture
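
The if/elif chain above maps each price to a band. As a sketch of an equivalent vectorized approach, pandas.cut can assign the same bands in one call. The bin edges and labels are copied from price_range, the helper name add_price_band is hypothetical, and the sample prices are made-up values for illustration only:

```python
import numpy as np
import pandas as pd

BINS = [-np.inf, 50, 100, 300, 500, 1000, np.inf]
LABELS = ['<50元', '50-100元', '100-300元', '300-500元', '500-1000元', '>1000元']

def add_price_band(prices):
    """Label each price with its band.

    right=True makes every bin (low, high], which matches the <= tests
    in price_range.
    """
    return pd.cut(prices, bins=BINS, labels=LABELS, right=True)

# Illustrative prices, not crawled data
sample = pd.Series([19.9, 50, 100.0, 128, 499, 1000, 1288])
bands = add_price_band(sample)
print(bands.value_counts().sort_index())
```

Because the result is an ordered categorical, sort_index() lists the bands from cheapest to most expensive rather than by frequency.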

Because the dataset is small, has few fields, and contains little messy data, operations such as deduplication and missing-value filling are not performed here. You can collect more fields and more data on your own for a deeper analysis.
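
For a larger crawl, those two skipped steps could look roughly like this. It is only a sketch: the helper name basic_clean and the placeholder label 未知 ("unknown") are assumptions, and the small frame is toy data that reuses the article's column names:

```python
import pandas as pd

def basic_clean(df):
    """Drop exact duplicate rows, then fill missing brand/category cells."""
    out = df.drop_duplicates().reset_index(drop=True)
    out[["品牌", "类别"]] = out[["品牌", "类别"]].fillna("未知")
    return out

# Toy data for illustration only
toy = pd.DataFrame({
    "名称": ["a", "a", "b"],
    "价格": [10.0, 10.0, 20.0],
    "品牌": ["x", "x", None],
    "类别": ["k", "k", "m"],
})
cleaned = basic_clean(toy)
print(cleaned)
```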

Data visualization

As the saying goes, a table beats words and a chart beats a table. Through visual analysis, we can reveal the information "hidden" behind the data.

Going further: this is only an introduction. I did not collect much data or many fields. As homework for readers who are learning, try a more thorough analysis with more data and more fields.

Here, we build visualizations for the following questions:

  • Column chart of the top 10 zongzi brands (shops);
  • Column chart of the top 5 zongzi flavors;
  • Pie chart of zongzi price ranges;
  • Word cloud of zongzi product names.

To keep the article's layout manageable, the code for the visualization part is provided at the end of this article.

Column chart of the top 10 zongzi brands (shops)

picture

Analysis: Last year we analyzed some mooncake data, and the brands "Wufangzhai" and "Beijing Daoxiangcun" are still fresh in my memory. As for "Sanquan" and "Synear", I always thought they only made dumplings and glutinous rice balls. Are their zongzi worth a try? There are also some brands here that are new to me, such as "Zhu Boss" and "Daoxiang Private House", which you can look up yourself. When shopping it pays to choose carefully, and the brand matters too.

Top 5 Column Chart of Zongzi Taste Ranking

picture

Analysis: As far as I remember, what I ate most as a child was sweet zongzi. Only in junior high school did I learn that zongzi can contain meat. The chart shows that many shops sell fresh pork zongzi; after all, they look upscale as a gift. There are also flavors such as candied jujube zongzi and bean paste zongzi, which I have hardly ever eaten. If you were giving zongzi as a gift, which flavor would you choose?

Pie chart of zongzi price ranges

picture

Analysis: Here I deliberately split the price range finely. The pie chart matches reality: the Dragon Boat Festival comes only once a year, and sales are still dominated by low margins and high volume. Nearly 80% of the zongzi are priced below 100 yuan. There are also some mid-range products at 100-300 yuan. Above 300 yuan, I don't think it is worth it; I, at least, would not spend that much on zongzi.

Zongzi product name word cloud

picture

Analysis: From the word cloud we can roughly see what merchants use as selling points. Since it is a festival, gift-related words reflect the festive atmosphere. "Pork" and "bean paste" describe the flavors. "Breakfast" suggests that zongzi are also pitched as a breakfast option, and "group purchase" shows that group buying is supported. Words like these will catch the attention of some shoppers.

Combining the charts into a dashboard

picture

The visualizations in this article are drawn with the pyecharts library. Each chart is made separately first, and the charts are then combined into one large dashboard. For the details, see my source code file mentioned at the end of the article: 【Zongzi.ipynb】

picture


Origin blog.csdn.net/jake_tian/article/details/117850238