Complete Data Analysis Process: How Pandas in Python Solves Business Problems

Opening

As a glue language, Python can do almost anything, and its role in data science is all but irreplaceable. For the hard skill of data analysis, Python is a tool well worth learning.

Among its libraries, Pandas is the one data analysts use most. If you have already had some contact with it, this article walks through a complete data analysis process and explores how Pandas solves business problems.

Data background

To exercise as many different Pandas functions as possible, I designed a dataset that looks odd but is actually very realistic. To put it bluntly, it is full of irregularities waiting to be cleaned.

The data source is adapted from supermarket orders, and the file is attached at the end of the article.

Import required modules

import pandas as pd

Data import

Pandas provides a wealth of data IO interfaces, of which the most commonly used are pd.read_excel and pd.read_csv:

data = pd.read_excel('文件路径.xlsx',
                    sheet_name='分页名称')
data = pd.read_csv('文件路径.csv')

Import multiple pages of data from the supermarket dataset:

orders = pd.read_excel('超市数据集.xlsx', sheet_name='订单表')
customers = pd.read_excel('超市数据集.xlsx', sheet_name='客户表')
products = pd.read_excel('超市数据集.xlsx', sheet_name='产品表')

Besides importing the data, this step should also build a preliminary understanding of it: which fields exist and what they mean.

Here we use pd.DataFrame.head() to view the fields and sample rows of each table.
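For example, previewing the first five rows of each table:

orders.head()
customers.head()
products.head()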

Clarify the business problem and analysis approach

In real-world business analysis, before starting you need to clarify the analysis goal, work backwards to the analysis method and metrics, and from those derive the data required.

This is the pragmatic mindset of "beginning with the end in mind".

Assume the business requirement is to form a differentiated user operation strategy through tiered user operations. After evaluation, the data analysts decide that customers can be segmented with the RFM user-value model, and operational strategies can be formulated from the profile characteristics of each segment; for example, price-sensitive loyal customers should be stimulated to spend through discounts.

Therefore, the analysis method here is to segment existing users with the RFM model, and then provide strategic suggestions to the business by computing the data characteristics of each segment.

After clarifying the business needs and the analysis method, we can determine the statistics to compute for the profile analysis: each customer's R, F, M, and customer unit price (average spend per order). Then we can move on to the next step.

Feature engineering and data cleaning

Data science has a saying: "Garbage in, garbage out." If the data used for analysis is of poor quality and riddled with errors, then no matter how meticulous and sophisticated the analysis method is, it cannot turn dross into gold; the conclusions will still be unusable.

Hence another saying: 80% of a data scientist's work is data preprocessing.

Feature engineering mainly appears in machine learning workflows. It is the systematic work of squeezing the best performance out of a model, covering data preprocessing (Data Preprocessing), feature extraction (Feature Extraction), feature selection (Feature Selection), and feature construction (Feature Construction).

Put plainly, it can be broken down into two parts:

  • Data preprocessing, which can be understood as what we usually call data cleaning;
  • Feature construction, for example building the RFM model and segment profiles, where tags such as R, F, M, and customer unit price are the corresponding features.

(Of course, RFM is not a machine learning model; the framing above is just for ease of understanding.)

Data cleaning

What is data cleaning? It means finding the "outliers" in the data and "handling" them, so that conclusions drawn at the application level are closer to the real business.

Outliers include:

  • Irregular data, such as null values, duplicate rows, and useless fields. Also watch for unreasonable values, such as internal test orders in the order data, or customers over 200 years old.
  • Pay special attention to whether data formats are reasonable; otherwise table merges and aggregation statistics will raise errors later.
  • Data outside the business analysis scope. For example, when analyzing user behavior over 2019-2021, behavior outside this period should not be included.

How to handle them:

  • Under normal circumstances, outliers can simply be dropped;
  • But when data is scarce or the feature is important, outliers can be handled in richer ways, such as replacing them with the mean. A sketch of both approaches follows this list.
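A minimal sketch of the two approaches; the 年龄 (age) column and the 200-year cutoff are illustrative assumptions, not fields guaranteed to exist in this dataset:

# Approach 1: simply drop irregular rows (duplicates and nulls)
data = data.drop_duplicates().dropna()

# Approach 2: replace unreasonable values with the mean of the valid ones
# ('年龄' and the 200 threshold are illustrative)
valid_mean = customers.loc[customers['年龄'] <= 200, '年龄'].mean()
customers.loc[customers['年龄'] > 200, '年龄'] = valid_mean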

After understanding what data cleaning means, we can start practicing this part with Pandas.

Data types

First, use pd.DataFrame.dtypes to check whether the field types are reasonable.
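For example, on the order table:

orders.dtypes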

We find that the order date (订单日期) and quantity (数量) are of Object (generally string) type, which cannot be used in calculations later. The types need to be converted with pd.Series.astype() or pd.Series.apply():

orders['订单日期'] = orders['订单日期'].astype('datetime64[ns]')  # string -> datetime
orders['数量'] = orders['数量'].apply(int)  # string -> int

In addition, time types can also be handled with pd.to_datetime:

orders['订单日期'] = pd.to_datetime(orders['订单日期'])
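If the column may contain malformed date strings, pd.to_datetime can also coerce them to NaT instead of raising an error; errors='coerce' is a standard pandas option:

orders['订单日期'] = pd.to_datetime(orders['订单日期'], errors='coerce')  # invalid values become NaT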

Renaming fields

An experienced data analyst would also notice a problem with the field names: 订单 Id contains a space, which is inconvenient for later references. Fix the names with pd.DataFrame.rename():

orders = orders.rename(columns={'订单 Id':'订单ID',
                                '客户 Id':'客户ID',
                                '产品 Id':'产品ID'})
customers = customers.rename(columns={'客户 Id':'客户ID'})

Multi-table joins

After processing the field names and data types, we can use pd.merge to join the tables.

pd.merge joins tables in two ways: if the join key has the same name in both tables, just pass on; if the names differ, use left_on and right_on.

data = orders.merge(customers, on='客户ID', how='left')
data = data.merge(products, how='left', 
                  left_on='产品ID', right_on='物料号')

Eliminate redundant fields

In the second case, the merged table ends up with two fields that have the same meaning but different names, so the redundant one needs to be removed with pd.DataFrame.drop. In addition, 行 Id (Row Id) is a useless field here, so drop it as well:

data.drop(['物料号','行 Id'], axis=1, inplace=True)

The adjusted table structure:

Text processing - remove data outside the business scenario

Business experience suggests the order table may contain internal test data, which would distort the conclusions, so it must be found and removed. After communicating with the business and ops teams, we learn that test orders are identified by the word "测试" ("test") in the 产品名称 (product name) column.

Because this is text content, use pd.Series.str.contains to find and remove them:

data = data[~data['产品名称'].str.contains('测试')]

Time processing - remove data outside the analysis window

The factors that influence consumers have a decaying time window. If you bought a cute hat 10 years ago, that does not mean you still want cute products today, because 10 years is enough time for you to change many times over; but if you bought a country-style dress 10 days ago, we can be fairly confident you still like the country style, because preferences change little in the short term.

In other words, behavioral data in user analysis has a shelf life, so the time range must be clarified in light of the business scenario, and then pd.Series.between() can be used to filter the orders that fall within that range for RFM modeling:

data = data[data['订单日期'].between('2019-01-01','2021-08-13')]

Feature construction

The purpose of this step is to construct the features required by the analysis model, i.e. the RFM model and the fields needed for segment profile analysis.

Data aggregation - customer consumption features

First, the consumption characteristics of customers in the RFM model:

  • R: the number of days between the customer's most recent purchase and the analysis date (set to 2021-08-14), used to judge how recently the buyer was active
  • F: customer consumption frequency
  • M: customer consumption amount

These are aggregations of consumption data over a period of time, so they can be implemented with pd.groupby().agg():

consume_df = data.groupby('客户ID').agg(累计消费金额=('销售额',sum), 
                         累计消费件数=('数量',sum),
                         累计消费次数=('订单日期', pd.Series.nunique), 
                         最近消费日期=('订单日期',max)
                        )

Among them, the R value is special: the datetime module is needed to compute the number of days between dates.

from datetime import datetime
consume_df['休眠天数'] = datetime(2021,8,14) - consume_df['最近消费日期']
consume_df['休眠天数'] = consume_df['休眠天数'].map(lambda x:x.days)

The resulting table of cumulative customer consumption statistics:

Binning - dividing customer unit price into ranges

According to the earlier analysis plan, once the RFM segmentation is complete we need to profile the consumption of each segment. Due to space limitations, only the distribution of customer unit price within each segment is computed here.

Next, after computing the customer unit price, use pd.cut to bin it into price ranges:

consume_df['客单价'] = consume_df['累计消费金额']/consume_df['累计消费次数']
consume_df['客单价区间']  = pd.cut(consume_df['客单价'],bins=5)

Use the pd.Series.value_counts method to count the distribution of customer unit price ranges:
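For example:

consume_df['客单价区间'].value_counts()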

The bins parameter of pd.cut controls how the customer unit price is divided: passing 5 splits the value range into 5 equal-width bins. As always, in practice this should be clarified with the business, or determined from the business scenario.
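If the business already has agreed price bands, bins can instead be passed as a list of edges; the cut points below are purely illustrative:

consume_df['客单价区间'] = pd.cut(consume_df['客单价'],
                             bins=[0, 1000, 5000, 10000, float('inf')])  # illustrative edges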

RFM modeling

After data cleaning and feature construction are complete, we enter the modeling and analysis phase.

Tukey's Test Outlier Detection

Experience shows that outliers heavily distort statistical indicators and cause large errors. For example, put Jack Ma in your class and the class's average assets come out in the tens of billions; Jack Ma is the outlier, and he needs to be removed.

Therefore, before calculating the RFM thresholds, outlier detection must be performed on the R, F, and M values.

Here we use Tukey's test, which, simply put, uses the quartiles to form a numeric interval and marks data outside it as outliers: values outside [Q1 - 1.5×IQR, Q3 + 1.5×IQR], where IQR = Q3 - Q1, are flagged. Readers unfamiliar with it can search Zhihu; it is not covered in depth here.

Tukey's test relies on quantile calculations, which Pandas provides through pd.Series.quantile:

def tukeys_test(fea):
    Q3 = consume_df[fea].quantile(0.75)  # upper quartile
    Q1 = consume_df[fea].quantile(0.25)  # lower quartile
    max_ = Q3 + 1.5*(Q3-Q1)  # upper fence
    min_ = Q1 - 1.5*(Q3-Q1)  # lower fence

    if min_ < 0:  # consumption features cannot be negative
        min_ = 0

    return max_, min_

The code above implements the Tukey's test function: Q3 is the 75th percentile, Q1 the 25th, and min_ and max_ bound the interval of reasonable values. Data outside this interval, whether too high or too low, is an outlier.

Note that because min_ may come out negative while consumption data cannot be, it is clipped to 0.

Next, add a 是否异常 ("is abnormal") field with a default of 0 to the RFM feature table, use the Tukey's test function to mark abnormal rows as 1, and finally keep only the rows whose value is 0.

rfm_features = ['累计消费金额', '累计消费次数', '休眠天数']  # the M, F and R features

consume_df['是否异常'] = 0

for fea in rfm_features:
    max_, min_ = tukeys_test(fea)
    normal = consume_df[fea].between(min_, max_)  # boolean: True if within the fences
    consume_df.loc[~normal, '是否异常'] = 1

consume_df = consume_df[consume_df['是否异常']==0]

Clustering and the 80/20 rule - RFM threshold calculation

Now that the modeling features are known to be valid, we need to compute the threshold for each RFM indicator. Thresholds are usually derived with a clustering algorithm, but no machine learning is involved here. In essence, clustering results tend to follow the Pareto (80/20) principle: the important customer group should account for roughly 20%, so we can use the 80th percentile to approximate the RFM thresholds.

M_threshold = consume_df['累计消费金额'].quantile(0.8)
F_threshold=consume_df['累计消费次数'].quantile(0.8)
R_threshold = consume_df['休眠天数'].quantile(0.2)

RFM model calculation

Once the RFM thresholds are obtained, each customer's RFM flags can be computed: 1 if above the threshold, 0 if below. The logic for the R value is reversed, because R is the number of dormant days: the larger it is, the less active the customer.

consume_df['R'] = consume_df['休眠天数'].map(lambda x:1 if x<R_threshold else 0)
consume_df['F'] = consume_df['累计消费次数'].map(lambda x:1 if x>F_threshold else 0)
consume_df['M'] = consume_df['累计消费金额'].map(lambda x:1 if x>M_threshold else 0)

With the RFM features encoded as 1 and 0, i.e. high and low, the segment assignment can be computed:

consume_df['RFM'] = (consume_df['R'].apply(str) + '-'
                     + consume_df['F'].apply(str) + '-'
                     + consume_df['M'].apply(str))

rfm_dict = {
    '1-1-1':'重要价值用户',
    '1-0-1':'重要发展用户',
    '0-1-1':'重要保持用户',
    '0-0-1':'重要挽留用户',
    '1-1-0':'一般价值用户',
    '1-0-0':'一般发展用户',
    '0-1-0':'一般保持用户',
    '0-0-0':'一般挽留用户'
}
consume_df['RFM人群'] = consume_df['RFM'].map(lambda x:rfm_dict[x])
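As a side note, since rfm_dict is a plain dict, pd.Series.map accepts it directly, so the lambda wrapper is not strictly needed:

consume_df['RFM人群'] = consume_df['RFM'].map(rfm_dict)  # equivalent and slightly more idiomatic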

At this point, the RFM modeling and user segmentation calculations are complete.

Segment profiles

After the model segmentation is complete, we count the number of customers in each segment and the distribution of customer unit price.

Proportion of people

The simplest profile analysis is to count each segment's size with pd.Series.value_counts and compare the relative proportions:

rfm_analysis = pd.DataFrame(consume_df['RFM人群'].value_counts()).rename(columns={'RFM人群':'人数'})
rfm_analysis['人群占比'] = (rfm_analysis['人数']/rfm_analysis['人数'].sum()).map(lambda x:'%.2f%%'%(x*100))

Pivot table

The distribution of customer unit price within each segment is a multi-dimensional analysis, which Pandas's pivot function pd.pivot_table handles.

In the code below, the result is assigned to pivot_table for reuse later, and the aggregation function aggfunc uses the pd.Series.nunique method, which counts distinct values; here it deduplicates customer IDs to count the customers in each price range.

pivot_table = pd.pivot_table(consume_df.reset_index(),    # source DataFrame (index reset so 客户ID is a column)
        values='客户ID',    # values to aggregate
        index='RFM人群',    # row grouping
        columns='客单价区间',    # column grouping
        aggfunc=pd.Series.nunique,    # count distinct customer IDs
        fill_value=0,    # fill missing combinations with 0
        margins=True,    # add a totals row/column
        dropna=False,    # keep empty groups
        margins_name='All'   # name of the totals row/column
       ).sort_values(by='All', ascending=False).reset_index()  # reset so 'RFM人群' is a column for the melt below
pivot_table

This yields the distribution of each segment across the price bands; combined with profile analysis along other dimensions, it can further shape the marketing strategy.

Reverse pivot table

Finally, one neat trick: pivoted tables are multi-dimensional (wide) tables, but to import them into tools such as PowerBI for visual analysis, they need to be unpivoted with pd.melt back into one-dimensional (long) tables.

pivot_table.melt(id_vars='RFM人群',
                 value_vars=['(124.359, 3871.2]', '(3871.2, 7599.4]',
                             '(7599.4, 11327.6]', '(11327.6, 15055.8]',
                             '(15055.8, 18784.0]']).sort_values(by=['RFM人群','variable'],ascending=False)

This produces a one-dimensional (long) table in which a single row, with fields like segment ("RFM人群"), indicator ("variable"), and value ("value"), carries complete information. In the earlier segment statistics, by contrast, the two-dimensional table required locating information by row and column.
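Optionally, the default melt column names variable and value can be renamed to something more readable; the Chinese names below are illustrative:

long_df = pivot_table.melt(id_vars='RFM人群').rename(
    columns={'variable': '客单价区间', 'value': '人数'})  # illustrative names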

Closing

So far, we have built the RFM model and the segment profile analysis with Pandas, completing the business analysis requirement.

Due to space limitations, this article only demonstrates the functions and methods Pandas uses most often in the data analysis process, but every step of the workflow matters equally. If some functions are unfamiliar, readers are encouraged to look them up on Zhihu or a search engine. You are also welcome to add Biscuit Brother on WeChat to discuss.

For more on how to use Pandas functions, you can consult the Chinese documentation.

This article can be regarded as the basic installment on the data analysis process. A more advanced follow-up is planned, covering the machine learning workflow and more feature engineering content, again presented through business applications.


Origin: blog.csdn.net/weixin_73136678/article/details/128806368