[Machine learning notes] Airline customer value analysis with Python

Using Python for airline customer value analysis (data analysis)

Learning materials:

Reference book: "Python Data Analysis and Mining Practice" (Machinery Industry Press) Chapter 7

Reference blog post: https://blog.csdn.net/a857553315/article/details/79177524

https://blog.csdn.net/weixin_39722361/article/details/79225305?depth_1-utm_source=distribute.pc_relevant.none-task&utm_source=distribute.pc_relevant.none-task

https://www.kesci.com/home/project/5a71818f94eaea7410491462


Goals

The book introduces the background; based on it, data mining is used to achieve the following goals:

  • Use airline customer data to classify customers
  • Analyze the features of each customer category and compare the value of the different categories
  • Provide personalized services to customer categories of different value and formulate corresponding marketing strategies

Analysis methods and processes

RFM model (Recency, Frequency, Monetary)

R: time since the most recent purchase

F: purchase frequency

M: purchase amount

The attribute binning used in traditional RFM analysis is shown in the figure. Each attribute is split at its mean: values above the mean are marked with an upward arrow (↑) and values below it with a downward arrow (↓). This can identify the most valuable customers, but it produces too many fine-grained customer groups, which raises the cost of targeted marketing.
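
As a minimal sketch of this mean-based binning, the snippet below marks each attribute of a toy RFM table with ↑ or ↓ and joins the three marks into a segment label. The customer IDs and values are made up for illustration, and the `segment` column is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy RFM table; the values are invented for illustration only
rfm = pd.DataFrame(
    {"R": [5, 40, 12, 90], "F": [12, 2, 8, 1], "M": [3000, 400, 2500, 150]},
    index=["c1", "c2", "c3", "c4"],
)

# Compare every value with the mean of its own column
above_mean = rfm > rfm.mean()

# Above the mean -> "↑", otherwise -> "↓"
arrows = pd.DataFrame(
    np.where(above_mean, "↑", "↓"), index=rfm.index, columns=rfm.columns
)

# One label per customer, e.g. "↓↑↑" = recent, frequent, high amount
rfm["segment"] = arrows["R"] + arrows["F"] + arrows["M"]
print(rfm)
```

With three attributes and two bins each, this scheme can yield up to 2 × 2 × 2 = 8 segments, which is the "too many groups" problem described above.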

The book uses the five indicators of the LRFMC model for K-Means clustering to identify the most valuable customers.

L: length of the customer relationship

R: time since the most recent flight

F: flight frequency

M: flight mileage

C: average discount coefficient

Why choose these five indicators?

  • The purchase amount is replaced by two indicators: the flight mileage M accumulated by the customer within a given period, and the average discount coefficient C of the cabins the customer flew in within that period.
  • The length of airline membership also affects customer value to some extent, so the length of the customer relationship L is added to the model as another indicator for distinguishing customers.

Indicator meaning:

  • L: the number of months from the member's joining date to the end of the observation window
  • R: the number of months from the customer's most recent flight with the company to the end of the observation window
  • F: the number of times the customer flew with the company within the observation window
  • M: the customer's accumulated flight mileage within the observation window
  • C: the average discount coefficient of the cabins the customer flew in within the observation window
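
A rough sketch of how the five indicators might be derived from raw records is shown below. All column names (`FFP_DATE`, `LOAD_TIME`, `LAST_TO_END`, `FLIGHT_COUNT`, `SEG_KM_SUM`, `AVG_DISCOUNT`) and the toy rows are assumptions for illustration, as is the 30-day approximation of a month.

```python
import pandas as pd

# Hypothetical raw columns (names assumed, not taken from the post):
#   FFP_DATE      membership start date
#   LOAD_TIME     end of the observation window
#   LAST_TO_END   days from the last flight to the end of the window
#   FLIGHT_COUNT  number of flights in the window
#   SEG_KM_SUM    kilometers flown in the window
#   AVG_DISCOUNT  average cabin discount coefficient
raw = pd.DataFrame({
    "FFP_DATE": pd.to_datetime(["2010-03-31", "2013-01-01"]),
    "LOAD_TIME": pd.to_datetime(["2014-03-31", "2014-03-31"]),
    "LAST_TO_END": [30, 300],
    "FLIGHT_COUNT": [40, 3],
    "SEG_KM_SUM": [80000, 4500],
    "AVG_DISCOUNT": [0.95, 0.60],
})

DAYS_PER_MONTH = 30  # rough approximation used to convert days to months

lrfmc = pd.DataFrame({
    "L": (raw["LOAD_TIME"] - raw["FFP_DATE"]).dt.days / DAYS_PER_MONTH,
    "R": raw["LAST_TO_END"] / DAYS_PER_MONTH,
    "F": raw["FLIGHT_COUNT"],
    "M": raw["SEG_KM_SUM"],
    "C": raw["AVG_DISCOUNT"],
})
print(lrfmc)
```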

Analysis steps:

(1) Selectively extract data and extract newly added data from the airline's data sources, forming the historical data and the incremental data, respectively;

(2) Explore (EDA) and preprocess the two data sets from step (1), including analysis of missing values and outliers, attribute reduction, cleaning, and transformation;

(3) Using the modeling data preprocessed in step (2), group customers based on the LRFMC customer value model, analyze the features of each group, and identify the valuable customers;

(4) Based on the model results, apply different marketing approaches to customers of different value and provide customized services.


Data exploration and analysis (EDA) implementation

Some code is omitted ...

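explore = data.describe(percentiles = [], include = 'all').T # basic description of the data; percentiles specifies which quantiles to compute (e.g. quartiles, median); T transposes the result, which makes it easier to read
explore['null'] = len(data)-explore['count'] # describe() counts non-null values automatically; the null count has to be computed manually

explore = explore[['null', 'max', 'min']]
explore.columns = ['null count', 'max', 'min'] # rename the header
'''Only part of the exploration result is selected here.
describe() automatically computes count (non-null count), unique (number of unique values), top (most frequent value), freq (frequency of the top value), mean, std (standard deviation), min, 50% (median), and max'''

explore.to_excel(resultfile) # export the result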
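
Since the loading code is omitted, here is a small self-contained sketch of the same kind of summary on a toy table. The column names and values are invented for illustration, and `isnull().sum()` is used as an alternative way to count missing values directly.

```python
import pandas as pd

# Toy data; the column names are illustrative, not read from the real file
data = pd.DataFrame({
    "SUM_YR_1": [1200.0, None, 0.0, 560.0],
    "SEG_KM_SUM": [5000, 3200, 800, 1500],
})

# One row per attribute: number of missing values, maximum, minimum
explore = pd.DataFrame({
    "null count": data.isnull().sum(),
    "max": data.max(),
    "min": data.min(),
})
print(explore)
```

A minimum of 0 in a fare column, as in this toy table, is the kind of value that the preprocessing step has to examine.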

Next, data preprocessing.

 


Data preprocessing

Preprocessing uses data cleaning, attribute reduction, and data transformation.

EDA shows that the data contains missing values, as well as records where the fare is 0 or the discount rate is 0 while the total flight kilometers are greater than 0.

Because the original data set is large and such records make up only a small share with little effect on the problem, they are discarded.

The specific method is as follows:

  • Discard records with empty fare
  • Discard records where the fare is 0, the average discount rate is not 0, and the total number of flight kilometers is greater than 0

 

Inspection shows unreasonable records in which the fare is zero but the flight kilometers are greater than zero. Their share is small, so they are deleted directly.

Keep only records where the fare is non-zero, or where the average discount rate and the total flight kilometers are both 0.

After deletion 62044 samples remain. The abnormal samples account for less than 1.5%, so removing them has little impact on the analysis results.
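
The two discard rules can be sketched in pandas as follows. The column names (`SUM_YR_1`, `SUM_YR_2`, `SEG_KM_SUM`, `avg_discount`) and the toy rows are assumptions for illustration; the post does not show its own cleaning code.

```python
import pandas as pd

# Toy records; column names are assumed for illustration
data = pd.DataFrame({
    "SUM_YR_1": [1000.0, None, 0.0, 0.0, 500.0],   # first-year total fare
    "SUM_YR_2": [800.0, 300.0, 0.0, 0.0, None],    # second-year total fare
    "SEG_KM_SUM": [6000, 2000, 1500, 0, 900],      # total flight kilometers
    "avg_discount": [0.8, 0.7, 0.6, 0.0, 0.9],     # average discount rate
})

# Rule 1: discard records with an empty fare
cleaned = data[data["SUM_YR_1"].notnull() & data["SUM_YR_2"].notnull()]

# Rule 2: keep records with a non-zero fare, or with both
# the discount rate and the flight kilometers equal to 0
fare_nonzero = (cleaned["SUM_YR_1"] != 0) | (cleaned["SUM_YR_2"] != 0)
all_zero = (cleaned["SEG_KM_SUM"] == 0) & (cleaned["avg_discount"] == 0)
cleaned = cleaned[fare_nonzero | all_zero]

print(cleaned)
print("share removed:", 1 - len(cleaned) / len(data))
```

In this toy table, rows 1 and 4 are dropped for an empty fare and row 2 is dropped because it has a zero fare with positive kilometers.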

Attribute reduction

Choose 6 attributes related to the LRFMC indicators. Delete irrelevant, weakly relevant, or redundant attributes.

The original data set has too many attributes, and they do not lend themselves to dimensionality reduction, so a few features that are more valuable to the airline are selected for analysis. The features here are not chosen exactly as in the book. The final selected features are the first-year total fare, the second-year total fare, the total flight kilometers in the observation window, the number of flights, the average flight interval, the maximum flight interval in the observation window, the membership date, the end time of the observation window, and the average discount rate. The reasons for this choice are explained below:

  • The first-year total fare, the second-year total fare, and the total flight kilometers in the observation window are used to compute the average fare per kilometer. For an airline, a higher fare or a longer distance does not by itself mean more profit; short-haul, high-end customers can generate greater returns.
  • The total flight kilometers and the number of flights are, of course, also important indicators of a customer's value
  • The membership date shows whether the customer is a long-standing user and indicates loyalty
  • The average flight interval and the maximum flight interval in the observation window show whether the customer flies at a regular frequency
  • The average discount rate reflects the revenue a customer brings per kilometer; after all, higher-value customers tend to have higher discount rates
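
The derived features mentioned in the list could be computed roughly as below. This is only a sketch: the column names are hypothetical, and defining the interval difference as the maximum interval minus the average interval is an assumption, since the post does not give its formula.

```python
import pandas as pd

# Toy records; column names are assumed for illustration
data = pd.DataFrame({
    "SUM_YR_1": [1000.0, 3000.0],     # first-year total fare
    "SUM_YR_2": [800.0, 3000.0],      # second-year total fare
    "SEG_KM_SUM": [6000, 4000],       # total flight kilometers
    "AVG_INTERVAL": [20.0, 15.0],     # average days between flights
    "MAX_INTERVAL": [90.0, 30.0],     # longest gap between flights
})

features = pd.DataFrame({
    # total fare over both years divided by kilometers flown
    "avg_fare_per_km": (data["SUM_YR_1"] + data["SUM_YR_2"]) / data["SEG_KM_SUM"],
    # assumed definition: a large value suggests irregular flying
    "interval_diff": data["MAX_INTERVAL"] - data["AVG_INTERVAL"],
})
print(features)
```

Records with zero flight kilometers would produce NaN or inf in `avg_fare_per_km` and need to be handled separately.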

Feature transformation:

Because the attributes differ widely in range, the data is standardized here.
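
The variable name `filter_zscore_data` used later suggests z-score standardization. A minimal sketch on toy data, with hypothetical column names, might look like this:

```python
import pandas as pd

# Toy features on very different scales (values invented for illustration)
features = pd.DataFrame({
    "flight_count": [2.0, 4.0, 6.0],
    "total_km": [1000.0, 5000.0, 9000.0],
})

# z-score: subtract the column mean, then divide by the column standard deviation
zscore_data = (features - features.mean(axis=0)) / features.std(axis=0)
print(zscore_data)
```

Note that pandas `std()` uses the sample standard deviation (ddof=1) by default, while scikit-learn's `StandardScaler` uses the population standard deviation (ddof=0), so the two give slightly different scales.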

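For K-Means, choosing k is the hard part. Because this is an unsupervised clustering problem, there is no absolutely correct value, and some exploration is needed. Here the SSE is computed for each candidate k to try to find the best value. The function is written as follows: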
import numpy as np
from sklearn.cluster import KMeans

def distEclud(vecA, vecB):
    """
    Return the squared Euclidean distance between two vectors
    """
    return np.sum(np.power(vecA - vecB, 2))
 
def test_Kmeans_nclusters(data_train):
    """
    Compute how the SSE changes for different values of k
    """
    data_train = data_train.values
    nums=range(2,10)
    SSE = []
    for num in nums:
        sse = 0
        # n_jobs was removed from KMeans in scikit-learn 1.0, so it is not passed here
        kmodel = KMeans(n_clusters=num)
        kmodel.fit(data_train)
        # cluster centers
        cluster_ceter_list = kmodel.cluster_centers_
        # cluster index of each sample
        cluster_list = kmodel.labels_.tolist()
        # iterate over the training samples (the original looped over len(data))
        for index in range(len(data_train)):
            cluster_num = cluster_list[index]
            sse += distEclud(data_train[index, :], cluster_ceter_list[cluster_num])
        print("n_clusters =", num, "; SSE =", sse)
        SSE.append(sse)
    return nums, SSE
 
nums, SSE = test_Kmeans_nclusters(filter_zscore_data)
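
The per-sample loop above can also be written as a single vectorized NumPy expression. The sketch below uses hand-picked points, centers, and labels (all invented for illustration) so that the result can be checked by hand; `sse_vectorized` is a hypothetical helper name.

```python
import numpy as np

def sse_vectorized(X, centers, labels):
    """Sum of squared distances from each sample to its assigned center."""
    # centers[labels] picks, for every sample, the center of its cluster
    diffs = X - centers[labels]
    return float(np.sum(diffs ** 2))

# Four 2-D points in two clusters
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
centers = np.array([[0.0, 1.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

sse = sse_vectorized(X, centers, labels)
print(sse)  # 1 + 1 + 4 + 4 = 10.0
```

A fitted scikit-learn `KMeans` model also exposes this quantity directly as its `inertia_` attribute, which avoids recomputing it.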

Plotting

import matplotlib.pyplot as plt

# plot SSE against k to look for a suitable value of k
# display Chinese characters and minus signs correctly
plt.rcParams['font.sans-serif'] = 'SimHei'
plt.rcParams['font.size'] = 12.0
plt.rcParams['axes.unicode_minus'] = False
# use the ggplot plotting style
plt.style.use('ggplot')
## plot the relationship between SSE and the number of clusters
fig=plt.figure(figsize=(10, 8))
ax=fig.add_subplot(1,1,1)
ax.plot(nums,SSE,marker="+")
ax.set_xlabel("n_clusters", fontsize=18)
ax.set_ylabel("SSE", fontsize=18)
fig.suptitle("KMeans", fontsize=20)
plt.show()

 

The plot shows no clear "elbow" point; the SSE simply decreases gradually as k increases. So k = 4, 5, and 6 are each tried, to see whether the analysis results can be used to work backward to a more suitable value. The code for k = 4 is as follows:

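import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

kmodel = KMeans(n_clusters=4) # n_jobs was removed from KMeans in scikit-learn 1.0
kmodel.fit(filter_zscore_data)
# summarize the result
r1 = pd.Series(kmodel.labels_).value_counts() # number of samples in each cluster
r2 = pd.DataFrame(kmodel.cluster_centers_) # cluster centers
# largest and smallest coordinate over all cluster centers
# (renamed so that the built-in max and min are not shadowed)
center_max = r2.values.max()
center_min = r2.values.min()
r = pd.concat([r2, r1], axis = 1) # concatenate horizontally (0 is vertical): each cluster center with its sample count
r.columns = list(filter_zscore_data.columns) + ['count'] # rename the header
 
# plot
fig=plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, polar=True)
center_num = r.values
feature = ["Membership time", "Flight count", "Avg fare per km", "Total mileage", "Interval difference", "Avg discount rate"]
N =len(feature)
for i, v in enumerate(center_num):
    # angles that divide the circle evenly for the radar chart
    angles=np.linspace(0, 2*np.pi, N, endpoint=False)
    # repeat the first point so that the radar polygon closes
    center = np.concatenate((v[:-1],[v[0]]))
    angles=np.concatenate((angles,[angles[0]]))
    # draw the outline
    ax.plot(angles, center, 'o-', linewidth=2, label = "Cluster %d, %d customers"% (i+1,v[-1]))
    # fill the polygon
    ax.fill(angles, center, alpha=0.25)
    # label each feature axis; the repeated closing angle is dropped
    # so that the angles and the labels have the same length
    ax.set_thetagrids(angles[:-1] * 180/np.pi, feature, fontsize=15)
    # set the radial range of the radar chart
    ax.set_ylim(center_min-0.1, center_max+0.1)
    # add the title
    plt.title('Customer group feature analysis', fontsize=20)
    # add grid lines
    ax.grid(True)
    # set the legend
    plt.legend(loc='upper right', bbox_to_anchor=(1.3,1.0),ncol=1,fancybox=True,shadow=True)
    
# show the figure
plt.show()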

Radar charts are drawn in the same way for k = 5 and k = 6 and then analyzed.
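
When repeating the chart for other values of k, the step that closes each radar polygon can be factored into a small helper. The sketch below is one possible way to do it; `closed_radar_coords` is a hypothetical name and the center values are invented.

```python
import numpy as np

def closed_radar_coords(values):
    """Return (angles, values) with the first point repeated at the end."""
    values = np.asarray(values, dtype=float)
    # one evenly spaced angle per feature, covering the full circle
    angles = np.linspace(0, 2 * np.pi, len(values), endpoint=False)
    # append the first point again so the plotted line closes
    return np.concatenate((angles, angles[:1])), np.concatenate((values, values[:1]))

# Hypothetical standardized center of one cluster over six features
angles, center = closed_radar_coords([0.5, -0.2, 1.1, 0.3, -0.7, 0.9])
print(len(angles), len(center))
```

The closed arrays have one more entry than there are features, so axis labels should be attached to `angles[:-1]` only.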

 


Conclusion

Observation shows:

  • When k is 4, each group mixes too much information and its features are not distinct
  • When k is 5, the result is more reasonable: the five groups each have their own characteristics and do not overlap
  • When k is 6, the groups also have their own characteristics, but the features of the fourth cluster are entirely contained in those of the fifth cluster, which is somewhat redundant

In summary, k = 5 gives the best clustering result. All customers are divided into 5 clusters, and further analysis leads to the following conclusions:

  • The first cluster, 10957 customers: its most prominent feature is the largest interval difference. These are likely "seasonal customers"; they should be retained first and then developed further;
  • The second cluster, 14732 customers: its most prominent feature is a long membership time, so these are old customers. Their average discount rate would be expected to be higher, yet it is low within the observation window, and their total mileage and flight count are not high either. These are likely lapsing customers who need to be won back;
  • The third cluster, 22188 customers: all indicators are relatively low; these are general or low-value customers;
  • The fourth cluster, 8724 customers: its most prominent feature is the highest average fare per kilometer and the highest average discount rate. These are likely business travelers flying in higher cabin classes. They should be a focus for both retention and development, and preferential policies should be actively offered to increase their number of flights;
  • The fifth cluster, 5443 customers: the highest total mileage and flight count, with a relatively high average fare per kilometer as well. These are the key customers to retain;
  • The results agree with the 80/20 rule (Pareto principle): the low-value second and third clusters contain the most customers, while the higher-value fourth and fifth clusters contain fewer.
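
As a quick arithmetic check of the last point, the cluster sizes reported above can be turned into shares. This sketch assumes the 8724-customer cluster is the fourth one, as the final bullet implies.

```python
# Cluster sizes as reported in the conclusions (cluster number -> customers)
cluster_sizes = {1: 10957, 2: 14732, 3: 22188, 4: 8724, 5: 5443}

total = sum(cluster_sizes.values())
low_value_share = (cluster_sizes[2] + cluster_sizes[3]) / total
high_value_share = (cluster_sizes[4] + cluster_sizes[5]) / total

print(f"total customers: {total}")
print(f"low-value share (clusters 2 + 3): {low_value_share:.1%}")
print(f"high-value share (clusters 4 + 5): {high_value_share:.1%}")
```

The total matches the 62044 samples kept after cleaning. The two low-value clusters hold about 59.5% of customers and the two higher-value clusters about 22.8%.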
     


Origin blog.csdn.net/seagal890/article/details/105257644