In-depth analysis | How to build a user tiering system for fine-grained operations? A hands-on case study included

Fine-grained user classification, also called user portraits (user profiling), is a very common operations technique. Its aim is to serve customers of different types better, improve the conversion rate at each step, and extract as much customer value and profit as possible.

So how do you build user portraits, or a fine-grained operations system? The process is really data work:

  • Collect and centralize the data relevant to the portrait
  • Find the data that is strongly related to the business within the same scenario
  • Classify and tag the data (quantitative to qualitative)
  • Bring in external data as the business requires
  • Screen customers according to business needs (the role of a DMP)

The following case shows how to use data mining to classify users finely, taking the insurance industry as an example.

First, customer segmentation

Customer segmentation classifies customers along segmentation dimensions. In the insurance industry these dimensions generally fall into five categories: social characteristics, natural attributes, behavioral characteristics, attitudes and preferences, and lifestyle and personality.

The first three are ex ante classification dimensions: externally visible factors that can be learned simply through contact with the customer. The last two are ex post classification dimensions: they can only be understood through research and reflect the intrinsic differences between customers. For customer segmentation we usually classify by the ex post dimensions first, so that the segments are thoroughly separated, and then describe and validate the segments with the ex ante dimensions, so that the segments are both distinct and reachable. Here we read the survey data from Excel with Python, extract the ex post classification dimensions and check their types. There are nine dimensions, all numeric, and some of them appear to be correlated. Such correlation can cause information to overlap and inflate, which increases classification bias, so we first run a factor analysis on these nine dimensions.

1.1 Factor Analysis

Factor analysis converts many measured variables into a few composite indicators (also called latent variables); it is an expression of the idea of dimensionality reduction. By grouping highly correlated variables together, it reduces the number of variables to analyze and the complexity of the problem.

Factor analysis presupposes that the variables are correlated to some degree, so the data must pass the KMO test and Bartlett's test of sphericity before factor analysis is applied.

Before factor analysis, run the KMO test and Bartlett's test of sphericity. Only when the KMO value is greater than 0.5 and the p-value of Bartlett's test (the significance probability of the chi-square statistic) is less than 0.05 does the questionnaire have construct validity, and only then can factor analysis be carried out. Factor analysis is mostly applied to questionnaires you designed yourself, so you have to consider the reliability and validity of the survey data, that is, whether the questionnaire actually measures what you want it to measure.

The sphericity test is mainly used to check the distribution of the data and the degree of independence between the variables. Put simply, in the ideal case, if we have one variable, all the data lie on one line. If there are two completely independent variables, all the data lie on two perpendicular lines. If there are three completely independent variables, all the data lie on three mutually perpendicular lines. With n completely independent variables, the data lie on n mutually perpendicular lines, and if the range of each variable is roughly equal, the data are distributed as if inside a sphere. Picture countless arrows passing through a single center, and that is roughly the idea. Bartlett's test takes this spherical, fully uncorrelated situation as its null hypothesis. Only when the test rejects it (p < 0.05) are the variables correlated enough for factor analysis to have common factors to extract.
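
The two checks can be sketched with NumPy and SciPy as follows. This is a minimal illustration, not the code used for the article's data: the `survey` matrix is synthetic (two assumed latent factors driving nine items), and the helper names `bartlett_sphericity` and `kmo` are hypothetical.

```python
import numpy as np
from scipy import stats


def bartlett_sphericity(data):
    """Bartlett's test of sphericity; H0: the correlation matrix is the identity."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    # chi2 = -(n - 1 - (2p + 5) / 6) * ln|R|, with p(p - 1) / 2 degrees of freedom
    _, logdet = np.linalg.slogdet(corr)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) / 2
    return chi2, stats.chi2.sf(chi2, dof)


def kmo(data):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations are obtained from the inverse correlation matrix
    partial = -inv / np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    a2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + a2)


# Synthetic stand-in for the nine survey items (assumption, not the real data)
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
weights = rng.uniform(0.5, 1.0, size=(2, 9))
survey = latent @ weights + rng.normal(scale=0.5, size=(500, 9))

chi2, p_value = bartlett_sphericity(survey)
kmo_value = kmo(survey)
print(f"Bartlett chi2={chi2:.1f}, p={p_value:.3g}, KMO={kmo_value:.3f}")
```

With real survey data, factor analysis would go ahead only if `kmo_value > 0.5` and `p_value < 0.05`.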

Once the suitability tests are passed, we run the factor analysis:

Look at the eigenvalues and variance contributions of the nine common factors. A common choice is to keep the common factors whose cumulative variance contribution exceeds 0.8. This article instead keeps the common factors whose eigenvalue is greater than 1, which gives the first four common factors, with a cumulative variance contribution of 0.697. The model is then refitted with four common factors.
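
Both selection rules can be sketched from the eigenvalues of the correlation matrix. This is an illustrative sketch on synthetic data (the `survey` matrix and its three latent factors are assumptions), so the counts it prints will not match the article's result of four factors.

```python
import numpy as np

# Synthetic stand-in for nine survey items (assumption, not the real data)
rng = np.random.default_rng(1)
latent = rng.normal(size=(400, 3))
loadings = rng.normal(size=(3, 9))
survey = latent @ loadings + rng.normal(scale=0.8, size=(400, 9))

corr = np.corrcoef(survey, rowvar=False)
# eigvalsh returns ascending eigenvalues; reverse to get the largest first
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

explained = eigenvalues / eigenvalues.sum()   # variance contribution per factor
cumulative = np.cumsum(explained)             # cumulative variance contribution

# Rule used in the article: keep factors with eigenvalue > 1
n_kaiser = int(np.sum(eigenvalues > 1))
# Alternative rule: smallest number of factors reaching 80% cumulative variance
n_cumulative = int(np.searchsorted(cumulative, 0.8) + 1)

print(n_kaiser, n_cumulative)
```

The two rules can disagree, as they do in the article, where the eigenvalue rule stops at a cumulative contribution of 0.697.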

Looking at the communalities (how well the extracted factors explain each original dimension), we find that with four common factors all nine dimensions are explained to a degree above 0.6, which shows that the four extracted common factors have reasonable explanatory power for the original dimensions.
Next, look at the factor loadings of the four common factors to decide whether rotation is needed.

Take the first dimension as an example. The loadings of the four common factors on it are 0.418, -0.046, 0.697 and 0.293, so common factors 1 and 3 both load noticeably on this dimension and are not clearly separated. This falls short of what factor analysis is meant to achieve, so a rotation is needed to give each common factor a distinct character.

Taking the first dimension again: after varimax rotation, the loadings of the four common factors on it are -0.069, 0.153, 0.203 and 0.824, i.e. common factor 4 now explains this dimension far better than the others. After rotation, the four factors load very differently on the original dimensions, i.e. the four common factors have distinct characteristics.
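
To show what the rotation does, here is a minimal NumPy sketch of varimax using the common SVD-based iteration. It is not the routine the article used; the function names and the small `unrotated` loading matrix are hypothetical, built by deliberately mixing a clean two-factor structure.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion with respect to the rotated loadings
        target = rotated ** 3 - rotated @ np.diag(np.sum(rotated ** 2, axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation, rotation


def simplicity(loadings):
    """Varimax criterion: summed variance of the squared loadings per factor."""
    return float(np.sum(np.var(loadings ** 2, axis=0)))


# Hypothetical example: a clean structure mixed by a 30-degree rotation
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.0, 0.7], [0.0, 0.85]])
theta = np.deg2rad(30)
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
unrotated = simple @ mix

rotated, rotation = varimax(unrotated)
print(np.round(unrotated, 3))
print(np.round(rotated, 3))
```

Because the rotation is orthogonal, each variable's communality (row sum of squared loadings) stays the same; only the way it is split across factors changes.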

1.2 Cluster analysis

After the factor analysis, every customer is assigned to one of four distinct factor types (which stand in for the nine ex post classification dimensions). We then run a cluster analysis on two dimensions: factor type and premium amount.

Cluster analysis divides a data set into several classes according to the similarities and differences in the data. Commonly used methods are k-means, DBSCAN and hierarchical clustering. This article uses hierarchical clustering because it places few demands on the data type and does not require the number of classes to be known in advance; its drawback is the heavy computation.

Hierarchical clustering builds a nested tree of clusters by computing the similarity between data points of different classes. In the tree, the raw data points form the lowest layer and the top of the tree is the root node of a single cluster. The tree can be built in two ways: bottom-up merging (agglomerative) and top-down splitting (divisive).

As an illustration, suppose you are a company's human resources manager. You can organize all employees into large clusters such as supervisors, managers and staff, and then split these into smaller clusters; for example, the staff cluster can be divided into the sub-clusters officers, clerks and interns. Together these clusters form a hierarchy, which makes it easy to aggregate or characterize the data at every level.

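from scipy.cluster import hierarchy

# Column names: 因子类型 = factor type, 保费金额 = premium amount,
# 问卷编号 = questionnaire ID, 细分类型 = segment type
# Factor type and premium amount are on different scales, so standardize both (z-scores)
result['因子类型'] = result['因子类型'].astype('int64')
result['Z因子类型'] = (result['因子类型']-result['因子类型'].mean())/result['因子类型'].std()
result['Z保费金额'] = (result['保费金额']-result['保费金额'].mean())/result['保费金额'].std()
result = result.set_index(result['问卷编号'])
# Hierarchical clustering (Ward linkage on Euclidean distance)
Z = hierarchy.linkage(result[['Z保费金额', 'Z因子类型']],
                      method='ward', metric='euclidean')
# Pass the labels as a plain list so the dendrogram accepts them
hierarchy.dendrogram(Z, labels=result.index.tolist())
# Judging from the dendrogram, five classes fit well, i.e. a cut height of about 13
label = hierarchy.cut_tree(Z, height=13)
label = label.reshape(label.size,)
result['细分类型'] = list(label)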
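
Reading a cut height off the dendrogram is one option; when the number of classes is already decided, `cut_tree` also accepts `n_clusters`. The sketch below illustrates this on synthetic two-dimensional points (the five `centers` and the noise level are assumptions chosen so the groups are well separated).

```python
import numpy as np
from scipy.cluster import hierarchy

# Synthetic data: five well-separated groups of 30 points each (assumption)
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [4, 0], [0, 4], [4, 4], [8, 2]])
points = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

# Same linkage settings as in the article: Ward linkage, Euclidean distance
Z = hierarchy.linkage(points, method='ward', metric='euclidean')

# Ask directly for five clusters instead of choosing a cut height
labels = hierarchy.cut_tree(Z, n_clusters=5).ravel()
sizes = np.bincount(labels)
print(sizes)
```

Cutting by height and cutting by cluster count describe the same tree; the height is simply where the chosen number of branches remains.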

Hierarchical clustering divides all customers into five classes. Next we test how good the classification is. For both premium amount and factor type, grouping by the hierarchical clusters gives p < 0.05, i.e. there are significant differences between the groups and the clustering works well.

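import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Use one-way ANOVA to test the clustering effect
a = anova_lm(ols('保费金额~C(细分类型)', data=result[['保费金额', '细分类型']]).fit())[:1]
b = anova_lm(ols('因子类型~C(细分类型)', data=result[['因子类型', '细分类型']]).fit())[:1]
# Keep only the F statistic and the p-value columns
f_oneway_result = pd.concat([a.iloc[:, 3:], b.iloc[:, 3:]])
f_oneway_result['列名'] = ['保费金额', '因子类型']  # 列名 = column name
print(f_oneway_result)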
                  F         PR(>F)       列名
C(细分类型)  306.108565  1.157673e-152  保费金额
C(细分类型)  742.643495  1.999808e-251  因子类型

The one-way ANOVA tells us that the segment types differ significantly, but how do we show this difference? For categorical data we compare proportions; for numerical data we compare means. We then name each segment type according to its premium amount and factor type.

# Compare premium amount by mean and factor type by proportion
nor = pd.crosstab(result['细分类型'], result['因子类型'],
                  normalize=0)  # normalize=0 computes row-wise proportions
mean = result.groupby('细分类型')['保费金额'].mean()
result_xf = pd.concat([nor, mean], axis=1)
print(result_xf)
         1         2         3         4         保费金额
细分类型                                                     
0     0.000000  0.603774  0.396226  0.000000  1481.796226
1     0.657407  0.342593  0.000000  0.000000  2098.268056
2     0.000000  0.000000  0.578947  0.421053  2779.996241
3     0.000000  0.000000  0.000000  1.000000  1708.326829
4     0.530864  0.259259  0.197531  0.012346  3780.096296
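# Name each segment type:
# 0 low-end homebody, 1 mid-range enjoyment, 2 mid-range outgoing,
# 3 mid-range confident, 4 high-end enjoyment
result['细分类型'] = result['细分类型'].map(
    {0: '低端居家型客户', 1: '中端享受型客户', 2: '中端外向型客户', 3: '中端自信型客户', 4: '高端享受型客户'})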

Second, select target customers

After segmenting the customers, we select the target customers. Target customers are assessed on two dimensions: customer attractiveness and enterprise competitiveness.

Enterprise competitiveness is reflected mainly in the share of each customer segment that each insurance company holds, i.e. market share.

Customer attractiveness has two components: customer size and premium amount. Based on the company's own needs, the two are weighted 6:4 to compute customer attractiveness.

# Compute customer attractiveness and enterprise competitiveness
# (客户数量 = customer count, 客户规模 = customer size, 标准化 = standardized,
#  客户吸引力 = customer attractiveness, 企业竞争力 = enterprise competitiveness)
result_final = pd.DataFrame()
result_final['客户数量'] = result.groupby('细分类型')['问卷编号'].count()
result_final['保费金额'] = result.groupby('细分类型')['保费金额'].mean()
result_final['客户规模'] = result_final['客户数量']/result_final['客户数量'].sum()
result_final['客户规模标准化'] = (
    result_final['客户规模']-result_final['客户规模'].mean())/result_final['客户规模'].std()
result_final['保费金额标准化'] = (
    result_final['保费金额']-result_final['保费金额'].mean())/result_final['保费金额'].std()
# Attractiveness = 0.6 * standardized customer size + 0.4 * standardized premium
result_final['客户吸引力'] = 0.6*result_final['客户规模标准化']+0.4*result_final['保费金额标准化']
# Share of each segment choosing each insurer; 甲/乙/丙/丁 = companies A/B/C/D
result2 = pd.crosstab(result['细分类型'], result['保险公司的选择'], normalize=0)
result2.columns = ['甲', '乙', '丙', '丁']
result_final['企业竞争力'] = result2['甲']
print(result_final)

            客户数量   保费金额   客户规模   客户规模标准化   保费金额标准化   客户吸引力   企业竞争力
细分类型                                                                        
中端享受型客户   216  2098.268056  0.303371  1.477388 -0.291968  0.769645  0.240741
中端外向型客户   133  2779.996241  0.186798 -0.188688  0.441347  0.063326  0.458647
中端自信型客户   123  1708.326829  0.172753 -0.389420 -0.711415 -0.518218  0.162602
低端居家型客户   159  1481.796226  0.223315  0.333215 -0.955087 -0.182106  0.119497
高端享受型客户    81  3780.096296  0.113764 -1.232494  1.517124 -0.132647  0.320988
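# Matrix (quadrant) chart: enterprise competitiveness vs. customer attractiveness
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = 'Simhei'  # font that can render the Chinese labels
plt.rcParams['axes.unicode_minus'] = False
plt.subplot(1, 1, 1)
plt.scatter(result_final['企业竞争力'],
            result_final['客户吸引力'], s=200, c='r', marker='o')
# Reference lines that split the chart into four quadrants
plt.hlines(y=0, xmin=0, xmax=0.5)
plt.vlines(x=0.25, ymin=-1.2, ymax=1.2)
plt.xlabel('企业竞争力')
plt.ylabel('客户吸引力')
# Label each point with its segment name
for a, b, c in zip(result_final['企业竞争力'], result_final['客户吸引力'], result_final.index):
    plt.text(a, b, c, ha='center', va='bottom', fontsize=10)
plt.show()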

From the chart above, mid-range outgoing customers are Company A's preferred target, followed by mid-range enjoyment customers and high-end enjoyment customers. Low-end homebody customers and mid-range confident customers can be set aside for now if resources are insufficient.
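
The reading of the matrix chart can be reproduced from the printed `result_final` values. In the sketch below the numbers are copied from that output and the split points (0.25 and 0) from the reference lines of the plot; the function name `quadrant` and the three tier labels are hypothetical, added only for illustration.

```python
# (enterprise competitiveness, customer attractiveness) per segment,
# copied from the result_final output in the article
segments = {
    '中端享受型客户': (0.240741, 0.769645),   # mid-range enjoyment
    '中端外向型客户': (0.458647, 0.063326),   # mid-range outgoing
    '中端自信型客户': (0.162602, -0.518218),  # mid-range confident
    '低端居家型客户': (0.119497, -0.182106),  # low-end homebody
    '高端享受型客户': (0.320988, -0.132647),  # high-end enjoyment
}


def quadrant(competitiveness, attractiveness, x_split=0.25, y_split=0.0):
    """Assign a tier from the quadrant a segment falls in (labels are illustrative)."""
    strong = competitiveness >= x_split
    attractive = attractiveness >= y_split
    if strong and attractive:
        return 'priority target'
    if strong or attractive:
        return 'secondary target'
    return 'deprioritize'


tiers = {name: quadrant(*values) for name, values in segments.items()}
for name, tier in tiers.items():
    print(name, tier)
```

Only the mid-range outgoing segment lies in the quadrant that is both attractive and competitive, which is why it is ranked first.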

Third, profile the target customers

Profiling the target customers has two parts. The first is what the target customers look like, i.e. the user portrait, described with the ex ante classification dimensions. The second is what the target customers need, so that marketing can address those needs precisely. The analysis again starts with ANOVA; dimensions that show significant differences are compared by mean or proportion, and finally correspondence analysis is used to present the result.

3.1 Target customer portrait

There are seven ex ante classification dimensions: city, age, gender, monthly household income, car price, education and occupation. ANOVA shows that education and occupation do not differ significantly across the segment types, so these two dimensions are dropped and the analysis continues with the rest.

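# ANOVA on the ex ante classification dimensions
# (性别 gender, 年龄 age, 城市 city, 家庭月收入 monthly household income,
#  汽车价格 car price, 学历 education, 职业 occupation)
result['职业'] = result['职业'].replace(' ', '6').astype('int64')  # blank occupation recoded as 6
target_sd = []
for i in ['性别', '年龄', '城市', '家庭月收入', '汽车价格', '学历', '职业']:
    formula = f'{i} ~ C(细分类型)'
    a = anova_lm(ols(formula, data=result[[i, '细分类型']]).fit())[:1]
    target_sd.append(pd.DataFrame(
        {'c': str(i), 'F': a['F'], 'PR(>F)': a['PR(>F)']}))

target_result = pd.concat(target_sd)
# Keep only the dimensions that differ significantly (p < 0.05)
target_result = target_result[target_result['PR(>F)'] < 0.05]

# Show the dimensions with significant differences
print(target_result)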
                c            F         PR(>F)
C(细分类型)     性别    57.940193   2.614665e-42
C(细分类型)     年龄   553.274636  4.801252e-216
C(细分类型)     城市  3629.629395   0.000000e+00
C(细分类型)  家庭月收入   268.460859  3.193752e-140
C(细分类型)   汽车价格   901.193079  7.780527e-276
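# Replace the numeric codes of the significant dimensions with readable labels
import prince

# Work on a copy so the assignments below do not trigger SettingWithCopyWarning
Y = result[['性别', '年龄', '城市', '家庭月收入', '汽车价格', '细分类型']].copy()
Y['性别'] = Y['性别'].map({1: '男', 2: '女'})  # male, female
Y['年龄'] = Y['年龄'].map({1: '18-30岁', 2: '31-40岁', 3: '41岁以上'})  # 18-30, 31-40, 41 and over
# Beijing, Shanghai, Wuhan, Shenyang, Guangzhou, Xi'an, Chengdu
Y['城市'] = Y['城市'].map(
    {1: '北京', 2: '上海', 3: '武汉', 4: '沈阳', 5: '广州', 6: '西安', 7: '成都'})
# Monthly household income bands, in yuan
Y['家庭月收入'] = Y['家庭月收入'].map(
    {1: '小于7000元', 2: '7000-10000元', 3: '10000-15000元', 4: '15000-20000元', 5: '20000元以上'})
# Car price bands; 万元 = 10,000 yuan
Y['汽车价格'] = Y['汽车价格'].map(
    {1: '10万元以下', 2: '10-20万元', 3: '20-30万元', 4: '30万元以上'})

# Multiple correspondence analysis (MCA)
mca = prince.MCA(n_components=2, n_iter=10, random_state=1)
mca = mca.fit(Y)
# plot_coordinates is the plotting API of older prince releases
ax = mca.plot_coordinates(
    X=Y,
    ax=None,
    figsize=(10, 6),
    show_row_points=False,
    show_column_points=True,
    column_points_size=100,
    show_column_labels=True,
    legend_n_cols=1
)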

The correspondence analysis chart shows that Company A's preferred target customers, the mid-range outgoing customers, are mainly located in Beijing and Wuhan, in a higher proportion than the other segment types. They are concentrated in the 31-40 age group, are mostly male, have a monthly household income of 15,000-20,000 yuan, and own cars priced at 200,000-300,000 yuan.

3.2 Target customer needs analysis

Which needs do the target customers care about most, and how do we analyze them? We again start from the customers of each segment type and run ANOVA on each dimension; dimensions that pass the ANOVA test are compared by mean or proportion. Dimensions that do not pass ANOVA are compared directly, using means for numeric data and proportions for categorical data, with the mid-range outgoing customers as the reference.


From the above analysis:

  • When choosing an insurance company, Company A's target customers, the mid-range outgoing customers, care most about the number of service outlets, recommendations from relatives and friends, and trust in the salesperson, with recommendations from relatives and friends mattering most.
  • In the satisfaction analysis, mid-range outgoing customers turn out to be dissatisfied with the auto insurance they currently hold: satisfaction is only 1.5%, so there is much room for improvement. The specific reasons for the dissatisfaction need further research.
  • Mid-range outgoing customers pay an average auto insurance premium of 2,780 yuan and care more about personalization than the other customer segments, so pricing strategy and a range of personalized products are worth studying.

Summary

Finally, there are many user portrait methods on the market, and many companies offer user portrait services; user portraits have become a fashionable thing to do. Financial companies were among the first to use user portraits because they hold a wealth of data. Yet faced with so many data dimensions, financial companies often do not know where to start. They assume that the more dimensions and the richer the data, the better; some even assign weights to the input data and build models, turning the user portrait into a huge and complex project. After spending a great deal of effort, they find that all they have is a portrait, far removed from the business and unable to support business operations directly. The huge investment yields little return, the loss outweighs the gain, and it is hard to justify to management.

In fact, the data dimensions of a user portrait need to be tied to business scenarios: simple yet strongly related to the business, easy to screen on and easy to act on. User portraits should follow three principles: base them on demographic attributes and credit information, use strongly correlated information, and use qualitative data.

1, Base the portrait on credit information and demographic attributes

Among the many kinds of information that describe a user, credit information is an important part of the user portrait: it describes a person's spending power in society. The goal of any business user portrait is to find target customers, and these must be users with potential spending power. Credit information directly demonstrates a customer's spending power, so it is the most important and most basic information in a user portrait. As the joke goes, "all information is credit information", and there is truth in it. Credit information covers the consumer's job, income, education, property and so on.

Once the target customers are identified, the company needs to reach them, and this is where demographic attributes come in: they serve as contact information. Demographic attributes include name, gender, telephone number, email address, home address and so on. This information helps the company contact customers and sell products and services to them.

2, Use strongly correlated information and ignore weakly correlated information

Strongly correlated information is information directly related to the needs of the scenario. It may be causal information or information with a high degree of correlation.

If the correlation coefficient is defined on a range of 0 to 1, information with a coefficient of 0.6 or above should be treated as strongly correlated. For example, other things being equal, 35-year-olds earn more on average than 30-year-olds, computer science graduates earn more on average than philosophy graduates, people working in finance earn more on average than those in the textile industry, and the average wage in Shanghai is higher than in Hainan Province. This shows that age, education, occupation and location have a large effect on income, i.e. they are strongly correlated with income level. Put simply, information that has a large effect on credit information is strongly correlated information, and information with little effect is weakly correlated information.

Other user information, such as height, weight, name and zodiac sign, can hardly be shown to affect spending power. This is weakly correlated information: it has little effect on the user's credit and spending power, has little commercial value, and should not be included in the user portrait.

User portraits and user analysis should consider strongly correlated information and leave out weakly correlated information. This is one principle of user portraits.
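
As a rough sketch of this screening rule, the snippet below keeps only the attributes whose absolute correlation with income is at least 0.6. The data set is entirely synthetic and the column names (`age`, `education_years`, `height_cm`, `income`) are hypothetical; income is generated from age and education on purpose, so those two come out as strongly correlated while height does not.

```python
import numpy as np
import pandas as pd

# Synthetic users (assumption): income depends on age and education, not on height
rng = np.random.default_rng(7)
n = 1000
age = rng.integers(22, 60, size=n)
education_years = rng.integers(9, 22, size=n)
height_cm = rng.normal(168, 8, size=n)
income = 300 * age + 900 * education_years + rng.normal(0, 500, size=n)

users = pd.DataFrame({
    'age': age,
    'education_years': education_years,
    'height_cm': height_cm,
    'income': income,
})

# Absolute correlation of every candidate attribute with income
corr_with_income = users.drop(columns='income').corrwith(users['income']).abs()

# Keep attributes at or above the 0.6 threshold named in the text
strong = corr_with_income[corr_with_income >= 0.6].index.tolist()
weak = corr_with_income[corr_with_income < 0.6].index.tolist()
print(corr_with_income.round(3))
print(strong, weak)
```

On real data the threshold is a business choice; 0.6 is simply the cut-off suggested above.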

3, Convert quantitative information into qualitative information

The purpose of a user portrait is to select target customers for a product. Quantitative information does not lend itself to customer screening, so it needs to be converted into qualitative information that lets you filter people by category.

For example, customers can be divided by age: 18-25 is defined as young, 25-35 as young-to-middle-aged, 36-45 as middle-aged, and so on. Using personal income, people can be classified as high-income, middle-income or low-income. Using asset information, customers can likewise be classified as high, medium or low. The categories and the way of classifying depend on the needs of your own business; there is no fixed pattern.

Grouping quantitative information into categories and turning it into qualitative labels makes it easier to filter users and locate target customers quickly. This is another principle of user portraits.
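
A small sketch of this conversion with `pandas.cut` follows. The age bands follow the example in the text (with age 25 assigned to the younger band, since the text lists it in both); the income cut points, the label names and the sample `customers` table are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical customers
customers = pd.DataFrame({
    'age': [18, 25, 30, 40],
    'monthly_income': [5000, 9000, 16000, 30000],
})

# Age bands from the text: (17, 25], (25, 35], (35, 45]
customers['age_group'] = pd.cut(
    customers['age'],
    bins=[17, 25, 35, 45],
    labels=['young', 'young-to-middle-aged', 'middle-aged'],
)

# Income bands with assumed cut points
customers['income_group'] = pd.cut(
    customers['monthly_income'],
    bins=[0, 7000, 15000, float('inf')],
    labels=['low income', 'middle income', 'high income'],
)

# Screening by category is now a simple filter on labels
target = customers[
    (customers['age_group'] == 'young-to-middle-aged')
    & (customers['income_group'] == 'high income')
]
print(customers)
print(target)
```

Once the values are labels, screening target customers becomes a matter of combining categories rather than tuning numeric ranges each time.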

4, Keep the user portrait method simple

User portraits need to be tied to business needs. From a practical point of view, for example, we divide user portrait information into five categories: social characteristics, natural attributes, behavioral characteristics, attitudes and preferences, and lifestyle and personality. These cover the strongly business-related information that is needed, and combined with external scenario data they can yield great commercial value. Overly complex portraits with, say, eight or ten dimensions are not conducive to business use; other valuable information can basically be folded into these five dimensions. Making user portraits too complex adds little business value.

That is all for this article; I hope it is useful.

Origin blog.csdn.net/yuanziok/article/details/104895876