Telling stories with data: Top 10 handy statistics/machine learning metrics


Statistics and machine learning provide the theoretical basis for data analysis. I read a lot of statistics books when I was getting started, and the complicated formulas and derivations left me confused for a while. For data scientists/analysts, what matters more is how to apply statistical knowledge to our own analysis scenarios. Based on real application scenarios from data analysis work, this article shares some handy statistics/machine learning metrics; it does not elaborate on basic metrics, principles, or formula derivations.

This article is the third article in the series.

Part 1: Telling stories with data: Summary of 13 Excel advanced skills

Part 2: Telling stories with data: 17 Python usage summary based on analysis scenarios


Short-term & long-term metric growth calculation

▐ Short-term growth rate

  1. Basic growth rates: the metric's absolute growth rate (e.g. growth relative to the overall market) and the relative ranking growth rate (change in ranking)

  2. Mixed growth rate = GMV growth rate + relative ranking growth rate

  3. Weighted mixed growth rate = metric growth rate × log(1 + metric value)
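
A minimal Python sketch of how these growth rates could be combined, under my own reading of the formulas above (the variable names and sample numbers are illustrative, not the author's):

import numpy as np

# Hypothetical period-over-period figures for one product
gmv_growth = 0.25       # 25% GMV growth vs. the previous period
rank_growth = 0.10      # 10% improvement in relative ranking
gmv = 50000.0           # current GMV, used as the scale term

mixed_growth = gmv_growth + rank_growth        # mixed growth rate
weighted_growth = gmv_growth * np.log1p(gmv)   # growth rate weighted by log(1 + metric value)
print(mixed_growth, weighted_growth)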

▐ Long-term growth trend: CAGR (compound annual growth rate)

CAGR stands for compound annual growth rate, a way of measuring the average growth rate of a metric over a period of time. It is often used for metrics such as return on investment and sales growth.

CAGR = (ending value / beginning value)^(1/n) − 1, where n is the number of compounding periods between the beginning and ending values.

Example: the start value is 5, the end value is 20, and the values are 2 periods apart, so the compound growth rate = (20 / 5)^(1/2) − 1 = 100%.

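The same calculation in Python (a small helper added here for illustration; the numbers reproduce the example above):

def cagr(begin_value, end_value, periods):
    # compound growth rate over `periods` compounding periods
    return (end_value / begin_value) ** (1.0 / periods) - 1

print(cagr(5, 20, 2))  # 1.0, i.e. 100%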

Indicator Trend Forecasting: Time Series Method

Data analysis has three purposes: describing the current situation, locating causes, and predicting the future. Trend forecasting analyzes past and present data to predict what comes next and support decision-making.

▐ Linear trend forecast: FORECAST.LINEAR()

It predicts a value from existing or past values: given the independent variable x, it predicts y using a fitted linear regression. This function works best when the data has a linear trend, i.e. y depends roughly linearly on x.

Syntax: FORECAST.LINEAR(x, known_ys, known_xs)

Example: select the data, insert a scatter chart with smooth lines and data markers, add a trend line and display its formula, then use the formula to predict future values.

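Outside Excel, the same idea can be sketched with a least-squares line in Python (numpy's polyfit; the series below is made up for illustration):

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)     # period index
y = np.array([10.0, 12.1, 13.9, 16.2, 18.0])   # hypothetical metric values
slope, intercept = np.polyfit(x, y, 1)         # fit y = slope * x + intercept
print(slope * 6 + intercept)                   # forecast for period 6, same idea as FORECAST.LINEAR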

▐ Seasonal forecast: FORECAST.ETS()

E-commerce data is often seasonal, and Excel provides a more advanced forecasting function for it. FORECAST.ETS forecasts with triple exponential smoothing, a weighted method in which the older a value is, the smaller its weight, i.e. the less it matters.

  • FORECAST.ETS.SEASONALITY()

It returns the length of the seasonal cycle detected from historical data. If the data repeats every 3 months, its cycle length is 3.


  • FORECAST.ETS()

Its 4th parameter is the length of the seasonal pattern; the default value of 1 means seasonality is detected automatically.


Example: FORECAST.ETS.SEASONALITY() tells us the data's cycle length is 3, so we pass 3 as the fourth parameter.


  • FORECAST.ETS.CONFINT()

It returns the confidence interval of the forecast at the specified target date. The default confidence level is 95%, meaning the actual value is expected to fall within the forecast ± this interval about 95% of the time.

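For reference, a rough Python equivalent using triple exponential smoothing (Holt-Winters) from statsmodels; the toy series below has a 3-period cycle and is invented for illustration, so treat it as a sketch rather than a reproduction of FORECAST.ETS:

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# hypothetical series with a seasonal cycle of length 3
y = pd.Series([12, 20, 31, 14, 23, 34, 15, 25, 37, 17, 27, 40], dtype=float)
model = ExponentialSmoothing(y, trend='add', seasonal='add', seasonal_periods=3).fit()
print(model.forecast(3))  # forecast the next 3 points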

Comparing rates across very different sample sizes: Wilson score

When we run A/B tests or other analyses, we often need to compare products' click-through rates & conversion rates.

Example: product A has 1,000 exposure UV and 15 click UV; product B has 100,000 exposure UV and 1,000 click UV. Product A's click-through rate is 1.5% and product B's is 1%. Can we conclude users prefer product A? Not necessarily, because the sample sizes of A and B differ greatly.

So how do we judge? The Wilson score balances out the effect of sample-size differences and addresses the accuracy problem of small samples. In essence, the Wilson interval is an interval estimate of the true rate (e.g. the like rate or click-through rate). Unlike a naive point estimate, it accounts for the case where the sample is very small and corrects the interval according to the sample size, so that estimates based on very different sample sizes become comparable. This scoring method is widely used in website rankings, for example Zhihu's search ranking.

Lower bound of the Wilson score interval: ( p̂ + z²/(2n) − z·sqrt( p̂(1 − p̂)/n + z²/(4n²) ) ) / (1 + z²/n), where p̂ is the observed positive rate (e.g. click-through rate), n is the sample size, and z = 1.96 for a 95% confidence level.


from odps.udf import annotate
import numpy as np

@annotate('string->string')
class wilsonScore(object):
    # Lower bound of the Wilson score interval (z = 1.96 for a 95% confidence level)
    def evaluate(self, input_data):
        pos = float(input_data.split(',')[0])    # number of positive events (e.g. clicks)
        total = float(input_data.split(',')[1])  # total sample size (e.g. impressions)
        p_z = 1.96
        pos_rat = pos / total  # positive rate
        score = (pos_rat + (np.square(p_z) / (2. * total))
                 - ((p_z / (2. * total)) * np.sqrt(4. * total * (1. - pos_rat) * pos_rat + np.square(p_z)))) / \
                (1. + np.square(p_z) / total)
        return str(score)

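Applying the same lower bound to the A/B example above, as a plain-Python sketch (the helper mirrors the UDF's formula; the approximate outputs are my own calculation):

import numpy as np

def wilson_lower_bound(pos, total, z=1.96):
    # lower bound of the Wilson score interval for a binomial proportion
    if total == 0:
        return 0.0
    p = pos / total
    centre = p + z ** 2 / (2 * total)
    margin = z * np.sqrt((p * (1 - p) + z ** 2 / (4 * total)) / total)
    return (centre - margin) / (1 + z ** 2 / total)

print(wilson_lower_bound(15, 1000))       # product A: ~0.0091
print(wilson_lower_bound(1000, 100000))   # product B: ~0.0094, higher despite the lower raw CTR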

Time Decay Function: Sigmoid

In analysis work we often need to score users/products/merchants based on their long-term historical performance, for value measurement and resource allocation. The sigmoid function, also called the logistic function, maps any real number into the interval (0, 1).

sigmoid(x) = 1 / (1 + e^(−x))

Example: Sigmoid scoring for historical product performance (click-through rate & GMV)

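A minimal sketch of sigmoid scoring in Python, assuming the metric is standardized (z-scored) first so the scores spread across (0, 1); the standardization step and the sample numbers are my own assumption, not the author's exact recipe:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

gmv = np.array([1200.0, 5400.0, 800.0, 23000.0, 9700.0])  # hypothetical historical GMV
z = (gmv - gmv.mean()) / gmv.std()                        # standardize before scoring
score = sigmoid(z)                                        # values mapped into (0, 1)
print(score)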

Three statistical correlation coefficients: Pearson & Spearman & Kendall

▐ Correlation for numerical, normally distributed data: Pearson correlation coefficient

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / sqrt( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

Both Excel and DataWorks have built-in correlation functions that can be called directly:

  1. Excel: CORREL(S24:S28, T24:T28)

  2. ODPS: corr(a, b)

The formulas above give the correlation coefficient of two numerical variables. But how do we evaluate whether the correlation is meaningful? We generally use a hypothesis test to determine whether it is significant.

When performing the Pearson correlation test, first set the significance level α (commonly 0.05 or 0.01). Then compute the sample correlation coefficient and look up the corresponding critical value for the sample size n and significance level α. If the absolute value of the sample correlation coefficient exceeds the critical value, the null hypothesis is rejected and there is a significant linear correlation between the two variables; otherwise the null hypothesis cannot be rejected and no significant linear correlation is established.

Correlation coefficient critical value calculator: https://www.jisuan.mobi/gqY.html
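
Instead of looking up a critical value table, scipy reports a p-value directly; a small sketch with made-up data:

from scipy import stats

x = [1200, 1350, 1500, 1620, 1800]   # hypothetical exposure UV
y = [30, 36, 41, 45, 52]             # hypothetical click UV
r, p_value = stats.pearsonr(x, y)
# reject the null hypothesis of "no linear correlation" when p_value < alpha (e.g. 0.05)
print(r, p_value)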

▐ Correlation for non-numeric or non-normally distributed data: Spearman correlation coefficient

The Spearman correlation coefficient measures correlation based on the ranks of the random variables rather than their raw values. It can be computed with the same method as the Pearson coefficient: simply replace each original value with its rank within its own variable.

Example:

Find the correlation coefficient of the two non-normally distributed sequences (1, 10, 100, 101) and (21, 10, 15, 13).

Replace (1, 10, 100, 101) with its ranks (1, 2, 3, 4) and (21, 10, 15, 13) with its ranks (4, 1, 3, 2), then compute the Pearson correlation coefficient of the two rank sequences.
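
The same example in Python; scipy's spearmanr does the rank replacement internally:

from scipy import stats

x = [1, 10, 100, 101]
y = [21, 10, 15, 13]
rho, p_value = stats.spearmanr(x, y)
print(rho)  # -0.4, the Pearson correlation of the ranks (1,2,3,4) and (4,1,3,2)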

▐ Ranking correlation measure: Kendall correlation coefficient

The Kendall correlation coefficient, also known as the concordance coefficient, is another rank correlation coefficient. It is calculated as follows:

For two pairs of observations (Xi, Yi) and (Xj, Yj) of X and Y: if Xi < Xj and Yi < Yj, or Xi > Xj and Yi > Yj, the two pairs are said to be concordant (in the same order); otherwise they are discordant.

The formula for calculating the Kendall correlation coefficient is as follows:

tau = (P − Q) / ( n(n − 1)/2 ), where P is the number of concordant pairs, Q is the number of discordant pairs, and n is the number of observations.

Example: suppose we have 8 products and want to calculate the correlation between their sales ranking and their GMV ranking:

Merchandise      A  B  C  D  E  F  G  H

Sales ranking    1  2  3  4  5  6  7  8

GMV ranking      3  4  1  2  5  7  8  6

Product A ranks 1st by sales and 3rd by GMV; of the 7 products behind it, 5 have a larger GMV rank (ranks 4-8), so A contributes 5 concordant pairs;

Product B ranks 2nd by sales and 4th by GMV; 4 of the products behind it have a larger GMV rank, so B contributes 4 concordant pairs;

Product C ranks 3rd by sales and 1st by GMV; all 5 products behind it have a larger GMV rank, so C contributes 5 concordant pairs;

Product D ranks 4th by sales and 2nd by GMV; 4 of the products behind it have a larger GMV rank, so D contributes 4 concordant pairs;

and so on.

Concordant pairs P = 5 + 4 + 5 + 4 + 3 + 1 + 0 + 0 = 22;

Total number of pairs = 8 × 7 / 2 = 28;

Discordant pairs Q = 28 − 22 = 6;

tau = (22 − 6) / 28 ≈ 0.57.

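The same worked example can be checked with scipy:

from scipy import stats

sales_rank = [1, 2, 3, 4, 5, 6, 7, 8]
gmv_rank = [3, 4, 1, 2, 5, 7, 8, 6]
tau, p_value = stats.kendalltau(sales_rank, gmv_rank)
print(tau)  # (22 - 6) / 28 ≈ 0.571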

A measure of the similarity between two distributions: KL divergence

KL divergence quantifies the difference between two probability distributions P and Q: the smaller the value, the more similar they are. The formula is as follows:

KL(P‖Q) = Σᵢ P(i) · log( P(i) / Q(i) )

Example: measure the similarity of distributions A and B

  1. A distribution = [0.3,0.2,0.1,0.2,0.2]

  2. B distribution = [0.1,0.3,0.1,0.2,0.3]


The KL divergence is calculated as follows (using the natural logarithm):

KL(A‖B) = 0.3·ln(0.3/0.1) + 0.2·ln(0.2/0.3) + 0.1·ln(0.1/0.1) + 0.2·ln(0.2/0.2) + 0.2·ln(0.2/0.3) ≈ 0.167
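
scipy's entropy function computes exactly this quantity (with the natural logarithm) when given two distributions:

from scipy.stats import entropy

A = [0.3, 0.2, 0.1, 0.2, 0.2]
B = [0.1, 0.3, 0.1, 0.2, 0.3]
print(entropy(A, B))  # KL(A || B) ≈ 0.167
print(entropy(B, A))  # note: KL divergence is not symmetric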

Curve inflection point: KneeLocator

When looking for the best user-retention time point, or choosing the best K for feature clustering, we often need to analyze the shape of a curve to find its inflection point ("knee"/"elbow"). In Python, the kneed package finds inflection points automatically: given a few parameters (the curve's concavity and direction), it locates the knee of a curve for us.

from kneed import KneeLocator
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(1, 31)
y = [0.492, 0.615, 0.625, 0.665, 0.718, 0.762, 0.800, 0.832, 0.859, 0.880, 0.899, 0.914, 0.927, 0.939, 0.949, 0.957, 0.964, 0.970, 0.976, 0.980, 0.984, 0.987, 0.990, 0.993, 0.994, 0.996, 0.997, 0.998, 0.999, 0.999]

kneedle = KneeLocator(x, y, S=1.0, curve='concave', direction='increasing')
print(f'The inflection point is at x = {kneedle.elbow}')


Metric weight determination: entropy method & PCA

▐ Entropy method

The entropy method is a mathematical way of judging the degree of dispersion of an indicator: the greater the dispersion, the greater that indicator's impact on the comprehensive evaluation, and the entropy value is used to measure this dispersion. The steps for calculating weights with the entropy method are as follows:

STEP1: Data standardization

x'_ij = ( x_ij − min_i x_ij ) / ( max_i x_ij − min_i x_ij ), where x_ij is the value of indicator j for sample i.

STEP2: Calculate the information entropy of each indicator

p_ij = x'_ij / Σ_i x'_ij,   e_j = −(1 / ln n) · Σ_i p_ij · ln(p_ij)

STEP3: Determine the weight of each indicator

w_j = (1 − e_j) / Σ_j (1 − e_j)

Note that the entropy method determines weights only from the dispersion of each indicator in the data: the more dispersed an indicator's values, the greater its weight. It does not take the specific business problem into account, so it should be combined with the actual problem before being applied.
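
A compact numpy sketch of the three steps above, assuming all indicators are "larger is better" (negative indicators would need the reversed normalization); the sample matrix is made up:

import numpy as np

def entropy_weights(X):
    # X: rows = samples, columns = indicators, all treated as positive indicators
    X = np.asarray(X, dtype=float)
    # STEP 1: min-max normalization per indicator
    Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    # STEP 2: information entropy of each indicator
    P = Xn / (Xn.sum(axis=0) + 1e-12)
    n = X.shape[0]
    E = -(P * np.log(P + 1e-12)).sum(axis=0) / np.log(n)
    # STEP 3: weights from the redundancy 1 - E
    d = 1.0 - E
    return d / d.sum()

weights = entropy_weights([[5, 100, 0.2], [7, 90, 0.4], [6, 120, 0.1], [9, 95, 0.3]])
print(weights)  # sums to 1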

▐ Principal component analysis

Principal component analysis (PCA) is a multivariate statistical method for investigating the correlation among multiple variables. It reveals the internal structure of many variables through a few principal components: a small number of components are derived from the original variables so that they retain as much of the original information as possible while remaining uncorrelated with each other, and these components are then used as new composite indicators.

Machine Learning PAI in DataWorks provides a built-in PCA component that can be called directly. For details, see: https://pai.dw.alibaba-inc.com/component/detail/255?projectId=21225&spm=a2c3x.12342929.0.0.38e64a9bhMFlBH

After the algorithm runs, it outputs an eigenvalue table and an eigenvector table.

STEP1: Determine the coefficient of the index in the linear combination of each principal component


STEP2: Determine the comprehensive score model coefficients

Take a weighted average of each indicator's coefficients across the three principal components obtained in STEP1, using each component's variance contribution as the weight.

Example: the score-model coefficient of indicator 3 is the variance-weighted average of indicator 3's coefficients in the three principal components.

STEP3: Index weight normalization

That is, normalize the coefficients of the indicators in the comprehensive score model so that they sum to 1.

Example: the weight of indicator 3 is its score-model coefficient divided by the sum of all indicators' score-model coefficients.

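Outside PAI, the same pipeline can be sketched with scikit-learn; weighting components by explained variance and taking absolute coefficients follow one common variant of this method, not necessarily the exact PAI output format, and the data here is random placeholder data:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(200, 3))   # hypothetical indicator matrix
Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=3).fit(Z)

coef = pca.components_.T                 # STEP 1: indicator coefficients in each principal component
var_ratio = pca.explained_variance_ratio_
score_coef = np.abs(coef) @ var_ratio    # STEP 2: weighted average by variance contribution
weights = score_coef / score_coef.sum()  # STEP 3: normalize so the weights sum to 1
print(weights)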

Measuring market competition / selecting the core set: concentration

The Pareto principle comes from Pareto's famous 1906 study of the distribution of wealth in Italy: 20% of the population owned 80% of the wealth. In data analysis, the Pareto principle is often applied to business analysis and demand analysis.

▐ CRn (concentration ratio)

CRn is the ratio of the combined sales of the top n brands in a category to the category's total sales. The lower the value, the lower the market share of the head brands and the weaker their grip on the market, which means relatively more opportunities for mid-tier and long-tail brands.

▐ Consumption concentration

This refers to the share of users/products that contribute the top N% of market sales, e.g. in the 80/20 rule, the share of users/products contributing the top 80% of sales. Core categories can be selected based on their contribution to market share, combined with planning for the number of product slots.

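Both concentration measures are easy to compute once sales are sorted; a short sketch with made-up brand sales:

import numpy as np

sales = np.array([900, 650, 400, 300, 120, 90, 60, 40, 25, 15], dtype=float)  # hypothetical brand sales
share = np.sort(sales)[::-1] / sales.sum()

cr4 = share[:4].sum()                   # CRn with n = 4: share held by the top 4 brands
cum = np.cumsum(share)
n_core = np.searchsorted(cum, 0.8) + 1  # how many brands cover the top 80% of sales
print(cr4, n_core / len(sales))         # concentration ratio and consumption concentration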

Keyword extraction/scoring algorithm: TF-IDF

TF-IDF tends to filter out common words and retain important words. The formula is as follows:

TF-IDF(t, d) = TF(t, d) × IDF(t), where TF(t, d) is the frequency of term t in document d, and IDF(t) = log( N / n_t ), with N the total number of documents and n_t the number of documents containing t.

For example: if search term A is searched frequently within category X but rarely in other categories, then A is more representative of category X and its trend score is higher, and vice versa.

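scikit-learn has a ready-made implementation (its IDF uses a smoothed variant of the formula above); a toy sketch that treats each category's search terms as one document, with invented data:

from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical: one "document" of search terms per category
docs = [
    "dress skirt summer dress sandals",    # category X
    "phone charger phone case screen",     # category Y
    "dress shoes phone stand desk lamp",   # category Z
]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)
terms = vec.get_feature_names_out()
print(dict(zip(terms, tfidf.toarray()[0].round(3))))  # term weights for category X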

Epilogue

The "Storytelling with Data" series not only witnessed the growth of my math novice in the past two years, but also answered my confusion about "University knowledge is useful or useless" when I was a student. In my opinion, as a general education, universities focus on the shaping of fusion value and the cultivation of learning ability. As a beneficiary, I am very grateful that the school taught me "what" and "why", and I can spend the lowest cognitive cost to practice "how to do" after work. In the future, the mountains will be high and the rivers will be long, and I will also share more practical summaries and thoughts with you. Welcome to exchange and learn together!


About the team

We are the data science team of Dajuhuasuan, supporting businesses such as Juhuasuan, the tens-of-billions subsidy program, and daily flash sales. We focus on discounts and the shopping experience: through data insight we mine the value of data and build consumer-operation and supply-operation solutions for both the marketing side and the supply-and-demand side. Together with operations and product teams, we work to create the most price-friendly shopping entrance and the most explosive marketing matrix, making merchandise and mind-share operations efficient and predictable!


Origin blog.csdn.net/Taobaojishu/article/details/130818110