Statistics and machine learning provide the theoretical basis for data analysis. When I was getting started I read many statistics books, and the complicated formulas and derivations left me confused for a while. For data scientists and analysts, knowing how to apply statistical knowledge to real analysis scenarios matters more. This article is based on actual application scenarios in data analysis work and shares some remarkably useful statistics and machine learning metrics; it does not elaborate on basic metrics, principles, or formula derivations.
This article is the third article in the series.
Part 1: Telling stories with data: Summary of 13 Excel advanced skills
Part 2: Telling stories with data: 17 Python usage summary based on analysis scenarios
Index long-term & short-term growth calculation
▐Short-term growth rate
General growth rate: GMV growth rate; relative ranking growth rate: growth rate of the ranking
Mixed growth rate = GMV growth rate + relative ranking growth rate
Weighted mixed growth rate = index growth rate * log (1+ index)
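As a minimal sketch of the weighted formula above, the snippet below computes a plain growth rate and its log-weighted variant. The function names, the use of the natural logarithm, and the choice of the current-period value as the "index" inside the log are assumptions; the article does not specify them.

```python
import math

def growth_rate(previous, current):
    """Plain period-over-period growth rate."""
    return (current - previous) / previous

def weighted_growth_rate(previous, current):
    """Growth rate scaled by log(1 + current value), so tiny bases do not dominate."""
    return growth_rate(previous, current) * math.log(1 + current)

# Hypothetical GMV values for two products.
small = weighted_growth_rate(100, 200)          # 100% growth on a small base
large = weighted_growth_rate(100_000, 150_000)  # 50% growth on a large base
print(small, large)
```

With these made-up inputs the large product scores higher even though its raw growth rate is lower, which is the point of the log weighting.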
▐Long-term growth trend: CAGR compound growth rate
CAGR is the abbreviation of compound annual growth rate, which is a method of measuring the average growth rate of an indicator over a period of time. CAGR is often used to measure indicators such as return on investment, sales growth rate, etc.
The formula is CAGR = (end value / start value)^(1/n) - 1, where n is the number of years. Example: with a start value of 5, an end value of 20 and n = 2 years, CAGR = (20 / 5)^(1/2) - 1 = 100%.
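The example can be checked with the usual definition of CAGR, (end / start)^(1/n) - 1. This is a small sketch; the function name `cagr` is just an illustrative choice.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate over `years` compounding periods."""
    return (end_value / start_value) ** (1 / years) - 1

rate = cagr(5, 20, 2)
print(f"{rate:.0%}")
```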
Indicator Trend Forecasting: Time Series Method
Data analysis has three purposes: describing the current situation, locating causes, and predicting the future. Trend forecasting analyzes past and present data to predict future developments and support decision-making.
▐Linear trend forecast: Forecast.linear()
It predicts a future value from existing values: given a value of the independent variable x, it predicts y using a linear regression function. The function works best when the data has a linear trend (i.e. y depends linearly on x).
Example: select the data, insert a scatter chart with smooth lines and data markers, and add a trend line with its equation displayed; the equation predicts the same future value as the function.
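For readers working outside Excel, the following is a rough Python equivalent based on ordinary least squares. The argument order mirrors Excel's FORECAST.LINEAR(x, known_y's, known_x's); the function name and the sample data are hypothetical.

```python
def forecast_linear(x, known_ys, known_xs):
    """Predict y at x from a least-squares line fitted to the known points."""
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(known_xs, known_ys))
    sxx = sum((xi - mean_x) ** 2 for xi in known_xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * x

months = [1, 2, 3, 4]
sales = [3.0, 5.0, 7.0, 9.0]   # hypothetical data lying exactly on y = 2x + 1
prediction = forecast_linear(5, sales, months)
print(prediction)
```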
▐ Seasonal Forecast: Forecast.ets()
E-commerce data is often seasonal, and Excel provides advanced forecasting functions for such data. Forecast.ets() forecasts with triple exponential smoothing, a weighted method in which older values receive smaller weights and therefore matter less.
Forecast.ets.seasonality()
It returns the length of the seasonal cycle detected based on historical data. If some data is repeated every 3 months, then its cycle is 3.
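To make the idea of a detected cycle length concrete, here is a sketch that picks the lag with the highest autocorrelation. This is a common generic technique swapped in for illustration, not Excel's actual detection algorithm, and the function name and data are hypothetical. It also ignores trend, which a real series would need removed first.

```python
def detect_season_length(values, max_lag=None):
    """Return the lag (>= 2) with the highest autocorrelation, or None for a constant series."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    variance = sum(d * d for d in deviations)
    if variance == 0:
        return None
    max_lag = max_lag or n // 2
    best_lag, best_score = None, float("-inf")
    for lag in range(2, max_lag + 1):
        score = sum(deviations[t] * deviations[t + lag] for t in range(n - lag)) / variance
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

monthly = [10, 20, 30] * 4   # hypothetical series repeating every 3 months
season = detect_season_length(monthly)
print(season)
```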
Forecast.ets()
The 4th parameter indicates the length of the seasonal pattern. The default value of 1 means automatically detect seasonality.
Example: Forecast.ets.seasonality() tells us that the period of the data is 3, so we fill in 3 for the fourth parameter.
Forecast.ets.confint()
It returns the confidence interval for the predicted value at the specified target date. The default confidence level is 95%, which means that 95% of future points are expected to fall within this radius of the forecast value.
Sample size disparity comparison: WilsonScore
When we run A/B tests or other analyses, we often need to compare the click-through rate and conversion rate of products.
Example: product A has an exposure UV of 1,000 and a click UV of 15; product B has an exposure UV of 100,000 and a click UV of 1,000. The click-through rate of product A is 1.5% and that of product B is 1%. Can we conclude that product A is better? Not directly, because the sample sizes of A and B are very different.
So how should we judge? WilsonScore balances the influence of sample size differences and addresses the inaccuracy of small samples. In essence, the Wilson interval is an interval estimate of the positive rate (for example, the share of users who like an item) that is corrected according to the sample size, so the estimate stays comparable across very different sample sizes. This scoring algorithm is often used for ranking on websites, for example Zhihu's search ranking.
from odps.udf import annotate
import numpy as np


@annotate('string->string')
class wilsonScore(object):
    # Lower bound of the Wilson score interval
    def evaluate(self, input_data):
        # input_data is a string "positives,total", e.g. "15,1000"
        pos = float(input_data.split(',')[0])
        total = float(input_data.split(',')[1])
        p_z = 1.96  # z-score for a 95% confidence level
        pos_rat = pos / total  # positive rate
        score = (pos_rat + (np.square(p_z) / (2. * total))
                 - ((p_z / (2. * total)) * np.sqrt(4. * total * (1. - pos_rat) * pos_rat + np.square(p_z)))) / \
                (1. + np.square(p_z) / total)
        return str(score)
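Outside ODPS, the same lower bound can be sketched in plain Python and applied to the two products from the example (15 clicks out of 1,000 for A, 1,000 clicks out of 100,000 for B). The function name and the zero-total guard are additions for illustration.

```python
import math

def wilson_lower_bound(positives, total, z=1.96):
    """Lower bound of the Wilson score interval (z = 1.96 for about 95% confidence)."""
    if total == 0:
        return 0.0
    p = positives / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / (1 + z * z / total)

score_a = wilson_lower_bound(15, 1000)       # product A: raw CTR 1.5%
score_b = wilson_lower_bound(1000, 100000)   # product B: raw CTR 1.0%
print(round(score_a, 4), round(score_b, 4))
```

Although A has the higher raw click-through rate, its small sample pulls its lower bound below B's, so B ranks first under this score.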
Time Decay Function: Sigmoid
In the analysis process, we often encounter the need to combine long-term historical performance to score users/products/merchants for value measurement and resource allocation. The sigmoid function is also called the Logistic function, which can map a real number to the interval (0,1).
Example: Sigmoid scoring for historical product performance (click-through rate & GMV)
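One possible way to turn the sigmoid into a time-decay weight is sketched below. The `midpoint` and `scale` parameters, the weighting scheme, and the sample data are all hypothetical; the article does not give its exact scoring formula.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function mapping any real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def decay_weight(days_ago, midpoint=30.0, scale=7.0):
    """Hypothetical decay weight: near 1 for recent days, near 0 for old ones."""
    return sigmoid((midpoint - days_ago) / scale)

def decayed_score(history):
    """Weighted average of (days_ago, metric) pairs using the decay weights."""
    weights = [decay_weight(days) for days, _ in history]
    total = sum(w * value for w, (_, value) in zip(weights, history))
    return total / sum(weights)

# Hypothetical click-through rates observed 1, 30 and 90 days ago.
history = [(1, 0.020), (30, 0.015), (90, 0.050)]
score = decayed_score(history)
print(score)
```

The 90-day-old value barely influences the score, because its weight is close to zero.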
Three statistical correlation coefficients: Pearson, Spearman & Kendall
▐Numerical & normal distribution correlation measure: Pearson correlation coefficient
The related functions are built into EXCEL and DataWorks and can be called directly
EXCEL for correlation: CORREL(S24:S28,T24:T28)
odps:corr(a,b)
Through the above formula, we can obtain the correlation coefficient of two numerical variables. How to evaluate the correlation between two variables? We generally use hypothesis testing to determine whether it is significant.
When testing the Pearson correlation coefficient, first set the significance level α; the commonly used levels are 0.05 and 0.01. Then calculate the sample correlation coefficient and look up the critical value for the sample size n and the significance level α. If the absolute value of the sample correlation coefficient is greater than the critical value, the null hypothesis is rejected and there is a significant linear correlation between the two variables; otherwise, the null hypothesis cannot be rejected and no significant linear correlation has been shown.
Correlation coefficient threshold calculator: https://www.jisuan.mobi/gqY.html
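The test procedure can be sketched as follows. The data is hypothetical, and the critical value 0.878 is the tabulated two-tailed value for n = 5 and α = 0.05; for other sample sizes it has to be looked up again.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def is_significant(r, critical_value):
    """Reject the null hypothesis of no linear correlation when |r| exceeds the critical value."""
    return abs(r) > critical_value

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
significant = is_significant(r, 0.878)   # table value for n = 5, alpha = 0.05
print(round(r, 4), significant)
```

Here r is fairly high, but with only five observations it does not clear the critical value, so the correlation is not significant.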
▐Non-numeric/non-normally distributed data correlation measure: Spearman correlation coefficient
The Spearman correlation coefficient measures correlation based on the ranks of the values rather than the raw values. It can be computed with the Pearson method: simply replace each original value with its rank within its variable.
Example:
Find the correlation coefficient of two non-normally distributed variables, (1,10,100,101) and (21,10,15,13).
Replace (1,10,100,101) with its ranks (1,2,3,4) and (21,10,15,13) with (4,1,3,2), then compute the Pearson correlation coefficient of the two rank vectors.
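The rank-then-Pearson procedure can be written out in a few lines. This is a sketch with illustrative function names; tied values are given their average rank, which is a common convention rather than something the article states.

```python
import math

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(xs, ys):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

rho = spearman_rho([1, 10, 100, 101], [21, 10, 15, 13])
print(round(rho, 2))
```

For the two vectors in the example this yields -0.4.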
▐Ranking correlation measure: Kendall correlation coefficient
The Kendall correlation coefficient (Kendall's tau), also called the concordance coefficient, is also a rank correlation coefficient. It is calculated as follows:
For two pairs of observations (Xi, Yi) and (Xj, Yj) of X and Y, if Xi<Xj and Yi<Yj, or Xi>Xj and Yi>Yj, the two observations form a same-order (concordant) pair; otherwise they form a different-order (discordant) pair.
The formula for calculating the Kendall correlation coefficient is R = (P - Q) / (n(n-1)/2), where P is the number of same-order pairs, Q is the number of different-order pairs, and n is the number of observations.
Example: Assuming we have 8 products, we want to calculate the correlation between the sales ranking and GMV ranking of the products
Product | A | B | C | D | E | F | G | H
Sales ranking | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
GMV ranking | 3 | 4 | 1 | 2 | 5 | 7 | 8 | 6
Product A ranks 1st in sales and 3rd in GMV. Among the products ranked below it in sales, those with GMV ranks 4-8 also rank below it in GMV, so it contributes 5 same-order pairs;
Product B ranks 2nd in sales and 4th in GMV; the later products with GMV ranks 5-8 rank below it, so it contributes 4 same-order pairs;
Product C ranks 3rd in sales and 1st in GMV; the later products with GMV ranks 2 and 5-8 rank below it, so it contributes 5 same-order pairs;
Product D ranks 4th in sales and 2nd in GMV; the later products with GMV ranks 5-8 rank below it, so it contributes 4 same-order pairs;
and so on.
Number of same-order pairs P = 5 + 4 + 5 + 4 + 3 + 1 + 0 + 0 = 22;
Total number of pairs = 8 × 7 / 2 = 28;
Number of different-order pairs Q = 28 - 22 = 6;
R = (22 - 6) / 28 = 0.57.
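The pair counting in this example can be reproduced with a short script. It is a sketch that assumes rankings without ties; the function name is illustrative.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Return (same-order pairs, different-order pairs, tau) for two rankings without ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return concordant, discordant, (concordant - discordant) / total_pairs

sales_rank = [1, 2, 3, 4, 5, 6, 7, 8]
gmv_rank = [3, 4, 1, 2, 5, 7, 8, 6]
p, q, tau = kendall_tau(sales_rank, gmv_rank)
print(p, q, round(tau, 2))   # 22 6 0.57
```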
A measure of the similarity between two distributions: KL divergence
KL divergence quantifies the difference between two probability distributions P and Q. The smaller the value, the more similar the two distributions are. The formula is KL(P || Q) = Σ P(i) * log(P(i) / Q(i)), summed over all outcomes i.
Example: Find the similarity of AB distribution
A distribution = [0.3,0.2,0.1,0.2,0.2]
B distribution = [0.1,0.3,0.1,0.2,0.3]
The KL divergence is calculated as follows:
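A minimal sketch of the calculation for the two distributions above, using the standard definition KL(P || Q) = Σ P(i) * log(P(i) / Q(i)). The natural logarithm is an assumption; a different base only rescales the result.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) with natural log; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

a = [0.3, 0.2, 0.1, 0.2, 0.2]
b = [0.1, 0.3, 0.1, 0.2, 0.3]
kl_ab = kl_divergence(a, b)
kl_ba = kl_divergence(b, a)
print(round(kl_ab, 4), round(kl_ba, 4))
```

Note that KL divergence is not symmetric: KL(A || B) and KL(B || A) generally differ, so the direction should be fixed before comparing results.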
Curve inflection point: KneeLocator
When looking for the best user retention time point, or the best K value when clustering features, we often need to analyze the shape of a curve to find its inflection point. In Python, a package called kneed finds inflection points automatically. It only needs a few parameters (the concavity and direction of the curve) to locate the inflection point of a curve.
from kneed import KneeLocator
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 31)
y = [0.492, 0.615, 0.625, 0.665, 0.718, 0.762, 0.800, 0.832, 0.859, 0.880, 0.899, 0.914, 0.927, 0.939, 0.949, 0.957, 0.964, 0.970, 0.976, 0.980, 0.984, 0.987, 0.990, 0.993, 0.994, 0.996, 0.997, 0.998, 0.999, 0.999]

# S is the sensitivity; the curve is concave and increasing
kneedle = KneeLocator(x, y, S=1.0, curve='concave', direction='increasing')
print(f'The inflection point is at x = {kneedle.elbow}')
Index weight determination method: entropy method & PCA
▐Entropy method
The entropy method refers to a mathematical method used to judge the degree of dispersion of an index. The greater the degree of dispersion, the greater the impact of the index on the comprehensive evaluation. The entropy value can be used to judge the degree of dispersion of an indicator. The entropy method to calculate the weight steps are as follows:
STEP1: Data standardization
STEP2: Calculate the information entropy of each indicator
STEP3: Determine the weight of each indicator
The entropy method determines weights only from the degree of dispersion of each indicator in the data (the more dispersed the values, the greater the weight) and does not take the specific business problem into account. Therefore, weights obtained with the entropy method should be checked against the specific problem before they are used.
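The three steps can be sketched with the textbook formulation of the entropy weight method: min-max standardization, entropy e_j = -(1/ln n) Σ p_ij ln p_ij, and weights proportional to 1 - e_j. This assumes all indicators are "larger is better" and that no column is constant; the data is hypothetical.

```python
import math

def entropy_weights(rows):
    """rows: list of samples, each a list of larger-is-better indicator values."""
    n, m = len(rows), len(rows[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in rows]
        low, high = min(column), max(column)
        # STEP1: min-max standardization (assumes the column is not constant)
        scaled = [(v - low) / (high - low) for v in column]
        total = sum(scaled)
        shares = [s / total for s in scaled]
        # STEP2: information entropy, treating 0 * ln(0) as 0
        entropy = -sum(p * math.log(p) for p in shares if p > 0) / math.log(n)
        divergences.append(1 - entropy)
    # STEP3: normalize the divergence degrees into weights
    divergence_sum = sum(divergences)
    return [d / divergence_sum for d in divergences]

# Hypothetical data: 4 products x 2 indicators
data = [[10, 1.0], [20, 5.0], [30, 5.2], [40, 5.1]]
w = entropy_weights(data)
print([round(x, 3) for x in w])
```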
▐Principal component analysis
Principal component analysis is a multivariate statistical method for examining the correlation among multiple variables. It studies how to reveal the internal structure of many variables through a few principal components: a small number of principal components are derived from the original variables so that they retain as much of the original information as possible while being uncorrelated with each other, and they then serve as new composite indicators.
Machine learning PAI in DataWorks is configured with PCA components, which can be called directly. For details, see: https://pai.dw.alibaba-inc.com/component/detail/255?projectId=21225&spm=a2c3x.12342929.0.0.38e64a9bhMFlBH
After the algorithm is deployed, the eigenvalue and eigenvector tables in the following format are generated:
STEP1: Determine the coefficient of the index in the linear combination of each principal component
STEP2: Determine the comprehensive score model coefficients
Perform a weighted average of the three principal components of each indicator obtained in STEP1:
Example: The index 3 score model coefficient is
STEP3: Index weight normalization
That is, the coefficients of each factor in the comprehensive score model are normalized.
Example: The weight coefficient of indicator 3 is
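The three steps can be sketched as follows, under two assumptions that the article does not confirm: the eigenvector entries are used directly as each indicator's coefficients in the principal components, and the weighted average in STEP2 uses the eigenvalues (variance contributions) as weights. The eigenvalues and eigenvectors below are made-up illustrative numbers, not output from a real PCA run.

```python
def pca_indicator_weights(eigenvalues, eigenvectors):
    """eigenvectors[k][j] is the coefficient of indicator j in principal component k."""
    eigen_sum = sum(eigenvalues)
    n_indicators = len(eigenvectors[0])
    # STEP2: average each indicator's coefficients, weighted by eigenvalue
    score_coefficients = [
        sum(ev * vec[j] for ev, vec in zip(eigenvalues, eigenvectors)) / eigen_sum
        for j in range(n_indicators)
    ]
    # STEP3: normalize so the weights sum to 1 (assumes positive coefficients)
    coefficient_sum = sum(score_coefficients)
    return [c / coefficient_sum for c in score_coefficients]

# Hypothetical values for 3 principal components and 3 indicators
eigenvalues = [2.0, 1.0, 0.5]
eigenvectors = [
    [0.5, 0.6, 0.4],
    [0.3, 0.2, 0.7],
    [0.6, 0.1, 0.2],
]
weights = pca_indicator_weights(eigenvalues, eigenvectors)
print([round(x, 3) for x in weights])
```

If some coefficients are negative, a convention for handling them (for example taking absolute values) has to be chosen before normalizing.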
Market competition measurement/core circle selection: concentration
The Pareto principle comes from Pareto's famous 1906 finding about the distribution of social wealth in Italy: 20% of the population owns 80% of the social wealth. In data analysis, the Pareto principle is often applied in two areas: business analysis and demand analysis.
▐ CRN
Refers to the ratio of the combined sales of the top N brands in a category to the category's total sales. The lower the value, the lower the market share of the head brands, the weaker their hold on the market, and the more opportunities there are for mid-tier and long-tail brands.
▐Consumption Concentration
Refers to the proportion of users/products that contribute the top N% of the market share, for example the proportion of users/products that contribute the top 80% of the market share under the 80/20 rule. Core categories can be selected based on their market share contribution, and the number of product display slots can be planned accordingly.
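Both measures are simple to compute once sales are sorted in descending order. The sketch below uses hypothetical brand sales and illustrative function names.

```python
def cr_n(sales, n):
    """CRN: share of total sales held by the n best-selling brands."""
    ranked = sorted(sales, reverse=True)
    return sum(ranked[:n]) / sum(ranked)

def share_of_items_for(sales, target=0.8):
    """Smallest fraction of items whose combined sales reach `target` of the total."""
    ranked = sorted(sales, reverse=True)
    total, running = sum(ranked), 0.0
    for count, value in enumerate(ranked, start=1):
        running += value
        if running >= target * total:
            return count / len(ranked)
    return 1.0

brand_sales = [450, 300, 120, 50, 30, 20, 15, 8, 5, 2]   # hypothetical
cr3 = cr_n(brand_sales, 3)
core_share = share_of_items_for(brand_sales, 0.8)
print(cr3, core_share)
```

In this made-up category the top 3 brands hold 87% of sales, and 30% of the brands are enough to cover 80% of sales.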
Keyword extraction/scoring algorithm: TF-IDF
TF-IDF tends to filter out common words and retain important words. The formula is TF-IDF = TF × IDF, where TF is the frequency of a term within a document and IDF is the logarithm of the total number of documents divided by the number of documents that contain the term.
For example, if search term A is searched often in category X but rarely in other categories, then search term A is more representative of category X and its score is higher, and vice versa.
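Treating each category as a "document" and each search term as a "word", the scoring can be sketched as below. The unsmoothed IDF, the function name, and the search counts are assumptions made for illustration.

```python
import math

def tf_idf(term, category, searches_by_category):
    """searches_by_category maps category -> {search term: search count}."""
    counts = searches_by_category[category]
    tf = counts.get(term, 0) / sum(counts.values())
    containing = sum(1 for c in searches_by_category.values() if c.get(term, 0) > 0)
    if containing == 0:
        return 0.0
    idf = math.log(len(searches_by_category) / containing)
    return tf * idf

# Hypothetical search counts per category
searches = {
    "X": {"A": 80, "B": 20},
    "Y": {"B": 50, "C": 50},
    "Z": {"B": 30, "C": 70},
}
score_a = tf_idf("A", "X", searches)   # A appears only in category X
score_b = tf_idf("B", "X", searches)   # B appears in every category
print(round(score_a, 4), score_b)
```

Term A is specific to category X and gets a positive score, while term B appears everywhere and scores zero.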
Epilogue
The "Storytelling with Data" series has not only witnessed my growth from a math novice over the past two years, but also answered the question I had as a student about whether university knowledge is useful. In my opinion, as a general education, university focuses on shaping values and cultivating the ability to learn. As a beneficiary, I am very grateful that school taught me the "what" and the "why", so that after starting work I could practice the "how" at the lowest cognitive cost. The road ahead is long, and I will keep sharing practical summaries and thoughts with you. You are welcome to exchange ideas and learn together!
Team introduction
We are the data science team of Dajuhuasuan, responsible for supporting businesses such as Juhuasuan, the tens-of-billions subsidy program, and daily special sales. We focus on discounts and the shopping experience, using data insight to mine the value of data and build consumer operation and supply operation solutions for both the supply and demand sides of the marketing field. We work together with operations and product teams to create the most price-friendly shopping entrance and the most explosive marketing matrix, making the operation of goods and consumer mindshare efficient and predictable!