The Box-Cox Transformation and How to Estimate Its Parameter lambda

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/u012735708/article/details/84755595

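Why transform data at all? Apart from small samples, where nonparametric methods can be considered, most statistical theory and most parametric tests are derived under the assumption that the data are normally distributed.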

For the basics of the Box-Cox transformation, see: BoxCox-变换方法及其实现运用.pptx

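For background on maximum likelihood estimation, see: 极大似然估计思想的最简单解释 (the simplest explanation of the idea behind maximum likelihood estimation)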

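From the material above we know that:

  • In the boxcox1p transform, the shift +c in y+c (boxcox1p uses c = 1) ensures that (y+c) > 0, because the Box-Cox transform requires y > 0.
  • Python code:
  • y_boxcox = special.boxcox1p(y, lam_best), where the optimized lambda lam_best is obtained either by maximizing the log-likelihood function (llf) or by calling boxcox_normmax(x).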
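To make the role of the shift concrete, here is a minimal sketch that writes the Box-Cox transform out from its definition and checks it against special.boxcox1p. The helper name boxcox_manual and the sample values are made up for illustration only.

```python
import numpy as np
from scipy import special

def boxcox_manual(y, lam):
    """Box-Cox transform from its definition; intended for y > 0 (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y**lam - 1.0) / lam

# Illustrative data containing a zero, which the plain Box-Cox transform does not allow
y = np.array([0.0, 0.5, 3.0, 10.0])
lam = 0.3

shifted = boxcox_manual(y + 1.0, lam)   # shift by c = 1, then transform
builtin = special.boxcox1p(y, lam)      # SciPy's shifted version
print(np.allclose(shifted, builtin))

# The inverse undoes both the transform and the shift
print(special.inv_boxcox1p(builtin, lam))
```

In other words, boxcox1p(y, lam) behaves like boxcox(1 + y, lam), so a lambda meant for boxcox1p is, strictly speaking, a lambda for the shifted data 1 + y.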

Notes on boxcox_normmax(x); for details see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox_normmax.html

scipy.stats.boxcox_normmax(x, brack=(-2.0, 2.0), method='pearsonr')
Compute optimal Box-Cox transform parameter for input data.

Parameters:
    x : array_like
        Input array.
    brack : 2-tuple, optional
        The starting interval for a downhill bracket search with optimize.brent. Note that this is in most cases not critical; the final result is allowed to be outside this bracket.
    method : str, optional
        The method to determine the optimal transform parameter (boxcox lmbda parameter). Options are:
        'pearsonr' (default)
            Maximizes the Pearson correlation coefficient between y = boxcox(x) and the expected values for y if x would be normally-distributed.
        'mle'
            Maximizes the log-likelihood boxcox_llf. This is the method used in boxcox.
        'all'
            Use all optimization methods available, and return all results. Useful to compare different methods.

Returns:
    maxlog : float or ndarray
        The optimal transform parameter found. An array instead of a scalar for method='all'.
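As a quick illustration of the three method options, the sketch below calls boxcox_normmax on synthetic data. The log-normal sample is an assumption made for the example (it is not the data used in this post); for log-normal data the best lambda should be close to 0, because the log transform already makes the data normal.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive data (assumed for illustration only)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.5, size=2000)

lam_pearson = stats.boxcox_normmax(x, method='pearsonr')  # maximize probability-plot correlation
lam_mle = stats.boxcox_normmax(x, method='mle')           # maximize the Box-Cox log-likelihood
lam_all = stats.boxcox_normmax(x, method='all')           # array holding both results

print('pearsonr:', lam_pearson)
print('mle:     ', lam_mle)
print('all:     ', lam_all)
```

The two methods optimize different criteria, so their results usually differ slightly; method='all' is a convenient way to see both at once.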

Next, let's practice on the dataset from the Kaggle competition House Prices: Advanced Regression Techniques.

For how to use scipy.stats.boxcox_llf, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox_llf.html
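Before using boxcox_llf, it may help to see what it computes. The sketch below writes the Box-Cox log-likelihood out by hand as (lambda - 1) * sum(log x) - N/2 * log(var(y)), where y is the transformed data and var is the population variance, and compares it with SciPy's value. The helper name boxcox_llf_manual and the gamma-distributed sample are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def boxcox_llf_manual(lam, x):
    """Box-Cox log-likelihood written out by hand (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    y = stats.boxcox(x, lam)  # transform with a fixed lambda
    # np.var uses ddof=0 by default, i.e. sum((y - mean)^2) / n
    return (lam - 1.0) * np.sum(np.log(x)) - n / 2.0 * np.log(np.var(y))

# Synthetic, strictly positive data (assumed for illustration only)
rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=3.0, size=500)

for lam in (-1.0, 0.0, 0.5, 2.0):
    print(lam, boxcox_llf_manual(lam, x), stats.boxcox_llf(lam, x))
```

The lambda that maximizes this quantity is the maximum likelihood estimate, which is what the grid search in the script looks for.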

import pandas as pd
import numpy as np
from scipy import stats,special
import matplotlib.pyplot as plt

train = pd.read_csv('./data/train.csv')
y = train['SalePrice']
print(y.shape)

lam_range = np.linspace(-2, 5, 100)  # 100 grid points (np.linspace defaults to num=50)
llf = np.zeros(lam_range.shape, dtype=float)

# Evaluate the Box-Cox log-likelihood at each candidate lambda
for i, lam in enumerate(lam_range):
    llf[i] = stats.boxcox_llf(lam, y)  # y must be strictly positive

# Find the index of the maximum log-likelihood (llf) and take the corresponding lambda
lam_best = lam_range[llf.argmax()]
print('Suitable lam is: ', round(lam_best, 2))
print('Max llf is: ', round(llf.max(), 2))

plt.figure()
plt.axvline(round(lam_best, 2), ls="--", color="r")
plt.plot(lam_range, llf)
plt.savefig('boxcox.jpg')  # save before show(); saving after show() can produce an empty image
plt.show()

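# Box-Cox transform:
print('before convert: ', '\n', y.head())
# y_boxcox = stats.boxcox(y, lam_best)
# Note: boxcox1p transforms 1 + y, whereas lam_best was estimated on y itself;
# the two agree closely only when y is much larger than 1.
y_boxcox = special.boxcox1p(y, lam_best)
print('after convert: ', '\n', pd.DataFrame(y_boxcox).head())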

# inverse boxcox convert:
y_invboxcox = special.inv_boxcox1p(y_boxcox, lam_best)
print('after inverse: ', '\n', pd.DataFrame(y_invboxcox).head())
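The grid search in the script can only return a value that lies on the grid. As a cross-check, stats.boxcox called without a lambda estimates the maximum likelihood lambda itself. The sketch below compares the two on synthetic data; y_demo is a hypothetical stand-in for SalePrice (the Kaggle file is not read here), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for SalePrice: positive, right-skewed values
rng = np.random.default_rng(42)
y_demo = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

# Grid search over lambda, in the same style as the script
lam_grid = np.linspace(-2, 2, 401)  # step of 0.01
llf_grid = np.array([stats.boxcox_llf(lam, y_demo) for lam in lam_grid])
lam_from_grid = lam_grid[llf_grid.argmax()]

# With lmbda omitted, stats.boxcox returns the transformed data and the MLE lambda
y_trans, lam_mle = stats.boxcox(y_demo)

print('grid search lambda:', round(lam_from_grid, 2))
print('stats.boxcox lambda:', round(lam_mle, 4))
```

The grid search is still useful for plotting the log-likelihood curve, while the optimizer gives the estimate to higher precision.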

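Running the script prints the selected lambda and the maximum log-likelihood, followed by the first rows of SalePrice before the transform, after the transform, and after the inverse transform. It also plots the log-likelihood against lambda, with the best value marked by a dashed red line.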

In addition, lambda can also be determined with scipy.stats.boxcox_normplot; for details see http://scipy.github.io/devdocs/generated/scipy.stats.boxcox_normplot.html
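A minimal sketch of that approach follows, again on synthetic data assumed for illustration. With plot left at its default of None, boxcox_normplot draws nothing and simply returns the lambda grid together with the probability-plot correlation coefficient at each lambda, so the best lambda can be read off with argmax.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive data (assumed for illustration only)
rng = np.random.default_rng(7)
x = rng.lognormal(mean=0.0, sigma=0.6, size=1000)

# Correlation coefficient of the normal probability plot for lambda in [-2, 2]
lmbdas, ppcc = stats.boxcox_normplot(x, -2, 2, N=201)
lam_from_plot = lmbdas[np.argmax(ppcc)]

# boxcox_normmax with method='pearsonr' optimizes the same kind of criterion
lam_pearson = stats.boxcox_normmax(x, method='pearsonr')

print('lambda from normplot grid:', round(lam_from_plot, 2))
print('lambda from boxcox_normmax:', round(lam_pearson, 4))
```

To see the curve itself, pass a matplotlib Axes object (or the pyplot module) as the plot argument.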
