Data Analysis (3) - Data Description

  The previous article described scales of measurement and averages, but that alone is not enough; we need more metrics to describe data.

Measures of center

  The previous chapter introduced measures of center (measure of center). The center is also called the balance point of the data; it summarizes the data to some extent.

  Although a measure of center is a simple, convenient way to describe data, it has many limitations. The following table shows the scores of two basketball players over the last month of games:

  Player A: 7  8  9  9  10 10 11 11 12 13
  Player B: 3  3  4  6  7  10 10 10 13 13 31
  The table deliberately lists the scores from lowest to highest. The following code computes the means and medians of A and B:

import numpy as np

A = [7, 8, 9, 9, 10, 10, 11, 11, 12, 13]
B = [3, 3, 4, 6, 7, 10, 10, 10, 13, 13, 31]
mu_A, mu_B = np.mean(A), np.mean(B)  # means
median_A, median_B = np.median(A), np.median(B)  # medians
print('mu_A = {}, mu_B = {}'.format(mu_A, mu_B))  # mu_A = 10.0, mu_B = 10.0
print('median_A = {}, median_B = {}'.format(median_A, median_B))  # median_A = 10.0, median_B = 10.0

  The mean describes A reasonably well, but not necessarily B: B contains an extreme outlier, and outliers like this pull heavily on the mean. The median is less affected by outliers, but it still cannot fully characterize B, so we need to look for other indicators.

Measures of spread

Range

  Measures of center quantify where the data are centered; measures of spread quantify how dispersed the data are.

  The range is trivial to compute: the maximum of the data minus the minimum, nothing more. It is only a very crude description of how spread out the data are.

A = [7, 8, 9, 9, 10, 10, 11, 11, 12, 13]
B = [3, 3, 4, 6, 7, 10, 10, 10, 13, 13, 31]
range_A, range_B = max(A) - min(A), max(B) - min(B)  # range
print('range_A = {}, range_B = {}'.format(range_A, range_B))  # range_A = 6, range_B = 28

  The minimum and maximum are the lower and upper bounds of the data, and the range is the distance between the two.

  Because the range is computed from the two extreme values, it only measures the width of the data and says nothing about whether the data contain outliers. Its usefulness is therefore very limited; very often the range is used simply because it is simple.

  Still, the range has some typical use cases. In algorithm analysis, although big-O notation expresses an algorithm's efficiency in the average case, we remain very interested in its best and worst cases; when estimating a software project, an important indicator is the "worst-case completion time", since project schedules are rarely encouraging. So the range is not entirely useless after all.

Interquartile range

  First, let us be clear that the interquartile range (四分位距) has nothing whatsoever to do with the quarterback (四分卫).

  Since the range is so easily influenced by outliers, can we simply ignore them? For player B, we could ignore the games where he scored unusually well or unusually poorly. Now suppose there is another player C, with the following scores:

  (player C's score table, sorted from lowest to highest)

  C's 31 points is far higher than his other games, so we ignore the 31. But now a problem arises: players B and C use different rules to decide which outliers to ignore — B drops his lowest scores as well, while C does not. Processing data sets by different rules before comparing them is taboo in data analysis.

  A better way to handle outliers is the interquartile range. First sort the data, then divide it into four equal parts:

  Seen from the distribution, the interquartile range keeps the middle 50% of the data around the median, so outliers are removed from both data sets by one unified standard.

  The lower and upper quartiles are computed much like the median. For a data set of n values: for the lower quartile, if n/4 is an integer, the lower quartile is the average of the numbers at positions n/4 and n/4 + 1; if n/4 is not an integer, round it up, and the number at that position is the lower quartile. For the upper quartile, if 3n/4 is an integer, the upper quartile is the average of the numbers at positions 3n/4 and 3n/4 + 1; if 3n/4 is not an integer, round it up, and the number at that position is the upper quartile.

  Player B has 11 scores in total. 11 ÷ 4 = 2.75, which rounds up to 3, so the lower quartile is the 3rd number in the data set, namely 4. Similarly, 11 × 3 ÷ 4 = 8.25, which rounds up to 9, so the upper quartile is the 9th number, namely 13. Player B's interquartile range is 13 − 4 = 9.
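The rounding rule above can be sketched by hand (a helper of my own, not one of the article's listings; the data are assumed to be sorted):

```python
import math

A = [7, 8, 9, 9, 10, 10, 11, 11, 12, 13]
B = [3, 3, 4, 6, 7, 10, 10, 10, 13, 13, 31]

def quartiles(data):
    '''Lower and upper quartile via the rounding rule described above (data must be sorted).'''
    n = len(data)
    def at(pos):
        if pos == int(pos):  # integer position: average this number and the next
            i = int(pos)
            return (data[i - 1] + data[i]) / 2
        return data[math.ceil(pos) - 1]  # otherwise round up and take that number
    return at(n / 4), at(3 * n / 4)

print(quartiles(A))  # (9, 11)
print(quartiles(B))  # (4, 13)
```

The results agree with the hand computation for B and with the numpy results below.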

import numpy as np

A = [7, 8, 9, 9, 10, 10, 11, 11, 12, 13]
B = [3, 3, 4, 6, 7, 10, 10, 10, 13, 13, 31]
lower_A = np.quantile(A, 1/4, interpolation='lower')  # lower quartile
higher_A = np.quantile(A, 3/4, interpolation='higher')  # upper quartile
lower_B = np.quantile(B, 1/4, interpolation='lower')
higher_B = np.quantile(B, 3/4, interpolation='higher')
print('lower_A = {}, higher_A = {}'.format(lower_A, higher_A))  # lower_A = 9, higher_A = 11
print('lower_B = {}, higher_B = {}'.format(lower_B, higher_B))  # lower_B = 4, higher_B = 13

  Of course, the data can be divided into any number of blocks, for example 100 (percentiles), which is useful for ranking. Suppose a student scores 600 on an exam; the single score alone does not tell you how good it is, but if the 90th percentile of this year's exam is 590, you know the student beat 90% of the candidates.
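The exam numbers above are illustrative only; as a quick sketch of the percentile idea with numpy (using player B's scores and numpy's default linear interpolation):

```python
import numpy as np

B = [3, 3, 4, 6, 7, 10, 10, 10, 13, 13, 31]
p90 = np.quantile(B, 0.9)  # 90th percentile of B's scores
print(p90)  # 13.0 — roughly 90% of B's games scored at or below this value
```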

Boxplot

  We often see box plots like the one below.

  Once you know the quartiles and the interquartile range, box plots are not hard to understand:

  A box plot shows the range, the interquartile range and the median of the data at a glance, and also gives a sense of how skewed the data are.

import matplotlib.pyplot as plt

plt.boxplot([A, B], labels=['player A', 'player B'])  # box plot
plt.show()

  

  Above "player B" there is a small circle, which marks an outlier: B's 31-point game is judged an outlier and is automatically excluded from the box.

  Under the assumption of a normal distribution, the region μ − 3σ ≤ x ≤ μ + 3σ contains the vast majority of the data, and data outside this region are considered abnormal (for more information see: https://mp.weixin.qq.com/s/DgiLzv5sOAS7JeUDk-6fLA).
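The 3σ rule can be checked numerically (a sketch on simulated standard-normal data; the seed is chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 100_000)      # standard normal sample, mu = 0, sigma = 1
inside = np.mean(np.abs(x) <= 3)   # fraction of samples within mu ± 3 sigma
print(round(inside, 3))            # about 0.997
```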

  For the box plot, let lower be the lower quartile, higher the upper quartile, and range the interquartile range; a data point x is judged an outlier if it satisfies:

    x < lower − 1.5 × range   or   x > higher + 1.5 × range

  For player B:

    lower − 1.5 × range = 4 − 13.5 = −9.5
    higher + 1.5 × range = 13 + 13.5 = 26.5

  Values below −9.5 or above 26.5 are outliers; 31 lies outside this interval, so 31 is judged abnormal.
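The same check can be written out directly (a sketch using B's quartiles from earlier in the article):

```python
B = [3, 3, 4, 6, 7, 10, 10, 10, 13, 13, 31]
lower, higher = 4, 13             # quartiles of B computed above
iqr = higher - lower              # interquartile range, 9
low_fence = lower - 1.5 * iqr     # 4 - 13.5 = -9.5
high_fence = higher + 1.5 * iqr   # 13 + 13.5 = 26.5
outliers = [x for x in B if x < low_fence or x > high_fence]
print(outliers)  # [31]
```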

More on standard deviation

  The standard deviation was already discussed in Variance, Standard Deviation and Covariance (Probability 8). It measures the volatility of the data, i.e. it quantifies how far data points deviate from the mean. The following code computes both players' standard deviations:

A = [7, 8, 9, 9, 10, 10, 11, 11, 12, 13]
B = [3, 3, 4, 6, 7, 10, 10, 10, 13, 13, 31]
sigma_A, sigma_B = np.std(A), np.std(B)  # standard deviation
print('σ_A = {}, σ_B = {}'.format(sigma_A, sigma_B))  # σ_A = 1.7320508075688772, σ_B = 7.49545316720865

  Player A's standard deviation describes how dispersed A's data are; σ_A can be thought of as approximately the average distance between the data points in A and their mean. The more scattered the sample, and the more points lie far from the mean, the larger the standard deviation. The standard deviation carries units, the same units as the data it is computed from; A's standard deviation is measured in points scored.
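The claim that σ_A approximates the average distance to the mean can be checked directly (the mean absolute deviation below is my own addition, not part of the original article):

```python
import numpy as np

A = [7, 8, 9, 9, 10, 10, 11, 11, 12, 13]
sigma_A = np.std(A)                                 # standard deviation, about 1.73
mad_A = np.mean(np.abs(np.array(A) - np.mean(A)))   # mean absolute deviation, 1.4
print(sigma_A, mad_A)  # the two are of the same order, though not equal
```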

  The relationship between the data and the standard deviation can be observed with a bar chart:

def add_bar(data, mu, sigma, name):
    '''
    Draw a bar chart of the data with mean and mean ± std lines
    :param data: data set
    :param mu: mean
    :param sigma: standard deviation
    :param name: name of the data set
    '''
    length = len(data)
    plt.bar(range(length), data, width=0.4, color='green', label=name)
    plt.plot((0, length), (mu, mu), 'r-', label='mean')  # mean line
    plt.plot((0, length), (mu + sigma, mu + sigma), 'y-', label='mean + std')  # mean + std line
    plt.plot((0, length), (mu - sigma, mu - sigma), 'b-', label='mean - std')  # mean - std line
    plt.legend(loc='upper left')
    plt.title('bar of ' + name)

fig = plt.figure()
fig.add_subplot(1, 2, 1)
add_bar(A, mu_A, sigma_A, 'A')
fig.add_subplot(1, 2, 2)
add_bar(B, mu_B, sigma_B, 'B')
plt.show()

  The red line in the middle is the mean. The yellow and blue lines are the mean plus and minus one standard deviation; the larger the standard deviation, the farther apart the yellow and blue lines, and so the more dispersed the data. Data outside the two lines can be regarded as outlying.

Measures of variation

  Measures of center and the standard deviation can both describe characteristics of a data set, and both carry units. To compare two different data sets, especially to compare data sets on different scales, we need a way to remove the units.

  The coefficient of variation is the ratio of the standard deviation to the mean. The division removes the units and normalizes the standard deviation; a measure of this kind is called a measure of variation.

  Suppose the table below gives the heights of NBA centers and point guards:

  As the table shows, centers are taller on average, but their coefficient of variation is only 2.4%: although they are the tallest players on their teams, the centers differ little in height from one another. Point guards' average height is closer to that of ordinary people, yet their coefficient of variation is 8.0%, so their heights vary noticeably from team to team; the position places relatively weak demands on height.
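Without the height table, the same calculation can be sketched on the two players' scores from earlier (illustrative only):

```python
import numpy as np

A = [7, 8, 9, 9, 10, 10, 11, 11, 12, 13]
B = [3, 3, 4, 6, 7, 10, 10, 10, 13, 13, 31]
cv_A = np.std(A) / np.mean(A)  # coefficient of variation of A
cv_B = np.std(B) / np.mean(B)  # coefficient of variation of B
print(round(cv_A, 3), round(cv_B, 3))  # 0.173 0.75
```

Both players average 10 points, yet B's coefficient of variation is more than four times A's, which matches the impression that B is the far more volatile scorer.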

z-score

  The z-score, also called a standard score, describes the distance between a single data point and the mean. It is computed as:

    z(i) = (x(i) − x̄) / σ

  Here x(i) is the i-th data point, σ is the sample's standard deviation, and x̄ (x with a bar) is the mean.

  The standard deviation approximates the average distance between all data points and the mean; a z-score is the distance between a single data point and the mean — more precisely, that distance after standardization.

z_source_A = (np.array(A) - mu_A) / sigma_A
z_source_B = (np.array(B) - mu_B) / sigma_B
print('z_source_A =', z_source_A)
print('z_source_B =', z_source_B)

  z_source_A = [-1.73205081 -1.15470054 -0.57735027 -0.57735027 0. 0. 0.57735027 0.57735027 1.15470054 1.73205081]

  z_source_B = [-0.9338995 -0.9338995 -0.80048529 -0.53365686 -0.40024264 0. 0. 0. 0.40024264 0.40024264 2.80169851]

  For z-scores, the mean itself has a z-score of 0, the mean plus one standard deviation has a z-score of 1, and the mean minus one standard deviation has a z-score of −1:

def add_z_bar(z_source, name):
    ''' Draw a bar chart of z-scores '''
    length = len(z_source)
    plt.bar(range(length), z_source, width=0.4, color='green', label=name)
    plt.plot((0, length), (0, 0), 'r-', label='z-score of μ')
    plt.plot((0, length), (1, 1), 'b-', label='z-score of μ + σ')
    plt.plot((0, length), (-1, -1), 'y-', label='z-score of μ - σ')
    plt.title('bar of ' + name)

fig = plt.figure()
fig.add_subplot(2, 1, 1)
add_z_bar(z_source_A, 'z source of A')
fig.add_subplot(2, 1, 2)
add_z_bar(z_source_B, 'z source of B')
plt.show()

  Data below the mean have negative z-scores; data above the mean have positive z-scores; data equal to the mean have a z-score of 0. The chart makes the relationship between z-scores and the mean easy to see.

Correlation coefficient

  The correlation coefficient is a quantitative indicator of how strongly two variables are associated. That the features of a data set are related to one another is an important assumption of machine learning models; prediction works precisely because some correlation exists between the features.

  The correlation coefficient lies between −1 and 1. The stronger the relationship between two features, the closer the coefficient is to ±1; the weaker the relationship, the closer it is to 0. A value near +1 means that when one variable increases, the other increases with it; a value near −1 means that when one increases, the other decreases.

  An adult man's foot length is roughly 1/7 of his height. The following code generates 200 normally distributed height and foot-length samples:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def create_data(count=200):
    '''
    Build a 2-dimensional data set
    :param count: number of samples
    :return: X1: height column, X2: foot-length column
    '''
    np.random.seed(21)  # fix the seed so the same random numbers are generated every run
    X1 = np.random.normal(1.7, 0.036, count)  # 200 normally distributed heights
    low, high = -0.01, 0.01
    # foot length corresponding to the height: foot = height / 7 ± 0.01
    X2 = X1 / 7 + np.random.uniform(low=low, high=high, size=len(X1))
    return X1, X2

X1, X2 = create_data()
df = pd.DataFrame({'height': X1, 'foot': X2})
print(df.head())  # show the first 5 rows

       height      foot
  0  1.698129  0.250340
  1  1.695997  0.233121
  2  1.737505  0.247780
  3  1.654757  0.238051
  4  1.726834  0.241137

  Height and foot length live on different scales: a difference of 1 cm means little for height but a great deal for foot length. To look for an association, we standardize both dimensions, compressing them onto a common scale.

from sklearn import preprocessing

# standardize with z-scores
df_scaled = pd.DataFrame(preprocessing.scale(df), columns=['height_scaled', 'foot_scaled'])
print(df_scaled.head())

  sklearn's preprocessing.scale applies z-score standardization; the result is:

     height_scaled  foot_scaled
  0      -0.051839     0.959658
  1      -0.110551    -1.389462
  2       1.032336     0.610501
  3      -1.246054    -0.716968
  4       0.738525    -0.295843

  Now we can look at the correlation coefficient between the two dimensions:

corr = df_scaled.corr()  # correlation coefficient between the two dimensions
print(corr)

                 height_scaled  foot_scaled
  height_scaled       1.000000     0.614949
  foot_scaled         0.614949     1.000000

  The result tells us that height and foot length are correlated. Since corr() analyzes linear correlation, a correlation coefficient of 0 does not prove two features are unrelated — it only means there is no linear relationship.
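A quick illustration of that caveat (hypothetical data, not from the article): y = x² is completely determined by x, yet when x is symmetric around 0 their linear correlation is 0:

```python
import numpy as np

x = np.linspace(-1, 1, 201)  # symmetric around 0
y = x ** 2                   # perfectly (but nonlinearly) dependent on x
r = np.corrcoef(x, y)[0, 1]  # Pearson linear correlation
print(r)  # ≈ 0
```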

  


  Author: 我是8位的

  Source: https://mp.weixin.qq.com/s/ysMdUdcAk9BuXNH9bvqOBg

  This article is shared for learning and research. To repost it, please contact the author and credit the author and source; non-commercial use only!

  Scan the QR code to follow the author's WeChat official account "我是8位的"


Origin www.cnblogs.com/bigmonkey/p/11842976.html