Practical Statistics for Data Scientists, Chapter 1: Exploratory Data Analysis

Note: the code in this book is written in R.

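This chapter summarizes the following topics:

Exploring basic properties of the data, such as location and variability;
Exploring properties of the data graphically, such as the data distribution and correlation.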

I. Exploring basic properties of the data

1. Estimates of location

The metrics used to estimate the location of the data are summarized below:

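Metric	Formula	Pros and cons
Mean		Sensitive to outliers
Trimmed mean		Removes the influence of outliers by dropping the most extreme values before averaging
Weighted mean		Does not remove the influence of outliers. It is mainly used when the accuracy of some samples' values is in doubt for some reason: assigning a different weight to each sample expresses how much that sample is trusted
Median	The middle number in a sorted list of the data	More robust to outliers than the mean
Weighted median	Sort the samples by value, keeping each sample's weight; the weighted median is a value such that the sum of the weights is equal for the lower and upper halves of the sorted list	Robust to outliers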
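
The formula cells for the mean, trimmed mean and weighted mean are empty in the table above. As a reminder, the standard textbook definitions can be written as follows. The notation is an assumption of this sketch, not taken from the table: n is the number of samples, x_(i) is the i-th smallest value, p is the number of values trimmed from each end, and w_i is the weight of sample i.

```latex
\text{Mean:}\quad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
\qquad
\text{Trimmed mean:}\quad \bar{x}_{\text{trim}} = \frac{\sum_{i=p+1}^{n-p} x_{(i)}}{n - 2p}
\qquad
\text{Weighted mean:}\quad \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
```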

Of the metrics above, the trimmed mean and the median are both robust to outliers.
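
To make the comparison concrete, here is a minimal R sketch on a small made-up vector with one outlier. The data, the weights and the helper `weighted_median` are all hypothetical and only serve as an illustration; the helper returns the smallest sorted value at which the cumulative weight reaches half of the total, which is one common convention among several.

```r
# Illustrative data only: four ordinary values and one outlier.
x <- c(1, 2, 3, 4, 100)
w <- c(1, 1, 1, 1, 0)  # hypothetical weights: the last sample is not trusted at all

mean_x     <- mean(x)              # 22, pulled up by the outlier
trimmed_x  <- mean(x, trim = 0.2)  # drops 20% of the values at each end: mean(2, 3, 4) = 3
weighted_x <- weighted.mean(x, w)  # (1 + 2 + 3 + 4) / 4 = 2.5
median_x   <- median(x)            # 3

# Hypothetical helper (not from the book): sort by value, then return the
# first value whose cumulative weight reaches half of the total weight.
weighted_median <- function(x, w) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  x[which(cumsum(w) >= sum(w) / 2)[1]]
}

# With equal weights this reduces to the ordinary median (3 here).
wmedian_x <- weighted_median(x, rep(1, length(x)))
```

Only the plain mean is dragged far away from the bulk of the data; the weighted mean stays close here only because the outlier happens to carry zero weight.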

2. Estimates of variability

The metrics used to estimate the variability of the data are summarized below:

Metric	Formula	Pros and cons
Mean absolute deviation		Sensitive to outliers
Variance		Sensitive to outliers
Standard deviation		Sensitive to outliers
Median absolute deviation		Robust to outliers. Sometimes the median absolute deviation is multiplied by a constant scaling factor (it happens to work out to 1.4826) to put MAD on the same scale as the standard deviation in the case of a normal distribution.
Trimmed standard deviation	Computed like the trimmed mean: drop the extreme values first	Robust to outliers
Range	The largest value minus the smallest value	Sensitive to outliers; not very useful as a measure of dispersion in the data
Percentile	A common measure of variability is the difference between the 25th percentile and the 75th percentile, called the interquartile range (IQR)	Robust to outliers
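
The following base R sketch computes most of these metrics on a small made-up vector with one outlier (the data are purely illustrative). Note that R's `mad()` applies the 1.4826 scaling factor by default, and `var()` and `sd()` use the n - 1 denominator.

```r
# Illustrative data only: four ordinary values and one outlier.
x <- c(1, 2, 3, 4, 100)

mean_abs_dev <- mean(abs(x - mean(x)))      # mean absolute deviation: 31.2
variance     <- var(x)                      # sample variance: 1902.5
std_dev      <- sd(x)                       # square root of the variance
mad_raw      <- median(abs(x - median(x)))  # unscaled median absolute deviation: 1
mad_scaled   <- mad(x)                      # 1.4826 * mad_raw (R's default constant)
range_x      <- diff(range(x))              # largest minus smallest: 99
iqr_x        <- IQR(x)                      # 75th minus 25th percentile: 2
```

The outlier inflates the mean absolute deviation, variance, standard deviation and range, while the median absolute deviation and the IQR barely notice it.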

II. Exploring properties of the data graphically

1. Exploring the data distribution

A data distribution can be analyzed from the following angles:

location;
variability;
skewness (skewness refers to whether the data is skewed to larger or smaller values);
kurtosis (kurtosis indicates the propensity of the data to have extreme values).
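
Base R has no built-in skewness or kurtosis function, so the sketch below computes simple moment-based versions by hand. The helper names `moment_skewness` and `moment_kurtosis` are hypothetical, the data are made up, and other definitions (with bias corrections, or "excess" kurtosis) exist in add-on packages.

```r
# Hypothetical helpers: moment-based skewness and kurtosis.
moment_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
moment_kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

symmetric <- c(1, 2, 3, 4, 5)    # illustrative, symmetric around 3
skewed    <- c(1, 2, 3, 4, 100)  # illustrative, one large value on the right

skew_sym  <- moment_skewness(symmetric)  # 0: no skew
skew_right <- moment_skewness(skewed)    # positive: skewed to larger values
kurt_sym  <- moment_kurtosis(symmetric)  # 1.7
kurt_right <- moment_kurtosis(skewed)    # larger: an extreme value is present
```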

The following summaries and plots can be used to analyze a data distribution:

Percentile

quantile(state[["Murder.Rate"]], probs=c(.05, .25, .5, .75, .95))
5% 25% 50% 75% 95%
1.600 2.425 4.000 5.550 6.510

Boxplot

boxplot(state[["Population"]]/1000000, ylab="Population (millions)")

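A boxplot is read as follows: the bottom and top of the box are the 25th and 75th percentiles, and the line inside the box is the median. By default in R, each whisker extends to the most extreme data point that is no more than 1.5 times the IQR away from the box; points beyond the whiskers are drawn individually.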
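
The numbers behind a boxplot can be inspected without drawing anything by calling `boxplot.stats()`, which `boxplot()` uses internally. The sketch below runs it on a small made-up vector (illustrative data only).

```r
# Illustrative data only: four ordinary values and one outlier.
x <- c(1, 2, 3, 4, 100)

bs <- boxplot.stats(x)  # default coef = 1.5, i.e. whiskers reach at most 1.5 * IQR

# bs$stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
box_numbers <- bs$stats  # 1 2 3 4 4
# bs$out holds the points drawn individually beyond the whiskers
outliers <- bs$out       # 100
```

Here the box spans 2 to 4, so the upper whisker may reach at most 4 + 1.5 * 2 = 7; the value 100 lies beyond that and is reported as an outlier.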

Violin plot
The y-axis of a violin plot has the same meaning as in a boxplot; the width of the violin along the x-axis shows how concentrated the data is at each value of y:

library(ggplot2)
ggplot(data=airline_stats, aes(airline, pct_carrier_delay)) +
  ylim(0, 50) +
  geom_violin() +
  labs(x="", y="Daily % of Delayed Flights")

Frequency table

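# Split the population range into 10 equal-width bins (11 break points)
breaks <- seq(from=min(state[["Population"]]),
              to=max(state[["Population"]]), length.out=11)
# Assign each state to a bin, then count the states per bin
pop_freq <- cut(state[["Population"]], breaks=breaks,
                right=TRUE, include.lowest=TRUE)
table(pop_freq)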

Histogram

hist(state[["Population"]], breaks=breaks)
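
A histogram is essentially a plot of a frequency table. The self-contained sketch below, on made-up data, shows that `cut()` plus `table()` and `hist()` produce the same bin counts when given the same breaks.

```r
# Illustrative data only.
x <- c(1, 2, 2, 3, 5, 8, 9, 10)

# Three equal-width bins over the range of x: break points 1, 4, 7, 10
breaks <- seq(from = min(x), to = max(x), length.out = 4)

# Frequency table: bins are [1,4], (4,7], (7,10]
bins <- cut(x, breaks = breaks, right = TRUE, include.lowest = TRUE)
freq <- table(bins)

# Same binning through hist(), without drawing the plot
h <- hist(x, breaks = breaks, plot = FALSE)

freq_counts <- as.vector(freq)  # 4 1 3
hist_counts <- h$counts         # 4 1 3
```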

Density estimate

hist(state[["Murder.Rate"]], freq=FALSE)
lines(density(state[["Murder.Rate"]]), lwd=3, col="blue")

2. Exploring binary and categorical data

Bar plot
Pie chart

Note: categorical data can be summarized by the mode (the most frequently occurring category) or, when the categories correspond to numeric values, by the expected value.
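
Both summaries take one line each in base R. The sketch below uses made-up data: a short vector of hypothetical loan grades for the mode, and hypothetical payoff values with their probabilities for the expected value.

```r
# Illustrative categorical data only.
grades <- c("A", "B", "B", "C", "B", "A")

# Mode: the category with the highest count
counts <- table(grades)
mode_grade <- names(counts)[which.max(counts)]  # "B"

# Expected value: each numeric outcome weighted by its probability
# (hypothetical outcomes and probabilities; the probabilities sum to 1)
values <- c(300, 50, 0)
probs  <- c(0.05, 0.15, 0.80)
expected_value <- sum(values * probs)  # 300*0.05 + 50*0.15 + 0*0.80 = 22.5
```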

3. Correlation

The relationship between variables can be described with a correlation matrix or a scatterplot:

Correlation matrix

library(corrplot)
# Keep the ETF columns for the rows dated after 2012-07-01
etfs <- sp500_px[row.names(sp500_px)>"2012-07-01",
                 sp500_sym[sp500_sym$sector=="etf", 'symbol']]
corrplot(cor(etfs), method = "ellipse")

Scatterplot

plot(telecom$T, telecom$VZ, xlab="T", ylab="VZ")

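Note: the Pearson correlation coefficient describes the linear relationship between two variables. It is defined as:

$$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\, s_x s_y}$$

where s_x and s_y are the standard deviations of x and y, \bar{x} and \bar{y} are their means, and N is the number of samples.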
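
As a quick sanity check of the Pearson definition, the sketch below computes the coefficient by hand (sum of products of deviations, divided by N - 1 times both standard deviations) and compares it with R's built-in `cor()`. The two vectors are made up for illustration.

```r
# Illustrative data only.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Pearson correlation from the definition; sd() uses the n - 1 denominator
r_manual <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))

# Built-in equivalent (method = "pearson" is the default)
r_builtin <- cor(x, y)
```

The two values agree up to floating-point error, and the result always lies between -1 and 1.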

4. Using plots to explore relationships between two or more variables

Hexagonal binning plot
A scatterplot draws every sample as a point on a two-dimensional plane. This works only for small data sets (hundreds of records). For large data sets (hundreds of thousands of records) the points become so dense that the scatterplot turns into a single dark cloud. The hexagonal binning plot addresses this weakness. Its core idea: we group the records into hexagonal bins and plot the hexagons with a color indicating the number of records in that bin.
The following R code draws a hexagonal binning plot:

# stat_binhex() needs the hexbin package to be installed
ggplot(kc_tax0, aes(x=SqFtTotLiving, y=TaxAssessedValue)) +
  stat_binhex(colour="white") +
  theme_bw() +
  scale_fill_gradient(low="white", high="black") +
  labs(x="Finished Square Feet", y="Tax Assessed Value")

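Contour plot
Each closed curve in a contour plot is a line of equal density, and adjacent contour lines differ by the same density step. The contours work like a topographical map: the density of points increases as you move inward toward a "peak". Where the contour lines are packed closely together, the density changes quickly; where they are far apart, it changes slowly.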

ggplot(kc_tax0, aes(SqFtTotLiving, TaxAssessedValue)) +
  theme_bw() +
  geom_point(alpha=0.1) +
  geom_density2d(colour="white") +
  labs(x="Finished Square Feet", y="Tax Assessed Value")

Contingency table
A contingency table summarizes the counts for two categorical variables:

library(descr)
x_tab <- CrossTable(lc_loans$grade, lc_loans$status,
                    prop.c=FALSE, prop.chisq=FALSE, prop.t=FALSE)
# The two categorical variables:
#   grade:  A, B, C, D, E
#   status: Fully paid, Current, Late, Charged off
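
If the descr package is not available, a similar summary can be sketched with base R only. This is a swapped-in alternative, not the book's method: `table()` gives the counts and `prop.table()` gives the row proportions. The six records below are made up for illustration.

```r
# Illustrative data only: six hypothetical loans.
grade  <- c("A", "A", "B", "B", "B", "C")
status <- c("Fully paid", "Current", "Fully paid", "Late", "Fully paid", "Charged off")

# Counts for every (grade, status) combination
tab <- table(grade, status)

# Row proportions: the share of each status within a grade
row_props <- prop.table(tab, margin = 1)

b_fully_paid <- tab["B", "Fully paid"]  # 2 of the 3 grade-B loans
```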

Visualizing multiple variables

ggplot(subset(kc_tax0, ZipCode %in% c(98188, 98105, 98108, 98126)),
       aes(x=SqFtTotLiving, y=TaxAssessedValue)) +
  stat_binhex(colour="white") +
  theme_bw() +
  scale_fill_gradient(low="white", high="blue") +
  labs(x="Finished Square Feet", y="Tax Assessed Value") +
  facet_wrap("ZipCode")
# Besides the x and y axes, there is a conditioning variable: ZipCode in {98105, 98108, 98126, 98188}
