Data Types for Machine Learning Exploratory Data Analysis

Data type is an important concept in statistics: we need to understand it correctly in order to apply the right analysis to each kind of data and draw valid conclusions. This post introduces the data types you will meet in exploratory data analysis for machine learning, so that you can properly understand and use your data.
A good understanding of the data structure is very important for exploratory analysis in machine learning. For different data types, we need different statistical measures for analysis and testing. At the same time, it is also necessary to choose an appropriate visualization method according to the type of data to help us better understand the data. Finally, data types also provide an efficient way to classify variables.
Categorical data
Categorical data represent the attributes or characteristics of objects. Gender, language, and nationality, for example, are categorical. Categorical data can also be represented numerically (e.g., 1 for female and 0 for male), but these numbers have no mathematical meaning; they are merely labels for the categories.
   Nominal data
Nominal variables label the characteristics of objects. They carry no quantitative value; they are just labels. Nominal data is unordered: changing the order of the categories does not change what the data means.

The figure above shows typical nominal data for a sample, describing each individual's gender and language. A special case is the dichotomous (binary) variable, which has only two categories.
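To make the "labels, not quantities" point concrete, here is a minimal Python sketch using only the standard library. The sample values and the 0/1 coding are made up for illustration.

```python
from collections import Counter
from statistics import mean, mode

# Hypothetical sample: gender coded as 1 = female, 0 = male (labels only).
gender_codes = [1, 0, 1, 1, 0, 1]

# The mean can be computed, but it is not a meaningful "average gender";
# at best it happens to equal the share of records labeled 1.
print(mean(gender_codes))

# Counting and the mode are valid summaries for this kind of data.
counts = Counter(gender_codes)
most_common_code = mode(gender_codes)
print(counts, most_common_code)

# Swapping the labels (0 <-> 1) changes the codes but not the information:
# the same group sizes appear, just under different labels.
relabeled_counts = Counter(1 - code for code in gender_codes)
same_group_sizes = sorted(counts.values()) == sorted(relabeled_counts.values())
print(same_group_sizes)
```

Any other pair of labels, such as "F" and "M", would serve equally well, which is exactly why arithmetic on the codes should be avoided.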
   Ordinal data
Ordinal data represent discrete, ordered units. It is almost the same as nominal data, except that the order of its categories matters. The educational background data below illustrates the characteristics of ordinal data.

The four options in the chart above represent successively higher levels of educational attainment, but we cannot quantify whether the difference between primary education and high school is the same as the difference between high school and college. Because the differences between values are not quantified, ordinal data is mostly used to measure non-numeric features such as sentiment and user satisfaction.
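A small Python sketch of how such an ordered scale might be handled. The level names, in particular the fourth one, and the respondents are assumptions made for illustration, since the original chart is not available.

```python
# Hypothetical education levels, listed from lowest to highest.
LEVELS = ["primary school", "high school", "college", "graduate school"]
RANK = {level: i for i, level in enumerate(LEVELS)}

respondents = ["college", "primary school", "graduate school", "high school"]

# Order is meaningful, so sorting and comparisons are valid operations.
ordered = sorted(respondents, key=RANK.__getitem__)
print(ordered)
print(RANK["college"] > RANK["high school"])

# Both rank gaps below equal 1, but that is an artifact of the coding:
# nothing says the two educational steps are "equally large".
gap_low = RANK["high school"] - RANK["primary school"]
gap_high = RANK["college"] - RANK["high school"]
print(gap_low, gap_high)
```

The ranks exist only to carry the order; subtracting or averaging them would assume equal spacing, which ordinal data does not provide.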
Numerical data
   Discrete data
Discrete data takes distinct, separate values: it can only take values at certain specific points. Such data cannot be measured, but it can be counted, and the information it carries can be grouped into categories. Counting heads in repeated coin tosses is the best-known example. We cannot predict whether the next toss will come up heads or tails, but we can estimate the probability distribution from counts of past outcomes.
When working with discrete data, ask two questions: can the data be counted, and can it be divided into smaller and smaller parts? If the data can be measured but not counted, then we are dealing with continuous data.
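The coin example can be sketched in a few lines of Python. The recorded tosses are invented for illustration; the point is only that discrete outcomes are counted and that the relative frequencies serve as probability estimates.

```python
from collections import Counter

# Hypothetical record of past tosses: "H" = heads, "T" = tails.
tosses = list("HTTHHTHHTH")

# Discrete data is counted, not measured.
counts = Counter(tosses)
total = len(tosses)

# Relative frequency as an estimate of each outcome's probability.
estimated = {side: n / total for side, n in counts.items()}
print(counts)
print(estimated)
```

With only ten tosses the estimate is rough; it becomes more stable as more outcomes are recorded.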
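   Continuous data
Continuous data represent measurements: values that cannot be counted but can be measured on some continuous scale, such as a person's height or age. They are usually represented with real numbers.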
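Interval data
Interval variables describe attributes whose values are ordered and equally spaced. With an interval variable we know both the order of the values and the exact difference between them, and we can quantify that difference. Temperature is a typical example of interval data.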

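The problem with interval variables is that they have no absolute zero: for the temperature in the figure above, 0 degrees does not mean "no temperature". With interval data we can add and subtract, but we cannot multiply, divide, or compute ratios. Because there is no true zero, many descriptive and inferential statistics that rely on ratios cannot be applied to interval data.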
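The following Python sketch illustrates why differences are meaningful on an interval scale while ratios are not. The two temperatures are arbitrary example values; the Celsius-to-Fahrenheit formula is the standard one.

```python
def c_to_f(celsius: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

cold_c, warm_c = 10.0, 20.0
cold_f, warm_f = c_to_f(cold_c), c_to_f(warm_c)  # 50.0 and 68.0

# Differences are meaningful: the 10-degree Celsius gap is the same
# physical gap in Fahrenheit, only scaled by the fixed factor 9/5.
diff_c = warm_c - cold_c
diff_f = warm_f - cold_f
print(diff_c, diff_f)

# Ratios are not: "twice as warm" depends on where the scale puts zero.
ratio_c = warm_c / cold_c
ratio_f = warm_f / cold_f
print(ratio_c, ratio_f)
```

The same two temperatures give a ratio of 2 in Celsius but a different ratio in Fahrenheit, so neither number describes a real property of the temperatures.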
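Ratio data
Ratio data, like interval data, are ordered and equally spaced, but they also have an absolute zero. They describe variables measured from a true zero point, such as weight, height, and length.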
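For contrast, a quantity with a true zero keeps its ratios when the unit changes. The sketch below uses made-up weights and an approximate kilogram-to-pound factor.

```python
import math

KG_TO_LB = 2.20462  # approximate conversion factor

light_kg, heavy_kg = 5.0, 10.0
light_lb, heavy_lb = light_kg * KG_TO_LB, heavy_kg * KG_TO_LB

# With a true zero, "twice as heavy" holds in either unit.
ratio_kg = heavy_kg / light_kg
ratio_lb = heavy_lb / light_lb
print(ratio_kg, ratio_lb)
print(math.isclose(ratio_kg, ratio_lb))
```

Zero kilograms and zero pounds both mean "no weight", so a unit change is a pure rescaling and ratios survive it.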

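Why are data types so important?
Data types matter for statistics and machine learning because different statistical methods apply to different types of data. If you analyze categorical data with methods designed for continuous data, you will very likely reach the wrong conclusions. Understanding data types helps us choose the right methods and statistical models to explore and analyze the data. So which statistical methods should we use for each data type?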
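For nominal data, focus on three things: frequencies, proportions or percentages, and visualization. Frequency measures how many times something occurs within a given period or within the dataset. Dividing a frequency by the total number of observations gives that category's share of the data. Pie charts and bar charts are the best ways to present this kind of data.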
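A frequency table with percentages can be built with the standard library alone. The language column below is hypothetical, and the row of `#` characters is only a stand-in for a real bar chart.

```python
from collections import Counter

# Hypothetical nominal column: the language each respondent speaks.
languages = ["English", "Chinese", "English", "Spanish", "Chinese", "English"]

freq = Counter(languages)
total = sum(freq.values())

# Percentage = frequency / total * 100.
percent = {label: 100 * n / total for label, n in freq.items()}

# Print one row per category, most frequent first, with a text "bar".
for label, n in freq.most_common():
    print(f"{label:<8} {n:>2} {percent[label]:5.1f}%  {'#' * n}")
```

In practice the same counts would typically be handed to a plotting library to draw the bar or pie chart.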

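For ordinal data, in addition to frequencies and percentages, we can describe the data with order-based statistics such as percentiles and the median.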
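One way to compute order-based summaries is to work on the ranks and map the result back to a label. The satisfaction scale and the answers below are assumptions for illustration.

```python
from statistics import median_low

# Hypothetical satisfaction scale, ordered from worst to best.
SCALE = ["very unhappy", "unhappy", "neutral", "happy", "very happy"]
RANK = {label: i for i, label in enumerate(SCALE)}

answers = ["happy", "neutral", "very happy", "unhappy",
           "happy", "neutral", "happy"]

ranks = sorted(RANK[a] for a in answers)

# median_low always returns an observed value, so it maps back to a label.
median_label = SCALE[median_low(ranks)]
print(median_label)

# Share of answers at or below "neutral" (a cumulative proportion).
at_or_below_neutral = sum(r <= RANK["neutral"] for r in ranks) / len(ranks)
print(at_or_below_neutral)
```

`median_low` is used here instead of `median` because averaging the two middle ranks could produce a value that corresponds to no category.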
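For continuous data a richer set of tools is available. Besides the usual mean and variance, measures such as the range (peak-to-peak value) can be used to describe it. To show the spread of the data, box plots (with whiskers) and histograms are intuitive choices. A box plot shows the central tendency and the spread of the data, while a histogram shows its overall shape, center, distribution, and trends.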
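These summaries can be sketched with Python's `statistics` module. The heights are invented, and the text histogram with 10 cm bins is only a rough stand-in for a plotted one.

```python
from statistics import mean, median, variance

# Hypothetical heights in centimeters.
heights = [158.0, 162.5, 165.0, 170.0, 171.5, 174.0, 180.0, 183.0]

avg = mean(heights)
var = variance(heights)                    # sample variance
value_range = max(heights) - min(heights)  # peak-to-peak value
mid = median(heights)
print(avg, var, value_range, mid)

# A crude text histogram with 10 cm wide bins.
bin_width = 10
bins = {}
for h in heights:
    start = int(h // bin_width) * bin_width
    bins[start] = bins.get(start, 0) + 1
for start in sorted(bins):
    print(f"{start}-{start + bin_width}: {'#' * bins[start]}")
```

The bin width is a free choice; narrower bins show more detail but need more data to give a stable picture.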

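In this post we saw that, besides discrete and continuous numerical types, statistics also distinguishes nominal, ordinal, interval, and ratio data. Different data types call for different analysis and visualization methods. Understanding your data is the first requirement before you start working with it: it helps you choose the right tools and methods, approach exploration and analysis with the right mindset, and reach correct and valid conclusions more easily.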
-The End-
Source: Machine Learning Blog
Compiled by: T.R

