Statistical modeling - Study Notes

Note: Reference video tutorial --- NetEase Cloud Classroom, "Statistical Modeling Made Easy," by Zhang Wentong

  • Traditional model:

y = f(x, \theta) + \varepsilon

y: the dependent variable; x: the independent variable; \theta: unknown parameters; \varepsilon: the disturbance (error) term.

The first term captures the effect of the independent variable on the dependent variable, reflecting the common (systematic) pattern; the second is an additive term reflecting individual variation.

In statistical modeling, we first specify the expression for the first term, then estimate the unknown parameters according to the distribution of the disturbance.
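A minimal numeric sketch of this setup, assuming a linear f(x, \theta) = \theta_0 + \theta_1 x (the data and parameter values below are made up for illustration):

```python
# Sketch of the traditional model y = f(x, theta) + epsilon with a linear f.
# The "true" parameters and the data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
eps = rng.normal(0, 1, size=100)           # the disturbance term epsilon
y = 2.0 + 0.5 * x + eps                    # "true" theta = (2.0, 0.5)

# Estimate theta by least squares: minimize sum((y - f(x, theta))^2)
X = np.column_stack([np.ones_like(x), x])  # design matrix for the linear f
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)                           # should be close to [2.0, 0.5]
```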

Drawbacks of the traditional model:

  1. Its explicit expression can capture only simple relationships; more complex functions cannot be expressed;
  2. It applies only when the independent and dependent variables can be distinguished.

  • Measurement scale variables:

Measurement scale: the degree of precision with which the quantity of interest is measured.

  1. Nominal scale: carries the least information; equivalent to unordered (multinomial) categories;
  2. Ordinal scale: ordered categories; the numeric difference between categories cannot be measured;
  3. Interval scale: measures the gap between values but lacks an absolute zero, so only addition and subtraction are meaningful; ratio scale: has an absolute zero, so multiplication and division are meaningful as well.
     
Level             Variable type
Nominal level     Nominal variable
Ordinal level     Ordinal variable
Interval level    Interval variable
Ratio level       Ratio variable
 

Nominal variables may also be called categorical variables; ordinal variables are also called ordered variables; interval and ratio variables are together called quantitative variables.

In the table, the level of measurement rises from top to bottom. It is worth noting that a higher-level variable can be converted into a lower-level one by discarding information. For example, a test score on 0-100 is originally a ratio variable, but it can be binned into 0-60, 60-80, and 80-100 and labeled fail, pass, and excellent, which makes it an ordinal variable; discarding still more information, we can record only "good" for scores above 60 and "bad" otherwise, which makes it a nominal variable.

A low-level variable cannot be converted to a high-level one, because that conversion would require adding information, which is usually inaccurate.
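A small sketch of the score example above (the score values are made up; the bin edges follow the notes):

```python
# Discarding information to move down the measurement-scale hierarchy.
scores = [45, 72, 88, 60, 95]            # ratio scale: 0-100 test scores

def to_ordinal(s):
    # ordinal scale: fail / pass / excellent for 0-60, 60-80, 80-100
    if s < 60:
        return "fail"
    elif s < 80:
        return "pass"
    return "excellent"

def to_nominal(s):
    # nominal scale: only "good" (>= 60) vs "bad"; the grading detail is dropped
    return "good" if s >= 60 else "bad"

print([to_ordinal(s) for s in scores])   # ['fail', 'pass', 'excellent', 'pass', 'excellent']
print([to_nominal(s) for s in scores])   # ['bad', 'good', 'good', 'good', 'good']
```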


  • Model categories:

If the independent and dependent variables can be distinguished:

(Survival analysis models; note: the outcome is survival.)

If the independent and dependent variables cannot be distinguished:

  • Classification by purpose:

Clustering methods: applications include market segmentation and collaborative recommendation.

Prediction methods: regression models, time series models.

Association induction methods: market basket analysis, sequence analysis.

  • Classification by underlying principle:

1. Inference methods based on traditional statistical models

Supported by sampling theory: first assume that some formulaic relationship exists between the predicted variable and its influencing factors, then use hypothesis testing to verify whether that assumption holds, and report the corresponding parameter estimates.

2. Automated methods based on machine-learning techniques

Non-inferential methods with no prior assumptions: they search for associations directly in the data set, and then validate the discovered associations on a validation data set.
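A minimal sketch of the train/validate idea behind these automated methods (the data, the model choice, and the use of scikit-learn are my own illustrative assumptions):

```python
# Find a pattern on one subset of the data, then check it on a held-out
# validation subset -- no formulaic relationship is assumed up front.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=500)   # nonlinear, unknown to the model

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)
model = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)

# the association found on the training set is validated on unseen data
print("train R^2:", model.score(X_train, y_train))
print("validation R^2:", model.score(X_val, y_val))
```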

 


  • Loss function:

Loss function: a function that measures the model's information loss, or the degree of its prediction error.
The ultimate goal of model fitting: minimize the loss function.

Common loss functions for the different variable types:

  1. For categorical variables: misclassification proportion, classification accuracy, entropy;
  2. For continuous variables: a summary of the information carried by the residuals and the loss it induces, such as the residual sum of squares in least squares, or the sum of absolute deviations (least absolute deviations).

Note: factor analysis and principal component analysis have no target variable, so no loss function exists for them.
Only supervised learning needs a loss function.
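A quick numeric illustration of the losses in the list above, with made-up numbers: misclassification proportion for a categorical variable, and residual sum of squares versus sum of absolute residuals for a continuous one.

```python
import numpy as np

# categorical variable: misclassification proportion
labels_true = np.array([0, 1, 1, 0])
labels_pred = np.array([0, 1, 0, 0])
print(np.mean(labels_true != labels_pred))   # 0.25

# continuous variable: two ways to summarize the residuals
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])
residuals = y_true - y_pred

rss = np.sum(residuals ** 2)     # squared loss: penalizes large errors more
sad = np.sum(np.abs(residuals))  # absolute loss: more robust to outliers
print(rss, sad)                  # 1.75 2.5
```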

For a convex function, any local minimum is the global minimum, as in Figures 1 and 2.

For a non-convex function, a local minimum need not be the global minimum, as in Figure 3.

Whenever possible, construct the loss function to be convex; minimization is then easier, because any local minimum found is the global minimum.
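A small sketch of why this matters (the example functions are my own, not from the notes): gradient descent reaches the global minimum of a convex function from any starting point, but can get trapped in a local minimum of a non-convex one.

```python
# Plain gradient descent on a convex vs. a non-convex function.
def grad_descent(grad, x0, lr=0.05, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# convex: f(x) = (x - 2)^2, unique global minimum at x = 2
print(grad_descent(lambda x: 2 * (x - 2), x0=-10.0))  # ~2.0 from any start

# non-convex: f(x) = x^4 - 3x^2 + x, two local minima (~ -1.30 and ~ 1.13)
g = lambda x: 4 * x**3 - 6 * x + 1
print(grad_descent(g, x0=-2.0))   # ends near the left minimum (the global one)
print(grad_descent(g, x0=+2.0))   # stuck near the right, merely local, minimum
```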


  • Controlling model complexity: the penalty term

A penalty is a deduction of points.

On top of the ideal loss function, add a penalty term expressing the model's complexity, so that accuracy is not pursued blindly at the cost of an overly complex model.

  • Derivation:

The original model: original loss function = measure of model accuracy;

revised to: new loss function = measure of model accuracy + measure of model complexity.

However, since different applications may not weight accuracy and complexity equally, a weight is introduced and the formula is further revised to:

new loss function = measure of model accuracy + \lambda \cdot measure of model complexity.
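Written out, the penalized objective has the generic form below (the symbols L for the accuracy term and \Omega for the complexity term are my own shorthand):

\hat{\theta} = \arg\min_{\theta} \left[ L(\theta) + \lambda \cdot \Omega(\theta) \right], \quad \lambda \ge 0

For instance, ridge regression (mentioned below) takes L(\theta) = \sum_i (y_i - x_i^\top \theta)^2 and \Omega(\theta) = \sum_j \theta_j^2; a larger \lambda favors a simpler model.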

  • Other names for regularization:
  1. In machine learning: regularization;
  2. In statistics: the model penalty term (penalty);
  3. In mathematics: the norm;
  • Basic purposes:

Keep the model as simple as possible, so that too many parameters do not lead to overfitting; and constrain the model's characteristics by injecting prior knowledge, such as sparsity or low rank. The regularization term is typically a monotonically increasing function of model complexity: the more complex the model, the larger the penalty.

  • Several common types of regularization / penalty term / norm:

L0 regularization: the complexity measure is the number of non-zero model parameters; easy to understand, but mathematically hard to optimize.

L1 regularization: the (weighted) sum of the absolute values of the model parameters; geometrically the Manhattan distance (block distance; as I understand it, take the difference in each component, then sum the absolute values). Mainly used for feature selection / variable screening. Example: Lasso regression.

L2 regularization: the square root of the (weighted) sum of the squares of the model parameters, i.e., the Euclidean distance; mainly used to prevent overfitting. Example: ridge regression.

Ln regularization: the n-th root of the (weighted) sum of the n-th powers of the model parameters.
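A small sketch contrasting the L1 and L2 penalties via the two examples named above, Lasso and ridge regression (the data and the scikit-learn usage are my own illustrative assumptions; alpha plays the role of \lambda):

```python
# The L1 penalty zeroes out irrelevant coefficients (feature selection),
# while the L2 penalty only shrinks them toward zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, size=200)  # only 2 of 5 features matter

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("Lasso:", np.round(lasso.coef_, 3))   # irrelevant coefficients exactly 0
print("Ridge:", np.round(ridge.coef_, 3))   # irrelevant coefficients small, not 0
```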

 
