Meituan Machine Learning in Practice (1): The General Process

Table of Contents

Chapter 1: Problem Modeling

1.1 Evaluation Metrics

1.2 Sample Selection

1.3 Cross-Validation

Chapter 2: Feature Engineering

2.1 Feature Extraction

2.2 Feature Selection

Chapter 3: Common Models

Chapter 4: Model Ensembles

4.1 Theoretical Analysis

4.2 Ensemble Methods


 

The common process for solving practical problems with machine learning:

How to analyze the problem

How to do feature engineering, and how to compare and select among common models

How to evaluate the results

Model-ensembling techniques commonly used in machine learning competitions

 

 

Chapter 1: Problem Modeling

1.1 Evaluation Metrics

Classification metrics: precision and recall, ROC curve and AUC

Regression metrics: MAE (mean absolute error), MAPE (mean absolute percentage error), RMSE (root mean squared error)

Ranking metrics: MAP (mean average precision), NDCG (normalized discounted cumulative gain)
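As an illustration of the classification and regression metrics above, here is a minimal pure-Python sketch (the function names are my own, not from the book):

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def mae(y_true, y_hat):
    """Mean absolute error."""
    return sum(abs(t - h) for t, h in zip(y_true, y_hat)) / len(y_true)

def rmse(y_true, y_hat):
    """Root mean squared error."""
    return (sum((t - h) ** 2 for t, h in zip(y_true, y_hat)) / len(y_true)) ** 0.5
```

In practice these come ready-made (e.g., scikit-learn's `metrics` module); the point here is only to make each definition concrete.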

 

1.2 Sample Selection

Three reasons for sample selection:

(1) Data volume: too much wastes resources; too little makes the model inaccurate

(2) Data with low correlation to the target contributes nothing to prediction

(3) Noisy data should be removed

Methods of sample selection: data denoising; sampling; prototype selection; training-set selection

 

1.2.1 Data Denoising

Noisy data: for example, missing feature values in e-commerce data, or missing annotations in image classification

(In practice: images can be pre-screened through an open API and then filtered manually; for e-commerce, behavioral data can be filtered with thresholds)

 

1.2.2 Sampling

Five kinds of sampling methods

(1) simple random sampling without replacement

(2) Simple sampling with replacement

(3) Sampling balance: according to a predefined ratio, the sample recombined.

Such as n 100, 10000 negative predefined 1:10. Then the sampling is: being copied 10 times; negative sampled: delete negative remaining 1000

(4) Cluster Sampling: sample is divided into N cluster, then Randomly s <= N th

(5) layered samples: samples x%, respectively positive and negative samples, positive and negative samples to ensure the same proportion.
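The balanced-sampling example (100 positives, 10,000 negatives, target ratio 1:10) can be sketched in Python; the function names and the fixed seed are my own assumptions:

```python
import random

def upsample_positives(pos, neg, ratio=10):
    """Balance by copying positives until pos:neg reaches 1:ratio."""
    copies = max(1, len(neg) // (ratio * len(pos)))
    return pos * copies, neg

def downsample_negatives(pos, neg, ratio=10, seed=0):
    """Balance by randomly keeping only ratio * len(pos) negatives."""
    rng = random.Random(seed)
    return pos, rng.sample(neg, ratio * len(pos))

pos = list(range(100))       # 100 positive samples
neg = list(range(10_000))    # 10,000 negative samples
p_up, n_up = upsample_positives(pos, neg)    # 1,000 positives, 10,000 negatives
p_dn, n_dn = downsample_negatives(pos, neg)  # 100 positives, 1,000 negatives
```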

 

1.2.3 Prototype Selection and Training-Set Selection

 

1.3 Cross-Validation

Hold-out method, k-fold cross-validation, bootstrapping
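A minimal sketch of the k-fold split (index generation only; the helper name is mine):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; yield (train, validation) index lists."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size
```

Each sample lands in exactly one validation fold, so every sample is used for validation once and for training k-1 times.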

 

Chapter 2: Feature Engineering

Data and features determine the upper bound of machine learning; models and algorithms can only approach that bound.

A simple model trained on a large amount of data beats a complex model trained on a small amount of data.

More data beats a cleverer algorithm, but better data beats more data.

 

2.1 Feature Extraction

The first step of feature engineering: understand the business data and the business logic

Common statistical features: counts, ratios, and distribution statistics (mean, peak, quantiles, etc.)

2.1.1 Exploratory Data Analysis

EDA (Exploratory Data Analysis): exploring the data before modeling, divided into visualization and quantitative analysis.
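The quantitative-analysis side can start as small as a per-column summary; a standard-library sketch (the summary keys are my own choice):

```python
import statistics

def quick_summary(xs):
    """Minimal quantitative EDA: size, centre, spread, and range of a column."""
    xs = sorted(xs)
    return {
        "n": len(xs),
        "mean": statistics.mean(xs),
        "stdev": statistics.pstdev(xs),  # population standard deviation
        "min": xs[0],
        "median": statistics.median(xs),
        "max": xs[-1],
    }
```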

2.1.2 Numerical Features

Processing methods: truncation, binarization, binning (fixed-width or quantile), scaling, missing-value handling (imputation or ignoring), feature crosses (combinations via addition, subtraction, multiplication, and division; FM/FFM perform feature crossing automatically), nonlinear encoding (polynomial kernels, etc.), and row statistics
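A few of the listed operations as one-line sketches (pure Python; names are my own):

```python
def truncate(x, lo, hi):
    """Truncation: clip extreme values into [lo, hi]."""
    return max(lo, min(hi, x))

def binarize(x, threshold):
    """Binarization: 1 if the value exceeds the threshold, else 0."""
    return 1 if x > threshold else 0

def fixed_width_bin(x, width):
    """Fixed-width binning: map a value to its bucket index."""
    return int(x // width)

def min_max_scale(xs):
    """Scaling: map a column linearly into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]
```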

2.1.3 Categorical Features

Natural-number (ordinal) encoding, one-hot encoding, hierarchical encoding (e.g., for ID-card numbers), hash encoding, count encoding, count-rank encoding, target encoding
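Count encoding and target encoding can be sketched like this; note the target-encoding version is deliberately naive, since in practice it is computed out-of-fold to avoid label leakage:

```python
from collections import Counter, defaultdict

def count_encode(values):
    """Count encoding: replace each category by its frequency in the column."""
    counts = Counter(values)
    return [counts[v] for v in values]

def target_encode(values, targets):
    """Naive target encoding: replace each category by its mean target value.
    (Leaks the label if fitted and applied on the same training fold.)"""
    sums, ns = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        ns[v] += 1
    return [sums[v] / ns[v] for v in values]
```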

2.1.4 Temporal Features

2.1.5 Spatial Features

2.1.6 Text Features

Corpus construction, text cleaning, tokenization, bag-of-words / N-grams, Skip-Gram, etc.
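Bag-of-words and N-grams fit in a few lines (whitespace tokenization only, a simplification of real text cleaning):

```python
def bag_of_words(doc):
    """Bag-of-words: lowercase, split on whitespace, count tokens."""
    counts = {}
    for tok in doc.lower().split():
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def ngrams(tokens, n):
    """All contiguous n-grams over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```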

2.2 特征选择

特征选择的目的:简化模型(使模型更易理解)、改善性能(节省存储和计算开销)、改善通用性,降低过拟合风险

特征选择的过程:产生过程,评价函数,停止准则,验证过程

特征选择的方法:过滤方法,封装方法,嵌入方法
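The simplest filter method scores each feature independently of any model; a variance-threshold sketch (names my own):

```python
def variance(xs):
    """Population variance of a column."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def variance_filter(columns, threshold=0.0):
    """Filter method: keep indices of columns whose variance exceeds threshold.
    A (near-)constant column carries no information for prediction."""
    return [i for i, col in enumerate(columns) if variance(col) > threshold]
```

Wrapper methods instead search feature subsets by repeatedly retraining a model, and embedded methods (e.g., L1 regularization) select features during training itself.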

 

Chapter 3: Common Models

3.1 Logistic Regression

3.2 Field-aware Factorization Machines

3.3 Gradient Boosted Decision Trees

 

Chapter 4: Model Ensembles

4.1 Theoretical Analysis

Ensemble gains, the error-ambiguity decomposition, diversity measures, diversity enhancement

4.2 Ensemble Methods

Averaging, voting, bagging, stacking
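Averaging and majority voting are the two methods simple enough to sketch directly; bagging and stacking additionally retrain models, which is omitted here:

```python
from collections import Counter

def average_ensemble(model_scores):
    """Averaging: mean score per example across models
    (for regression outputs or predicted probabilities)."""
    return [sum(scores) / len(scores) for scores in zip(*model_scores)]

def majority_vote(model_labels):
    """Voting: most frequent predicted label per example across models."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*model_labels)]
```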

Origin blog.csdn.net/weixin_41770169/article/details/93229577