Table of Contents
These notes cover the common workflow for solving practical problems with machine learning:
How to analyze and model a problem
How feature engineering works; how to compare and select common models
How to evaluate model performance
Fusion techniques for machine learning models commonly used in competitions
Chapter 1 Problem Modeling
1.1 Evaluation Metrics
Classification metrics: precision and recall, ROC and AUC
Regression metrics: MAE (mean absolute error), MAPE (mean absolute percentage error), RMSE (root mean squared error)
Ranking metrics: MAP (mean average precision), NDCG (normalized discounted cumulative gain)
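The metrics above can be sketched in plain Python. These are minimal illustrative implementations for binary labels (not the notes' own code); libraries such as scikit-learn provide production versions.

```python
import math

def precision_recall(y_true, y_pred):
    # Precision = TP / (TP + FP); recall = TP / (TP + FN), with 1 = positive.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

def mae(y_true, y_pred):
    # Mean absolute error.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root mean squared error.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    # Mean absolute percentage error (assumes no zero targets).
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)
```

For example, with `y_true = [1, 1, 0, 0]` and `y_pred = [1, 0, 1, 0]` both precision and recall come out to 0.5.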
1.2 Sample Selection
Three reasons for sample selection:
(1) Too much data wastes resources; too little yields inaccurate models
(2) Data with low correlation to the target has no effect on prediction
(3) Noisy data should be removed
Methods of sample selection: denoising, sampling, prototype selection, training-set selection
1.2.1 Data Denoising
Noisy data: e.g. missing feature values in e-commerce data, or mislabeled images in image classification
( Applications: images can be pre-screened through an open API and then filtered manually; for e-commerce, thresholds on behavioral data can be used )
1.2.2 Sampling
Five sampling methods:
(1) Simple random sampling without replacement
(2) Simple random sampling with replacement
(3) Balanced sampling: recombine the samples according to a predefined ratio.
E.g. 100 positives, 10,000 negatives, predefined ratio 1:10. Then either oversample: copy each positive 10 times; or undersample: delete negatives until 1,000 remain
(4) Cluster sampling: divide the samples into N clusters, then randomly pick s <= N of them
(5) Stratified sampling: sample x% from the positives and the negatives separately, keeping the positive/negative ratio unchanged
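The balanced-sampling example (100 positives, 10,000 negatives, target ratio 1:10) can be sketched both ways. This is an illustrative sketch, not code from the notes; function names are my own.

```python
import random

def oversample_positives(pos, neg, target_ratio=10):
    # Replicate the positives until the negative:positive ratio is about target_ratio.
    copies = max(1, len(neg) // (target_ratio * len(pos)))
    return pos * copies, neg

def undersample_negatives(pos, neg, target_ratio=10, seed=0):
    # Randomly keep only target_ratio * len(pos) negatives, discarding the rest.
    rng = random.Random(seed)
    kept = rng.sample(neg, target_ratio * len(pos))
    return pos, kept

pos = list(range(100))       # 100 positive samples
neg = list(range(10_000))    # 10,000 negative samples

p1, n1 = oversample_positives(pos, neg)   # 1,000 positives vs 10,000 negatives -> 1:10
p2, n2 = undersample_negatives(pos, neg)  # 100 positives vs 1,000 negatives -> 1:10
```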
1.2.3 Prototype Selection and Training-Set Selection
1.3 Cross-Validation
Hold-out method, k-fold cross-validation, bootstrap
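A minimal sketch of how k-fold cross-validation partitions the data (my own illustration; scikit-learn's `KFold` is the usual tool): each sample appears in the validation fold exactly once.

```python
def k_fold_splits(n, k):
    # Yield (train_indices, val_indices) pairs for k-fold cross-validation
    # over n samples; the first n % k folds get one extra sample.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    idx = list(range(n))
    start = 0
    for size in fold_sizes:
        val = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, val
        start += size

for train, val in k_fold_splits(10, 5):
    pass  # fit on `train`, evaluate on `val`, then average the k scores
```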
Chapter 2 Feature Engineering
Data and features determine the upper limit of machine learning; models and algorithms merely approach that limit.
A simple model on a large amount of data beats a complex model on a small amount of data.
More data beats a cleverer algorithm, and better data beats more data.
2.1 Feature Extraction
The first step of feature engineering: understand the business data and the business logic
Common statistical features: counts, ratios, summary statistics (mean, peak, quantiles, etc.)
2.1.1 Exploratory Data Analysis
EDA (Exploratory Data Analysis) falls into two parts: visualization and quantitative analysis.
2.1.2 Numerical Features
Processing methods: truncation, binarization, binning (fixed-width/quantile), scaling, missing-value handling (impute or ignore),
feature crosses (combinations via addition, subtraction, multiplication, division; FM/FFM perform feature crossing automatically),
nonlinear encoding (polynomial kernels, etc.), row statistics
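A few of these numerical transforms can be sketched in a handful of lines (illustrative helpers of my own, not code from the notes):

```python
def truncate(x, lo, hi):
    # Truncation: clip a value into [lo, hi] to limit the effect of outliers.
    return max(lo, min(hi, x))

def binarize(x, threshold):
    # Binarization: 1 if the value exceeds the threshold, else 0.
    return 1 if x > threshold else 0

def fixed_width_bin(x, width):
    # Fixed-width binning: map a value to the index of its bucket.
    return int(x // width)

def min_max_scale(xs):
    # Scaling: linearly map values into [0, 1].
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]
```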
2.1.3 Categorical Features
Natural-number (ordinal) encoding, one-hot encoding, hierarchical encoding (e.g. ID-card numbers), hash encoding, count encoding, count-rank encoding, target encoding
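Three of these encodings can be sketched directly (minimal illustrations of my own; the target encoding shown is the unsmoothed version, which in practice needs regularization to avoid leakage):

```python
from collections import Counter

def one_hot(value, vocab):
    # One-hot encoding: a vector with 1 at the category's position.
    return [1 if value == v else 0 for v in vocab]

def count_encode(values):
    # Count encoding: replace each category with its frequency in the data.
    counts = Counter(values)
    return [counts[v] for v in values]

def target_encode(values, targets):
    # Target encoding: replace each category with the mean target of that category.
    sums, counts = {}, {}
    for v, t in zip(values, targets):
        sums[v] = sums.get(v, 0) + t
        counts[v] = counts.get(v, 0) + 1
    return [sums[v] / counts[v] for v in values]
```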
2.1.4 Temporal Features
2.1.5 Spatial Features
2.1.6 Text Features
Corpus construction, text cleaning, tokenization, bag-of-words/N-gram, Skip-Gram, etc.
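Bag-of-words and N-gram extraction over already-tokenized text are simple to sketch (illustrative helpers of my own):

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(tokens, vocab):
    # Count occurrences of each vocabulary word, ignoring order.
    counts = Counter(tokens)
    return [counts[w] for w in vocab]
```

For example, the bigrams of `["a", "b", "c"]` are `("a", "b")` and `("b", "c")`.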
2.2 Feature Selection
Goals of feature selection: simplify the model (make it easier to interpret), improve performance (save storage and compute), improve generalization and reduce the risk of overfitting
Process of feature selection: subset generation, evaluation function, stopping criterion, validation
Methods of feature selection: filter methods, wrapper methods, embedded methods
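A filter method scores each feature independently of any model. A minimal sketch (my own illustration) using variance as the score, in the spirit of scikit-learn's `VarianceThreshold`: near-constant columns carry little information and are dropped.

```python
def variance(xs):
    # Population variance of one feature column.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def variance_filter(columns, threshold):
    # Filter method: keep the indices of columns whose variance exceeds threshold.
    return [i for i, col in enumerate(columns) if variance(col) > threshold]

cols = [[1, 1, 1],   # constant column: variance 0, filtered out
        [0, 1, 0],
        [1, 2, 3]]
kept = variance_filter(cols, 0.1)  # indices of the informative columns
```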
Chapter 3 Common Models
3.1 Logistic Regression
3.2 Field-aware Factorization Machines
3.3 Gradient Boosting Trees
Chapter 4 Model Fusion
4.1 Theoretical Analysis
Gains from fusion, error-ambiguity decomposition of model error, diversity measures, diversity enhancement
4.2 Fusion Methods
Averaging, voting, bagging, stacking
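The two simplest fusion methods, averaging and voting, can be sketched directly (illustrative helpers of my own): averaging suits regression scores, majority voting suits class labels.

```python
from collections import Counter

def average_fusion(predictions):
    # Averaging: per-sample mean of the score predicted by each model.
    # `predictions` is a list of per-model prediction lists.
    return [sum(scores) / len(scores) for scores in zip(*predictions)]

def majority_vote(predictions):
    # Voting: per-sample most common class label across models.
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

# Two regression models over two samples:
avg = average_fusion([[1, 2], [3, 4]])        # [2.0, 3.0]
# Three classifiers over two samples:
vote = majority_vote([[1, 0], [1, 1], [0, 1]])  # [1, 1]
```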