1 Data set
Feature (attribute): a column of the data set
Sample: a row of the data set
Feature (attribute) space: the space spanned by the features
Feature (attribute) vector: a point in the feature space, i.e. the feature values of one sample
Training set: the data used to train the model. Training set + algorithm yields a model that is then used to solve practical problems
Test set: the data used to test how well the model performs. The ratio of training set to test set is usually 6:4, 7:3, or 8:2
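The split described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `train_test_split`, the `seed` parameter, and the toy data are assumptions made for the example, not part of the original notes.

```python
import random

def train_test_split(samples, test_ratio=0.3, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test)."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)      # fixed seed: reproducible split
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

# Each row is one sample: (feature vector, label).
data = [([i, i * 2], i % 2) for i in range(10)]

train, test = train_test_split(data, test_ratio=0.3)   # a 7:3 split
print(len(train), len(test))  # 7 3
```

Shuffling before splitting matters when the rows are stored in some order (for example sorted by label); otherwise the test set may not be representative.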
2 Converting non-numerical features
Label encoding: map each category to an integer
One-hot encoding: map each category to a 0/1 vector with a single 1
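A minimal sketch of both encodings in plain Python follows. The helper names `label_encode` and `one_hot_encode` and the choice to number categories in sorted order are assumptions made for this example.

```python
def label_encode(values):
    """Map each distinct category to an integer (categories in sorted order)."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

def one_hot_encode(values):
    """Map each category to a 0/1 vector with a single 1."""
    codes, classes = label_encode(values)
    rows = [[1 if j == code else 0 for j in range(len(classes))]
            for code in codes]
    return rows, classes

colors = ["red", "green", "blue", "green"]

codes, classes = label_encode(colors)
print(classes)  # ['blue', 'green', 'red']
print(codes)    # [2, 1, 0, 1]

onehot, _ = one_hot_encode(colors)
print(onehot)   # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Label encoding implies an order between categories (blue < green < red here), which is usually meaningless for nominal features; one-hot encoding avoids that at the cost of one column per category.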
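3 Performance metrics (confusion matrix)
TP, TN, FP, FN: the counts of true positives, true negatives, false positives, and false negatives
Accuracy: (TP + TN) / (TP + TN + FP + FN)
Precision: TP / (TP + FP)
True positive rate (recall), TPR: TP / (TP + FN)
False positive rate, FPR: FP / (FP + TN)
F1-score: 2 / (1 / precision + 1 / recall), the harmonic mean of precision and recall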
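As a worked example, the metrics can be computed directly from the four confusion-matrix counts. This is a sketch: the function names and the toy labels are assumptions, and F1 is computed in its standard form as the harmonic mean of precision and recall.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Compute the metrics listed above; ratios with a zero denominator are 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,                          # TPR
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "f1": 2 * precision * recall / total if total else 0.0,
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)  # (3, 1, 5, 1)

m = metrics(*counts)
print(m["accuracy"], m["precision"], m["recall"], m["f1"])  # 0.8 0.75 0.75 0.75
```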
4 Machine learning workflow
Split the data set into a training set and a test set
Train the model on the training set
Test the model on the test set and evaluate it
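The three steps can be sketched end to end as below. The notes do not name a particular algorithm, so a nearest-centroid classifier is swapped in here purely for illustration; the toy data, function names, and seeds are all assumptions.

```python
import random

def split(data, test_ratio, seed):
    """Step 1: shuffle and split into (train, test)."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_ratio))
    return rows[n_test:], rows[:n_test]

def train_nearest_centroid(train):
    """Step 2: 'train' by averaging the feature vectors of each class."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(model, x):
    """Return the class whose centroid is closest to x."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda y: sq_dist(model[y]))

def accuracy(model, test):
    """Step 3: fraction of test samples predicted correctly."""
    return sum(predict(model, x) == y for x, y in test) / len(test)

# Toy data: class 0 clusters near (0, 0), class 1 near (5, 5).
rng = random.Random(42)
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], 0) for _ in range(50)]
data += [([rng.gauss(5, 1), rng.gauss(5, 1)], 1) for _ in range(50)]

train, test = split(data, test_ratio=0.3, seed=0)
model = train_nearest_centroid(train)
print(f"test accuracy: {accuracy(model, test):.2f}")
```

The model never sees the test samples during training, so the test accuracy is an estimate of how it would behave on new data.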
5 Categories of machine learning
The difference between supervised and unsupervised learning is whether class labels are available.
Supervised learning:
Classification: the labels are discrete values
Regression: the labels are continuous values
Unsupervised learning:
Clustering: group samples by the similarity between their features
Dimensionality reduction: reduce the number of dimensions with a machine learning algorithm; this differs from feature selection
Semi-supervised learning:
Active learning: experts label the unlabeled data
Pure semi-supervised learning / transductive learning: put the labeled data and the unlabeled data together and cluster them; within each cluster, following majority rule, give the unlabeled samples the label held by most of the labeled samples in that cluster.
Reinforcement learning:
Mainly used for sequential decision-making problems: good behavior receives a positive reward and poor behavior receives a negative reward.
Transfer learning:
Small-data problem: given two related domains, one with plenty of data and one with little, a model built on the data-rich domain can be applied to the data-poor domain.
Personalization problem
6 Three elements of machine learning
Machine learning = data + algorithm + strategy
Machine learning = model + algorithm + strategy
Algorithm: the method used to solve for the parameters, either analytical (closed form) or numerical
Strategy: the loss function, which should be as small as possible. Ideally the expected loss is minimized, but the expectation depends on the joint distribution p(x, y), which is unknown, so empirical risk minimization is used instead. Adding a regularization penalty term to the empirical risk gives the structural risk.
Model: a decision function (outputs 0 or 1) or a conditional probability function (outputs a probability)
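The three risks mentioned under "Strategy" can be written out as follows. The notation is an assumption chosen for this sketch: L is the loss function, f the model, N the number of training samples, J(f) a measure of model complexity, and lambda >= 0 the penalty weight.

```latex
% Expected risk: cannot be computed because p(x, y) is unknown
R_{\mathrm{exp}}(f) = \mathbb{E}_{(x,y)\sim p}\bigl[L(y, f(x))\bigr]
                    = \int L(y, f(x))\, p(x, y)\, dx\, dy

% Empirical risk: average loss over the N training samples
R_{\mathrm{emp}}(f) = \frac{1}{N} \sum_{i=1}^{N} L\bigl(y_i, f(x_i)\bigr)

% Structural risk: empirical risk plus a complexity penalty
R_{\mathrm{srm}}(f) = \frac{1}{N} \sum_{i=1}^{N} L\bigl(y_i, f(x_i)\bigr) + \lambda J(f)
```

With lambda = 0 the structural risk reduces to the empirical risk; larger lambda favors simpler models.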
7 How to design a machine learning system
First clarify:
Is the problem a machine learning problem at all?
If so, what kind of machine learning problem is it? Supervised or unsupervised learning
After obtaining the data, think about it from two angles:
From the data point of view: is this a supervised or an unsupervised learning problem?
From the business point of view: organize the data, then build the model
Feature engineering:
Process the features
Process the data
Select an algorithm: algorithm + data -> model
Evaluate the model on the test set and settle on the final parameters
When new data arrives, use the model to predict
8 Model generalization
| Category | Underfitting | Overfitting |
| --- | --- | --- |
| Characteristic | Performs poorly on both the training set and the test set | Performs very well on the training set but poorly on the test set |
| Cause | 1 model is too simple | 1 model is too complex 2 data is noisy 3 too little data |
| When it appears | Early in training | Late in training |
| How to handle | 1 add polynomial feature terms 2 increase the polynomial degree 3 reduce the regularization penalty | 1 if the model is too complex, increase the regularization penalty 2 re-clean the data 3 increase the amount of data 4 subsample the samples or the features, randomly discarding some points 5 dropout |
9 Occam's razor
When two models generalize comparably well, choose the simpler one.
10 Regularization
L1 regularization: loss + lambda * ||w||_1, where ||w||_1 is the sum of |w_i|
L2 regularization: loss + lambda * ||w||_2^2, where ||w||_2^2 is the sum of w_i^2
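The two penalty terms are easy to compute by hand. The sketch below assumes the weights are a plain list of floats; `lam` stands for lambda, and the function names are chosen for this example only.

```python
def l1_penalty(w, lam):
    """lambda * sum of absolute weights."""
    return lam * sum(abs(v) for v in w)

def l2_penalty(w, lam):
    """lambda * sum of squared weights."""
    return lam * sum(v * v for v in w)

def regularized_loss(data_loss, w, lam, kind="l2"):
    """Add the chosen penalty to an already computed data loss."""
    if kind == "l1":
        return data_loss + l1_penalty(w, lam)
    if kind == "l2":
        return data_loss + l2_penalty(w, lam)
    raise ValueError("kind must be 'l1' or 'l2'")

w = [0.5, -2.0, 0.0, 1.5]
print(l1_penalty(w, lam=0.5))  # 2.0
print(l2_penalty(w, lam=0.5))  # 3.25
print(regularized_loss(1.0, w, lam=0.5, kind="l1"))  # 3.0
```

Because the L2 penalty squares each weight, it punishes large weights much more heavily than small ones, while the L1 penalty grows linearly and tends to push some weights exactly to zero.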
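11 Cross-validation
Simple cross-validation (hold-out): split the data set into a training set and a test set at a ratio of 6:4, 7:3, or 8:2
K-fold cross-validation: divide the data into k equal parts; each part in turn serves as the test set while the remaining parts form the training set, so k models are trained and their accuracies are averaged
Leave-one-out cross-validation: the special case of K-fold cross-validation where k equals the number of samples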
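A minimal sketch of how the k folds can be generated is shown below. The function names are assumptions for this example, the indices are not shuffled (in practice the data is usually shuffled first), and `score_fold` is a hypothetical callback that trains on one index list and returns a score on the other.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate(n_samples, k, score_fold):
    """Average the score returned by score_fold(train_idx, test_idx) over k folds."""
    scores = [score_fold(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

for train_idx, test_idx in k_fold_indices(6, 3):
    print(train_idx, test_idx)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]

# Leave-one-out is k-fold with k equal to the number of samples.
loo = list(k_fold_indices(4, 4))
print(len(loo), loo[0])  # 4 ([1, 2, 3], [0])
```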