Introduction to Machine Learning - Day 2

1 data set

Feature: a column of the data set

Sample: a row of the data set

Feature (attribute) space: the space spanned by the features

Feature (attribute) vector: a point in the feature space, describing one sample

Training set: the data used to train the model; the training set plus an algorithm yields a model that solves the practical problem

Test set: the data used to test how well the model performs. The ratio of training set to test set is usually 6:4, 7:3, or 8:2
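
The split into a training set and a test set can be sketched in a few lines of standard-library Python. This is only an illustration: the function name `train_test_split`, the `test_ratio` and `seed` parameters, and the toy data are assumptions made for the example, not part of the original notes.

```python
import random

def train_test_split(samples, test_ratio=0.3, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(samples)       # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Ten toy samples, each a (feature vector, label) pair
data = [([i, i * 2], i % 2) for i in range(10)]
train, test = train_test_split(data, test_ratio=0.3)
print(len(train), len(test))  # 7 3
```

Shuffling before splitting matters when the data is stored in some order (for example, sorted by label); otherwise the test set may not resemble the training set.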

2 converting non-numerical features

    Label encoding: map each category to an integer

    One-hot encoding: map each category to a 0/1 vector with a single 1
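
As a minimal sketch of the two encodings, assuming categories are numbered in sorted order (a choice made here for the example; the helper names `label_encode` and `one_hot_encode` are hypothetical):

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Map each value to a 0/1 vector with a single 1 at its category index."""
    codes, mapping = label_encode(values)
    return [[1 if code == j else 0 for j in range(len(mapping))] for code in codes]

colors = ["red", "green", "blue", "green"]
codes, mapping = label_encode(colors)
print(codes)                   # [2, 1, 0, 1]
print(one_hot_encode(colors))  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Label encoding implies an order among the categories (here blue < green < red), which a model may pick up on; one-hot encoding avoids that at the cost of one column per category.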

3 performance matrix (confusion matrix)

 

 

    Accuracy: (TP + TN) / (TP + TN + FP + FN)

    Precision: TP / (TP + FP)

    True positive rate (recall) TPR: TP / (TP + FN)

    False positive rate FPR: FP / (FP + TN)

    F1-score: 2 / (1 / Precision + 1 / TPR)
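
To make the formulas concrete, here is a small sketch that counts the four cells of the confusion matrix and derives the metrics from them. It uses the standard definitions: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is the harmonic mean of precision and recall. The function names and toy labels are assumptions for the example, and the code assumes no denominator is zero.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Compute the metrics; assumes every denominator is non-zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn),       # false positive rate
        "f1": 2 * precision * recall / (precision + recall),
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)  # (3, 1, 5, 1)
result = metrics(*counts)
print(result)
```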

4 machine learning workflow

    Split the data set into a training set and a test set

    Train the model on the training set

    Evaluate the model on the test set
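
The three steps above can be sketched end to end. The notes do not name a particular algorithm, so this example swaps in a nearest-centroid classifier purely because it fits in a few lines; the synthetic two-class data, the 7:3 split, and all function names are assumptions for illustration.

```python
import random

def fit_centroids(train):
    """Training step: compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Predict the class whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Synthetic data: class 0 clusters near (0, 0), class 1 near (5, 5)
rng = random.Random(42)
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], 0) for _ in range(50)]
data += [([rng.gauss(5, 1), rng.gauss(5, 1)], 1) for _ in range(50)]

# Step 1: split into training and test sets (7:3)
rng.shuffle(data)
train, test = data[:70], data[70:]

# Step 2: train the model on the training set
model = fit_centroids(train)

# Step 3: evaluate the model on the test set
correct = sum(predict(model, x) == y for x, y in test)
accuracy = correct / len(test)
print(f"test accuracy: {accuracy:.2f}")
```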

5 categories of machine learning

    The difference between supervised and unsupervised learning is whether the data has class labels.

    Supervised learning:

        Classification: the label takes discrete values

        Regression: the label takes continuous values

    Unsupervised learning:

        Clustering: group samples by the similarity between their features

        Dimensionality reduction: reduce the number of dimensions with a machine learning algorithm, which is different from feature selection

    Semi-supervised learning:

        Active learning: experts label the unlabeled data

        Pure semi-supervised learning / transductive learning: combine the labeled data with the unlabeled data and cluster them all; within each cluster, following majority rule, assign every unlabeled sample the label held by most of the labeled samples.

    Reinforcement learning:

        Mainly used for sequential decision-making problems: good performance earns a positive reward, poor performance earns a negative reward.

    Transfer learning:

        Small-data problem: given two related domains, one with plenty of data and one with little, build the model in the data-rich domain and apply it to the small-data domain.

        Personalization problem
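
The majority-rule labeling described under semi-supervised learning can be sketched as follows. The sketch assumes the cluster assignments have already been produced by some clustering algorithm (not shown); the function name `propagate_labels` and the toy data are hypothetical.

```python
from collections import Counter

def propagate_labels(clusters, labels):
    """Fill in missing labels (None) with the majority label of the same cluster."""
    votes = {}
    for cluster, label in zip(clusters, labels):
        if label is not None:
            votes.setdefault(cluster, Counter())[label] += 1
    filled = []
    for cluster, label in zip(clusters, labels):
        if label is None and cluster in votes:
            # Most frequent known label among samples in this cluster
            label = votes[cluster].most_common(1)[0][0]
        filled.append(label)
    return filled

clusters = [0, 0, 0, 0, 1, 1, 1]                         # cluster id per sample
labels = ["cat", "cat", "dog", None, "dog", None, None]  # None = unlabeled
filled = propagate_labels(clusters, labels)
print(filled)
# ['cat', 'cat', 'dog', 'cat', 'dog', 'dog', 'dog']
```

A cluster with no labeled sample at all keeps its `None` entries, since there is nothing to vote with.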

6 three elements of machine learning

    Machine learning = data + algorithm + strategy

    Machine learning = model + algorithm + strategy

        Algorithm: the method for solving for the parameters, either analytical or numerical

        Strategy: the loss function, which should be as small as possible; that is, the expected loss should be as small as possible. Because the joint distribution p(x, y) is unknown, the expected loss cannot be computed, so empirical risk minimization is used instead. Adding a regularization penalty term to the empirical risk gives the structural risk.

        Model: a decision function (outputs 0 or 1) or a conditional probability function (outputs a probability)
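
A rough sketch of the strategy element: the empirical risk is the average loss over the training samples, and the structural risk adds a weighted complexity penalty to it. The squared loss, the one-weight linear model, the squared-weight penalty, and the value of `lam` are all assumptions chosen for the example.

```python
def empirical_risk(predict, samples, loss):
    """Average loss of the model over the training samples."""
    return sum(loss(predict(x), y) for x, y in samples) / len(samples)

def structural_risk(predict, samples, loss, penalty, lam):
    """Empirical risk plus lam times a penalty on model complexity."""
    return empirical_risk(predict, samples, loss) + lam * penalty

def squared_loss(y_hat, y):
    return (y_hat - y) ** 2

# Hypothetical linear model y = w * x with a single weight
w = 2.0

def model(x):
    return w * x

samples = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5)]  # (x, y) pairs

emp = empirical_risk(model, samples, squared_loss)
struct = structural_risk(model, samples, squared_loss, penalty=w ** 2, lam=0.1)
print(emp, struct)
```

With `lam = 0` the two risks coincide; a larger `lam` pushes the optimum toward simpler models.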

7 how to design a machine learning system

    First clarify:

        Is the problem a machine learning problem?

        What kind of machine learning problem is it? Supervised or unsupervised learning

    After obtaining the data, think about it from two angles:

        From the data point of view: is it a supervised or an unsupervised learning problem

        From the business point of view: organize the data and build the model

    Feature engineering:

        Process the features

        Process the data

    Data + chosen algorithm -> model

    Test the model on the test set and fix the final parameters

    When new data arrives, make predictions

8 model generalization

    Underfitting:

        Feature: poor performance on both the training set and the test set

        Cause: 1 the model is too simple

        When it appears: early in training

        How to deal with it: 1 add polynomial terms; 2 increase the degree of the polynomial terms; 3 reduce the regularization penalty

    Overfitting:

        Feature: very good performance on the training set, poor performance on the test set

        Causes: 1 the model is too complex; 2 the data is dirty; 3 the amount of data is too small

        When it appears: late in training

        How to deal with it: 1 for a model that is too complex, increase the regularization penalty; 2 re-clean the data; 3 increase the amount of data; 4 subsample the features or the samples; 5 dropout: randomly discard some nodes

9 Occam's razor

    When two models generalize about equally well, choose the simpler one.

10 regularization

    L1 regularization: + lambda * |w| (the sum of the absolute values of the weights)

    L2 regularization: + lambda * |w|^2 (the sum of the squared weights)
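
The two penalty terms can be computed directly. This is a minimal sketch; the weight vector and the value `lam=0.5` are made up for the example.

```python
def l1_penalty(weights, lam):
    """lam times the sum of the absolute values of the weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """lam times the sum of the squared weights."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0, 3.0]
print(l1_penalty(weights, lam=0.5))  # 2.75
print(l2_penalty(weights, lam=0.5))  # 6.625
```

Note how the L2 penalty is dominated by the largest weight (3.0 contributes 9 of the 13.25 total), while the L1 penalty grows only linearly with each weight.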

11 cross-validation

    Simple cross-validation: split the data set into a training set and a test set at a ratio such as 6:4, 7:3, or 8:2

    K-fold cross-validation: divide the data into k equal parts; each part in turn serves as the test set while the rest form the training set, so k models are trained; average their accuracy

    Leave-one-out validation: a special case of K-fold cross-validation
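
K-fold splitting can be sketched with plain index lists. The helper names and the `evaluate` callback (which would train on one index list and score on the other) are hypothetical; a real run would usually shuffle the indices first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train_idx, test_idx) over k folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

for train_idx, test_idx in k_fold_indices(6, 3):
    print(train_idx, test_idx)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Calling `k_fold_indices(n, n)` gives leave-one-out validation: every fold holds out exactly one sample.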


Origin www.cnblogs.com/zhuome/p/11516201.html