Machine Learning--Basic Concepts

Types of machine learning

By the presence or absence of labels

1. Supervised learning

Supervised learning: use samples with known labels (or known characteristics) as a training set to build a mathematical model, then use the trained model to predict unknown samples. This is the most commonly used machine learning method; formally, it is the task of inferring a model from a labeled training dataset. (Source: https://www.cnblogs.com/yifanrensheng/p/12076877.html )

1. Discriminative model: directly models the conditional probability p(y|x). Common discriminative models include linear regression, decision trees, support vector machines (SVM), k-nearest neighbors, neural networks, etc.

(1) Linear regression: nothing to say

(2) Decision tree: nothing to say

(3) Support vector machine (SVM): the goal is to find the hyperplane whose distance to the nearest sample points of each class is as large as possible, i.e., the maximum-margin hyperplane.

(4) K-nearest neighbors: given a training dataset and a new input instance, find the K training instances closest to it (the K neighbors mentioned above); whichever class the majority of these K instances belong to, the new instance is assigned to that class (a minimal sketch follows this list).

(5) Neural networks: go dig out your college textbooks.
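
To make the K-nearest-neighbors idea concrete, here is a minimal NumPy sketch (the function name and toy data are my own illustration, not from the original article):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training sample
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest samples
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: two small clusters with labels 0 and 1
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))  # expected: 1
```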

2. Generative model: models the joint probability distribution p(x, y). Common generative models include the Hidden Markov Model (HMM), the Naive Bayes model, the Gaussian Mixture Model (GMM), LDA, etc.

(1) Hidden Markov Model (HMM): models a sequence as a Markov chain over hidden states, each of which emits an observable output.

(2) Naive Bayesian model: https://zhuanlan.zhihu.com/p/37575364

(3) Gaussian mixture model (GMM): a mixture model is a probability model that represents an overall distribution as K sub-distributions; in other words, it models the probability distribution of the observed data as a mixture of K component distributions.
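
A quick sketch of fitting a GMM, assuming scikit-learn is available (the two-blob toy data are my own illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two 1D Gaussian blobs, stacked into a column vector
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)]).reshape(-1, 1)

# Fit a mixture of K = 2 Gaussians and inspect the recovered components
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_.ravel())   # roughly [0, 5] (component order may vary)
print(gmm.weights_)         # roughly [0.5, 0.5]
labels = gmm.predict(X)     # hard cluster assignment for each sample
```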

2. Unsupervised learning

Unsupervised learning: compared with supervised learning, the training set contains no manually assigned labels. During unsupervised learning the data are not specially labeled, and the model's task is to infer some internal structure of the data.

1. Unsupervised learning tries to learn or extract the features hidden behind the data, i.e., to extract important feature information from the data. Common algorithms include clustering, dimensionality reduction, text processing (feature extraction), etc.

2. Unsupervised learning is often used as an early data-processing step for supervised learning; its role is to extract the necessary label or feature information from the raw data.

3. Common algorithms:

(1) Clustering: https://blog.csdn.net/qq_40597317/article/details/80949123 (a minimal K-Means sketch follows this list)

(2) Dimensionality reduction: The link above also briefly introduces dimensionality reduction

(3) Text processing:
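
As promised under the clustering item, a minimal K-Means sketch, assuming scikit-learn is available (the three-blob toy data are my own illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three loose blobs in 2D, with no labels
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# Partition the unlabeled points into k = 3 clusters
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # approximately (0,0), (5,5), (0,5)
print(km.labels_[:10])       # cluster index assigned to each sample
```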

3. Semi-supervised learning

Semi-supervised learning: studies how to use a small number of labeled samples together with a large number of unlabeled samples for training and classification; it is a combination of supervised and unsupervised learning.

 

By function 

Classification and regression are representative of supervised learning, while clustering is representative of unsupervised learning.

1. Classification: a classification model maps the samples in a dataset to one of a given set of categories (usually supervised).

2. Clustering: a clustering model divides the samples in a dataset into several groups such that samples in the same group are highly similar to one another (unsupervised).

3. Regression: characterizes the attribute values of the samples in a dataset, expressing the mapping between samples as a function and thereby discovering dependencies among attribute values (a minimal least-squares sketch follows this list).

4. Association rules: uncover hidden associations or relationships among data items, i.e., the occurrence of one data item can be used to infer the occurrence of others (also unsupervised).
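
As referenced in the regression item, a minimal least-squares fit with NumPy (the toy data and the choice of np.polyfit are my own illustration):

```python
import numpy as np

# Toy regression data: y is roughly 3*x + 2 plus noise
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 3 * x + 2 + rng.normal(0, 1, size=x.shape)

# Least-squares fit of a degree-1 polynomial (simple linear regression)
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)           # close to 3 and 2
y_pred = slope * 5.0 + intercept  # predicted value at x = 5
```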

Common algorithms at a glance (algorithm name: description; a rookie like me just needs to grasp the concepts, haha):

C4.5: classification decision tree algorithm; the core decision tree algorithm, an improved version of ID3. (to be understood)

CART: Classification and Regression Trees. (to be understood)

kNN: K-nearest-neighbor classification; if the majority of the k samples most similar to a given sample in feature space belong to a certain category, then that sample belongs to the category as well.

NaiveBayes: Bayesian classification model; it is well suited when correlations among attributes are small. When attribute correlations are large, a decision tree model tends to outperform it (reason: the naive Bayes model assumes the attributes are mutually independent). See the sketch after this list.

SVM: support vector machine, a supervised statistical learning method widely used in statistical classification and regression analysis.

EM: Expectation-Maximization algorithm, commonly used for data clustering in machine learning and computer vision.

Apriori: association rule mining algorithm.

K-Means: clustering algorithm that partitions n objects into k groups (k < n) according to their attribute features; unsupervised.

PageRank: one of the important algorithms behind Google search.

AdaBoost: an iterative algorithm that combines multiple classifiers to classify data.
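
As noted in the NaiveBayes entry, a minimal Gaussian Naive Bayes sketch, assuming scikit-learn is available (the dataset and split are my own illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Naive Bayes assumes the features are conditionally independent given the class
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print(nb.score(X_test, y_test))  # accuracy on the held-out split
```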

 

Common terms in machine learning

1. Generalization ability: the ability of a machine-learning model to handle samples it has never seen before; informally, the ability to draw inferences from one case to others, or to apply what has been learned.

2. Overfitting: the model fits the training set very closely but fits other datasets poorly. Overfitting leads to high variance (i.e., large differences between the outputs of models trained on different training sets and the expected output of those models).

Overfitting solutions: (1) obtain more training data; (2) increase the degree of regularization. (to be understood)

3. Underfitting: the model fails to capture the characteristics of the data and cannot fit it well. An underfit model performs poorly on the training set and poorly on the test set as well. Underfitting leads to high bias (i.e., a large difference between the model's expected output and the true output).

Underfitting solutions: (1) use a more complex model; (2) reduce the degree of regularization. (what exactly is the degree of regularization?)
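
On that question about the degree of regularization: in L2 (ridge) regression, a penalty term lambda * ||w||^2 is added to the training loss, and lambda is exactly that degree. A minimal sketch, assuming scikit-learn is available (the toy data are my own illustration); here `alpha` plays the role of lambda:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Toy data: y depends linearly on x, plus noise; few samples invite overfitting
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, 0.0, 0.0, 2.0, 0.0]) + rng.normal(0, 0.5, size=20)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha is the "degree of regularization"

print(plain.coef_)   # unregularized coefficients
print(ridge.coef_)   # shrunk toward zero, trading variance for bias
```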

4. Cross Validation

Cross-validation is a statistical analysis method used to evaluate the performance of a classifier. The basic idea is to split the original data into groups: one part serves as the training set and the other as the validation set. The classifier is first trained on the training set, and the trained model is then tested on the validation set; the result is used as a performance indicator for the classifier.

Common cross-validation methods:

4.1. Simple cross-validation (Hold-Out Method)

Randomly divide the original data into two groups, one as the training set and one as the validation set; train the classifier on the training set, then validate the model on the validation set, and record the final classification accuracy as the classifier's performance index.

The advantage of simple cross-validation is that it is easy to carry out: the original data only needs to be split randomly into two groups. Strictly speaking, the Hold-Out method is not really CV, because it involves no actual "crossing". Since the original data are grouped randomly, the final validation accuracy depends heavily on how the data happened to be split.
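
A minimal hold-out split, assuming scikit-learn is available (the dataset and classifier choice are my own illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hold-out: one random split into a training set and a validation set
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(clf.score(X_val, y_val))  # accuracy on the held-out 30%
```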

4.2. K-fold cross-validation (K-fold Cross Validation)

Divide the original data into K groups (usually of equal size). Each subset in turn serves as the validation set while the remaining K-1 subsets form the training set, yielding K models. The average classification accuracy of these K models on their respective validation sets is used as the classifier's performance index under K-fold CV. K-CV effectively helps guard against overfitting and underfitting.
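
A minimal K-fold example with K = 5, assuming scikit-learn is available (the dataset and model choice are my own illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, validate on the remaining fold, 5 times
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)         # one accuracy per fold
print(scores.mean())  # averaged accuracy = the K-CV performance index
```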

 

Machine Learning Development Process

1. Data collection:

Data sources: user access/behavior data, business data, external third-party data

Data storage:

1. Data to be stored: original data, preprocessed data, model results

2. Storage facilities: MySQL, HDFS, HBase, Solr, Elasticsearch, Kafka, Redis, etc.

Data collection methods: Flume & Kafka (to be mastered for the big-data track)

In actual work we can use business data for machine-learning development, but during the learning process there is no business data available. In that case we can use public datasets; commonly used ones are listed below:

http://archive.ics.uci.edu/ml/datasets.html

https://aws.amazon.com/cn/public-datasets/

https://www.kaggle.com/competitions

http://www.kdnuggets.com/datasets/index.html

http://www.sogou.com/labs/resource/list_pingce.php

https://tianchi.aliyun.com/datalab/index.htm  Domestic: Tianchi data

http://www.pkbigdata.com/common/cmptIndex.html

2. Data preprocessing

In most cases the collected data must be preprocessed before an algorithm can use it. Preprocessing mainly includes the following steps:

  1. Data filtering
  2. Handling missing data
  3. Handling possible exceptions, errors, or outliers
  4. Merging data from multiple data sources
  5. Data summarization

Initial preprocessing converts the data into a representation suitable for machine-learning models; for many model types this means vectors or matrices of numerical data:

  1. Encode categorical data into a numerical representation (generally one-hot encoding; see the sketch after this list)
  2. Extract useful features from text data (generally bag-of-words or TF-IDF)
  3. Process image or audio data (pixels, sound waves, amplitude, etc.; Fourier transforms, and wavelet transforms mainly for images)
  4. Convert numerical data into categorical data to reduce the number of values a variable takes, e.g., binning ages into age brackets
  5. Apply transformations to numerical data, e.g., logarithmic transformation
  6. Regularize and standardize features so that different input variables of the same model share the same value range
  7. Combine or transform existing variables to generate new features, e.g., averages or dummy variables; keep experimenting
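
A minimal one-hot encoding sketch, assuming pandas is available (the toy column is my own illustration):

```python
import pandas as pd

# Toy categorical column
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df, columns=["color"])
print(one_hot)  # columns color_blue, color_green, color_red with 0/1 (or boolean) indicators
```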

Bag-of-words: treat the text as an unordered collection of terms; the text's features are then captured by the terms T that appear in it, so the set of terms and their occurrence counts reflect the characteristics of the document.

TF-IDF: the importance of a term increases with the number of times it appears in a document, but decreases with how often it appears across the whole corpus. In other words, the more often a term appears in a given text, the more important it is to that text; and the fewer texts it appears in overall, the more important it is to the texts where it does appear. TF (term frequency) is the number of times a term appears in a document, generally normalized (term count / total number of terms in the document). IDF (inverse document frequency) measures how informative a term is, generally computed as the logarithm of the total number of documents divided by the number of documents containing the term. TF-IDF is simply TF * IDF.
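
A minimal sketch of both representations, assuming scikit-learn is available (the toy documents are my own illustration; note that scikit-learn uses a smoothed variant of the IDF formula above):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Bag of words: raw term counts per document
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

# TF-IDF: counts reweighted so corpus-wide common terms (e.g. "the") score lower
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))
```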

3. Feature extraction

4. Model Construction

Model selection: choosing the optimal modeling method for a specific task, or the best parameters for a specific model.

5. Model Testing and Evaluation

1. Run the model (algorithm) on the training dataset and test its effect on the test dataset, iteratively revising the model. This approach is called cross-validation (split the data into a training set and a test set, build the model on the training set, and evaluate it on the test set to obtain suggestions for revision).

2. For model selection, run as many candidate algorithms as practical and compare their results.

3. Models are generally compared on the following measures: accuracy, recall, precision, and F-score.

  1. Accuracy = number of correctly classified samples / total number of samples
  2. Recall = number of correctly predicted positive samples / number of actual positive samples (coverage)
  3. Precision = number of correctly predicted positive samples / number of samples predicted as positive
  4. F-score = 2 * Precision * Recall / (Precision + Recall) (i.e., the F-score is the harmonic mean of precision and recall)
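
A tiny worked example of these formulas on hypothetical predictions (the numbers are made up for illustration):

```python
# Hypothetical binary predictions vs. ground truth (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 3
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 1
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 1
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # 3

accuracy = (tp + tn) / len(y_true)                  # 6/8 = 0.75
precision = tp / (tp + fp)                          # 3/4 = 0.75
recall = tp / (tp + fn)                             # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75
print(accuracy, precision, recall, f1)
```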

6. Model process:


Article source: https://www.cnblogs.com/yifanrensheng/p/12076877.html
