Personal Notes on Common Machine Learning Algorithms

Artificial intelligence is the simulation of the information processes of human consciousness and thought.

Machine learning is a general term for a class of algorithms that attempt to mine hidden patterns from large amounts of historical data and use them for prediction or classification. More specifically, machine learning can be seen as searching for a function whose input is sample data and whose output is the desired result, but this function is too complicated to be expressed formally. It should be noted that the goal of machine learning is to make the learned function work well on new samples, not just perform well on the training samples. The ability of the learned function to apply to new samples is called generalization ability.

Machine learning can be regarded as an application of mathematical statistics. A common task in mathematical statistics is fitting: given some sample points, find a suitable curve that reveals how these sample points vary with the independent variable.

1. Decision tree

Classification proceeds according to features: each node asks a question and, based on the answer, splits the data into two groups, each of which keeps answering further questions. The questions are learned from the existing data; when new data is input, it is routed by the questions in the tree until it lands in the appropriate leaf.

Decision trees are a machine learning method. Decision tree generation algorithms include ID3, C4.5, and C5.0. A decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a classification result.

Decision trees are a very common classification method that requires supervised learning. In supervised learning we are given a set of samples, each with a set of attributes and a known classification result. A decision tree is learned from these samples, and it can then assign the correct classification to new data.

The key question is: which feature in the data set plays the decisive role in dividing the data into classes?

Basic knowledge:

Entropy: entropy depends only on the distribution of X, not on the particular values X takes. It is used to measure uncertainty: the larger the entropy, the more uncertain the outcome X = xi, and vice versa. Information entropy reflects the complexity of the data; the higher the entropy, the more mixed the data. If splitting on a certain feature leads to a large decrease in entropy, the "influence" of that feature is very large.
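Written out, the standard (Shannon) definition for a variable X taking values x_i with probabilities p(x_i) is:

```latex
H(X) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)
```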

Information gain: in decision tree algorithms, this is the indicator used to select features. It is the decrease in entropy produced by splitting on a feature; the greater the information gain, the better that feature is for splitting.
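As a concrete illustration of both quantities, here is a minimal Python sketch (the toy weather data is made up for the example) computing the entropy of a label set and the information gain of splitting on one feature:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Entropy of the labels minus the weighted entropy after
    splitting the rows on the values of one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Toy example: does the "windy" feature (index 1) help predict "play"?
rows = [("sunny", True), ("sunny", False), ("rain", True), ("rain", False)]
labels = ["no", "yes", "no", "yes"]
print(information_gain(rows, labels, 1))  # 1.0: a perfect split
```

Splitting on "windy" separates the labels perfectly here, so the gain equals the full entropy of 1 bit.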

Advantages: low computational complexity; the output is easy to understand.
Disadvantages: prone to overfitting (the tree fits the current data perfectly but does not generalize to new data).
Applicable scenarios: data with many feature dimensions.

2. Random Forest

Random forest is an algorithm that combines multiple trees through ensemble learning. Its basic unit is the decision tree, and in essence it belongs to a major branch of machine learning: ensemble methods. The name contains two keywords, "random" and "forest". "Forest" is easy to understand: if one tree is a tree, then hundreds or thousands of trees make a forest. The metaphor is apt, and it also captures the main idea of random forest: the embodiment of ensemble thinking.

Intuitively, each decision tree is a classifier (assuming a classification problem), so for an input sample, N trees produce N classification results. The random forest collects all the classification votes and designates the category with the most votes as the final output. This is the simplest bagging idea.
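A minimal sketch of this voting idea with scikit-learn's RandomForestClassifier, using its bundled iris data (assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each fit on a bootstrap sample of the training data;
# the forest's prediction is the majority vote of the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy on held-out data
print(forest.feature_importances_)   # per-feature importance (see point 4 below)
```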

Features:

  1. Its accuracy is unexcelled among current algorithms;

  2. It runs efficiently on large databases;

  3. It can handle input samples with thousands of high-dimensional features, without dimensionality reduction or variable deletion;

  4. It gives estimates of which variables are important in the classification;

  5. It produces an internal unbiased estimate of the generalization error as the forest building progresses;

  6. It has an effective method for estimating missing data and maintains accuracy even when a large proportion of the data is missing.

https://www.cnblogs.com/maybe2030/p/4585705.html

3. Regression

The general steps for a regression problem are:

  1. Find the hypothesis function h;
  2. Construct the cost function J (the loss function);
  3. Find a way to minimize J and solve for the regression parameters (θ).

Logistic regression:
Although logistic regression has "regression" in its name, it is actually a classification method, mainly used for binary classification problems (there are only two outputs, representing two categories). It uses the logistic function (also called the sigmoid function).

The cost function and the overall J function, derived from maximum likelihood estimation, are as follows.
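In the standard notation, with the sigmoid hypothesis, these take the usual textbook form:

```latex
h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}

\mathrm{Cost}(h_\theta(x), y) =
  \begin{cases}
    -\log(h_\theta(x))     & \text{if } y = 1 \\
    -\log(1 - h_\theta(x)) & \text{if } y = 0
  \end{cases}

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
  \Big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big]
```

Minimizing J(θ), for example by gradient descent, yields the fitted parameters θ.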

A multi-class problem can be reduced to binary problems: hold out one class and treat all remaining classes as the other class. For each class i, train a logistic regression classifier to predict the probability that y = i; for a new input x, run every classifier and take the class with the highest probability as the result.
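A minimal one-vs-rest sketch with scikit-learn (assuming it is installed; iris is its bundled three-class dataset):

```python
# One-vs-rest logistic regression: one binary classifier per class.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes

# Train one logistic classifier per class; prediction picks the class
# whose classifier reports the highest probability.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
print(ovr.predict(X[:5]))  # predicted class labels for the first five samples
```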

4. K nearest neighbor algorithm

The k-nearest neighbors algorithm is simple in principle: a sample is classified by majority vote among its k closest training points. In practical applications, however, all the data must be stored, and a distance must be computed to every sample in the data set for each query, which consumes a lot of computing resources. The k-nearest neighbor algorithm is therefore generally not applied to complex classification problems.
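A minimal sketch of the procedure in plain NumPy (the toy data is made up for the example); it makes the cost visible, since every query computes a distance to every stored sample:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every stored sample
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]             # majority vote

# Toy data: two clusters of two points each
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.8, 5.2])))  # -> 1
```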


Advantages: high accuracy; insensitive to outliers

Disadvantages: requires a lot of storage space; high computational complexity

Applicable scenarios: small amounts of low-dimensional data

https://www.zhihu.com/question/26726794/answer/424154856

5. SVM (Support Vector Machine)

The core idea of SVM is to find a decision boundary between the different categories such that the two kinds of samples fall on opposite sides of the boundary as much as possible, and as far away from it as possible.

Applicable scenarios: SVM performs excellently on many data sets. Relatively speaking, SVM's tendency to keep the samples at a distance from the boundary makes it more resistant to attack.
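A minimal linear SVM sketch with scikit-learn (assumed installed), on synthetic two-class data:

```python
# Max-margin linear SVM on synthetic blobs.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0)  # linear large-margin separator
clf.fit(X, y)
# The support vectors are the samples closest to the decision boundary;
# they alone determine where the boundary lies.
print(clf.support_vectors_.shape)
```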

6. Neural network

The core idea is to use training samples to gradually adjust the parameters. Take predicting height as an example: suppose one input feature is gender (1: male; 0: female) and the output is height (1: tall; 0: short). When a training sample is a tall boy, the connection from "male" to "tall" in the neural network is strengthened. In the same way, if a tall girl comes along, the connection from "female" to "tall" is strengthened. In the end, which routes in the neural network are stronger is determined by the samples.
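A toy single-neuron sketch of this "strengthening routes" idea, with made-up samples matching the example (an illustration only, not a full network):

```python
import numpy as np

# One input (gender: 1 = male, 0 = female) plus a bias; one sigmoid output
# (height: 1 = tall, 0 = short). The four samples are made up for the example.
X = np.array([[1.0], [1.0], [0.0], [0.0]])  # two boys, two girls
y = np.array([1.0, 1.0, 1.0, 0.0])          # both boys tall, one girl tall

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(1), 0.0, 0.5
for _ in range(1000):
    for x_i, y_i in zip(X, y):
        err = sigmoid(w @ x_i + b) - y_i
        # Each sample nudges the weight of its active input: tall boys
        # strengthen the "male" -> "tall" route.
        w -= lr * err * x_i
        b -= lr * err

print(w, b)  # the learned connection strengths
```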

7. Federated Learning

Federated learning is a distributed machine learning framework with privacy protection and secure encryption technology. It aims to let decentralized participants collaborate on training a machine learning model without disclosing their private data to the other participants.

The training process of the classic federated learning framework can be briefly summarized as the following steps (a toy simulation follows the list):

  1. The coordinator establishes the basic model and informs the participants of the basic structure and parameters of the model;
  2. Each participant uses local data for model training and returns the results to the coordinator;
  3. The coordinator aggregates the participants' models into a more accurate global model, improving the performance and effect of the model as a whole.
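A minimal NumPy simulation of these three steps, in the spirit of federated averaging (the model, data, and plain-average aggregation here are toy stand-ins, not a production protocol; real frameworks would also encrypt the exchanged updates):

```python
import numpy as np

def local_train(weights, X, y, lr=0.1, epochs=5):
    """Step 2: one participant runs a few gradient steps of linear
    regression on its own local data and returns the updated weights."""
    w = weights.copy()
    for _ in range(epochs):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
# Each participant's private data stays in its own array, never pooled.
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + 0.1 * rng.normal(size=50)))

global_w = np.zeros(2)  # step 1: the coordinator's base model
for _ in range(10):
    updates = [local_train(global_w, X, y) for X, y in clients]  # step 2
    global_w = np.mean(updates, axis=0)  # step 3: aggregate by averaging

print(global_w)  # approaches [2, -1] without any client sharing raw data
```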

The federated learning framework involves many technologies: traditional machine-learning model training, algorithms for the coordinator to aggregate parameters, communication technology for efficient transmission between coordinator and participants, and encryption for privacy protection. In addition, the framework includes an incentive mechanism, so that data holders are motivated to participate and the benefits are shared.

There are two types of federated learning architecture: the centralized federated (client/server) architecture and the decentralized federated (peer-to-peer) architecture.

For federated learning scenarios that unite many individual users, the client/server architecture is generally adopted, with an enterprise acting as the server that coordinates the global model;

For scenarios where multiple enterprises facing the data-silo dilemma jointly train a model, a peer-to-peer architecture is generally adopted, because it is difficult to choose one enterprise to act as the coordinating server.

Advantages and Prospects of Federated Learning

A conventional distributed machine learning framework first collects the data centrally, then stores it in a distributed manner and dispatches tasks to multiple CPU/GPU machines, thereby improving computing efficiency. Federated learning differs in that the data stays local at each participant from the start, and privacy-protection technology is added to the training process, so it offers much better privacy protection.

Each participant's data always remains stored locally. During modeling, the parties' databases continue to exist independently, and the parameters exchanged during joint training are encrypted; all parties use strict encryption algorithms when communicating, making it difficult to leak information about the original data. Federated learning thus ensures data security and privacy.

In addition, federated learning can make the result of distributed training almost identical to that of traditional centralized training: the global model obtained is almost lossless, and all participants benefit together.

With the rapid development of big data and artificial intelligence, federated learning addresses the problems of unusable (siloed) data and privacy leakage in the training of artificial intelligence models, so its application prospects are very broad. Federated learning can be used for model training over massive data sets to enable collaboration between departments, enterprises, and organizations. For example:

  • In smart finance, more accurate business models can be built on multi-party data, enabling reasonable pricing, targeted business promotion, and corporate risk-control assessment;

  • In smart cities, it enables cooperation among government agencies and between enterprises and governments, supporting more accurate real-time traffic forecasting, more streamlined administrative procedures, more efficient information queries, more comprehensive security monitoring, and so on;

  • In smart healthcare, federated learning can integrate data across hospitals, improve the accuracy of medical image diagnosis, and provide early warning of patients' conditions.

https://blog.csdn.net/zw0Pi8G5C1x/article/details/116810484

Origin: blog.csdn.net/mossfan/article/details/123401565