How to choose the right machine learning algorithm

 

  1. Understand the data

The type and shape of your data play a key role in deciding which algorithm to use. Some algorithms can work with relatively small samples, while others require very large ones. Certain algorithms also suit certain kinds of data. For example, the Naive Bayes algorithm works particularly well with categorical inputs and is not sensitive to missing values. You therefore need to:

Know your data:

  1. View summary statistics and data visualizations (see the sketch after this list).
     1. Percentiles can help you identify the range in which most of the data falls
     2. The mean and median can describe central tendency
     3. Correlation coefficients can indicate strong associations
  2. Visualize the data.
     1. Box plots can identify outliers
     2. Density plots and histograms can show the spread of the data
     3. Scatter plots can describe bivariate relationships
  3. Clean the data.
     1. Handle missing values. Missing data affects some models more than others, and even models designed to cope with missing data can be sensitive to it (missingness in certain variables can lead to poor predictions)
     2. Choose a strategy for handling outliers:
        1. Outliers are common in multidimensional data
        2. Some models are less sensitive to outliers than others. In general, tree-based models are less sensitive to their presence, whereas regression models, or any model that fits an equation to the data, can be severely affected by them
        3. Outliers can be the result of poor data collection, or they can be legitimate extreme values
  4. Augment the data.
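
As a minimal sketch of the exploration step above (using a synthetic pandas DataFrame as a stand-in for your own data; the column names are made up purely for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for a real dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 12, 500).clip(18, 90),
    "income": rng.lognormal(10, 0.5, 500),
})

# Summary statistics: percentiles, mean/median, correlations
print(df.describe())   # count, mean, std, min, quartiles, max
print(df.median())     # central tendency
print(df.corr())       # correlation coefficients

# Visualizations: box plot (outliers), histogram (spread), scatter (relationships)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df.boxplot(column="income", ax=axes[0])
df["age"].hist(bins=30, ax=axes[1])
axes[2].scatter(df["age"], df["income"], s=5)
plt.show()
```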

Feature engineering is the process of turning raw data into features that are ready for modeling. It can serve the following purposes:

  1. Make the model easier to interpret (e.g. data binning)
  2. Capture more complex relationships (e.g. neural networks)
  3. Reduce data redundancy and dimensionality (e.g. principal component analysis (PCA))
  4. Rescale variables (e.g. standardization or normalization)

Different models may have different feature engineering requirements. Some models have built-in feature engineering.
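
For instance, a few of these transformations might look like the following scikit-learn sketch (the raw feature matrix here is random data, purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

X = np.random.default_rng(0).normal(size=(200, 10))  # stand-in raw features

# Binning one feature into 4 quantile bins (easier to interpret)
binned = KBinsDiscretizer(n_bins=4, encode="ordinal").fit_transform(X[:, :1])

# Rescaling: standardize to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Dimensionality reduction: keep components explaining 95% of the variance
X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)
print(X_reduced.shape)
```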

  2. Categorize the problem

The next step is to categorize the problem. This is a two-step process:

  1. Categorize by input:
  1. If you have labeled data, then this is a supervised learning problem.
  2. If you have unlabeled data and want to find useful structure from it, then this is an unsupervised learning problem.
  3. If you want to optimize an objective function by interacting with the environment, then that's a reinforcement learning problem.

2. Categorize by output:

  1. If the output of the model is a (continuous) number, then this is a regression problem.
  2. If the model output is a category, then it is a classification problem.
  3. If the output of the model is a set of clusters divided by the input data, then it is a clustering problem.
  4. If you want to detect unusual data points, then it is an anomaly (outlier) detection problem.

3. Understand the constraints you need to satisfy

  1. How much storage capacity do you have? Depending on the storage capacity of your system, you may not be able to store gigabytes of classification or regression models, or gigabytes of data for clustering. This is the case in embedded systems, for example.
  2. Is the speed of prediction a requirement? In real-time applications, it is obviously important to get predictions as fast as possible. For example, in autonomous driving, road signs must be classified as fast as possible to avoid accidents.
  3. Is the speed of learning a requirement? In some cases, training a model quickly is necessary: sometimes you need to rapidly update your model, on the fly, with a different dataset.
  4. Find available algorithms

Once you have a clear understanding of your task and environment, you can use the tools at your disposal to identify the algorithms that are feasible and practical for the problem you are solving. Some of the factors affecting your choice of model are:

  1. Whether the model meets the business goals
  2. How much preprocessing the model requires
  3. How accurate the model is
  4. How interpretable the model is
  5. How fast the model is: how long does it take to build the model, and how long does it take to make predictions?
  6. How scalable the model is

Model complexity is an important criterion affecting algorithm selection. In general, a more complex model has the following characteristics:

  1. It relies on more features to learn and predict (e.g. using ten features instead of two to predict a target)
  2. It relies on more complex feature engineering (e.g. using polynomial terms, interactions, or principal components)
  3. It has more computational overhead (e.g. a random forest of 100 decision trees instead of a single decision tree)

Besides that, the same machine learning algorithm can be made more complex based on the number of parameters and the choice of certain hyperparameters:

For example:

Regression models can have more features, or polynomial and interaction terms.

Decision trees can have greater or lesser depth.

Making the same algorithm more complex increases the chance of overfitting.

Commonly used machine learning algorithms:

Linear regression

This is probably the simplest algorithm in machine learning. You use a regression algorithm when you want to predict some continuous value rather than classify the output, for example when you need to forecast the future value of an ongoing process. However, linear regression is unstable when features are redundant, i.e. when there is multicollinearity.

Linear regression can be considered in the following situations (a fitting sketch follows the list):

  1. Estimate the time it takes to travel from one place to another
  2. Predict sales of a product for the next month
  3. Estimate the effect of blood alcohol content on coordination
  4. Forecast monthly gift card sales to improve yearly revenue estimates
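
A minimal sketch of fitting and using a linear regression with scikit-learn; the two features and their coefficients here are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: predict monthly sales from ad spend and price
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(200, 2))        # columns: [ad_spend, price]
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 5, 200)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)          # the learned linear equation
print(model.predict([[50.0, 20.0]]))          # forecast for a new data point
```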

Logistic regression

Logistic regression performs binary classification, so its output is a binary label. It takes a linear combination of features as input and applies a nonlinear function (the sigmoid) to it, which makes it a very small instance of a neural network.

Logistic regression gives you many ways to regularize your model, and you don't have to worry as much about your features being correlated as you do with Naive Bayes. The model also has a nice probabilistic interpretation and, unlike decision trees or support vector machines, is easy to update with new data. Use logistic regression if you want a probabilistic framework, or if you expect to quickly fold more training data into your model in the future. Logistic regression can also give you a good sense of the factors behind a prediction; it is not entirely a black-box method. Consider logistic regression in the following situations (a sketch follows the list):

  1. Predict churn
  2. Credit scoring and fraud detection
  3. Evaluate marketing campaigns
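
A minimal sketch, using scikit-learn's LogisticRegression on synthetic data, of the points above: regularization is a single parameter, and the outputs are probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic churn-style data: binary labels from 20 numeric features
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# C controls regularization strength (smaller C = stronger penalty)
clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)

print(clf.predict_proba(X[:2]))   # probabilistic interpretation of the output
print(clf.coef_[0][:5])           # coefficients hint at the factors behind predictions
```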

Decision trees

Decision trees are rarely used alone, but different decision trees can be combined into very efficient algorithms, such as random forests or gradient boosting algorithms.

Decision trees easily handle feature interactions, and because they are non-parametric, you don't have to worry about outliers or whether the data is linearly separable. One disadvantage of decision trees is that they do not support online learning, so the tree has to be rebuilt whenever new samples arrive. Another disadvantage is that they easily overfit, which is where ensemble methods such as random forests (or boosted trees) come in. Decision trees can also consume a lot of memory (the more features you have, the deeper and larger your decision tree is likely to be). Decision trees are great for helping you choose among several courses of action (a sketch follows the list):

  1. Make investment decisions
  2. Predict churn
  3. Find out who may default on a bank loan
  4. Choose between "build" and "buy" options
  5. Qualify sales leads
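
A minimal sketch with scikit-learn's DecisionTreeClassifier on a bundled dataset, showing both the depth cap (to limit overfitting) and the interpretability of the fitted tree:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth caps the tree's complexity to limit overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(tree.score(X_test, y_test))

# The fitted tree is a readable sequence of if/else decisions
print(export_text(tree, max_depth=2))
```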

K-means

Sometimes you have no labels for your data at all, and your goal is to assign labels to objects according to their features; this kind of problem is called a clustering task. Clustering algorithms suit this situation, for example when you have a large group of users and want to divide them into particular groups based on common attributes. The biggest disadvantage of this method is that k-means needs to know in advance how many clusters your data contains, which may require a lot of experimentation to settle on the optimal number of clusters, K, as in the sketch below.
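
A minimal sketch of that trial-and-error with scikit-learn's KMeans on synthetic blobs: fit several values of K and compare the within-cluster sum of squares (the "elbow" heuristic):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# K must be chosen up front, so try several values and compare inertia
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # within-cluster sum of squares; look for the "elbow"

# Once K is chosen, assign each point to a cluster
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```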

Principal Component Analysis (PCA)

Principal component analysis reduces the dimensionality of your data. Sometimes you have a wide range of features, probably highly correlated with each other, and models can overfit on such a large amount of data. In that case, you can apply PCA to compress the features into a smaller set of uncorrelated components, as sketched below.
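
A minimal sketch: build a feature matrix whose first three columns are nearly copies of each other, and let scikit-learn's PCA show that a couple of components carry most of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Correlated features: columns 1 and 2 are noisy copies of column 0
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 1))
X = np.hstack([base,
               base + 0.1 * rng.normal(size=(300, 1)),
               base + 0.1 * rng.normal(size=(300, 1)),
               rng.normal(size=(300, 2))])

pca = PCA().fit(StandardScaler().fit_transform(X))
print(pca.explained_variance_ratio_)             # first components dominate
X_small = PCA(n_components=2).fit_transform(X)   # keep 2 of the 5 dimensions
```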

Support Vector Machines

Support Vector Machines (SVM) are a supervised machine learning technique widely used in pattern recognition and classification problems, when your data has exactly two classes.

Support vector machines achieve high accuracy and come with good theoretical guarantees against overfitting. With an appropriate kernel function, they can work well even when your data is not linearly separable in the base (low-dimensional) feature space. SVMs are especially popular for text classification, where very high-dimensional input spaces are the norm. However, SVMs are memory-intensive, hard to interpret, and difficult to tune.

Consider support vector machines in situations like the following (a kernel sketch follows the list):

  1. Find people with common diseases such as diabetes
  2. Recognize handwritten characters
  3. Classify text, e.g. sort articles by topic
  4. Predict stock market prices
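
A minimal kernel sketch with scikit-learn's SVC: concentric circles are not linearly separable in their original 2-D space, but an RBF kernel handles them easily:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Data that is not linearly separable in its original 2-D space
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

# An RBF kernel implicitly maps the data into a higher-dimensional space
print(SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y).score(X, y))

# A linear kernel struggles on the same data
print(SVC(kernel="linear").fit(X, y).score(X, y))
```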

Naive Bayes

This is a classification technique based on Bayes' theorem that is easy to build and works well on large-scale datasets. Besides its simplicity, Naive Bayes reportedly outperforms even some highly sophisticated classification methods. It is also a good option when CPU and memory resources are limited.

Naive Bayes is very simple; you are essentially just doing a bunch of counting. If its conditional independence assumption actually holds, a Naive Bayes classifier converges faster than discriminative models such as logistic regression, so it needs less training data. And even when the assumption does not hold, the classifier often still performs well in practice. If you want a fast, simple model with good performance, Naive Bayes is a good choice; its biggest disadvantage is that it cannot learn interactions between features. Typical uses include the following (a small counting sketch follows the list):

  1. Sentiment analysis and text classification
  2. Recommendation systems such as those at Netflix and Amazon
  3. Face recognition
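
A minimal sketch of the counting at work, using scikit-learn's MultinomialNB for sentiment-style text classification; the four-document corpus is made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus: 1 = positive, 0 = negative
texts = ["great product, loved it", "terrible, waste of money",
         "excellent quality", "awful experience, very bad"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(texts)           # word counts per document

clf = MultinomialNB().fit(X, labels)   # essentially per-class word counting
print(clf.predict(vec.transform(["bad quality, terrible"])))
```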

Random forest

Random forest is an ensemble of decision trees. It can solve both regression and classification problems with large-scale datasets, and it also helps identify the most important variables among thousands of input variables. Random forests are very scalable: they work with data of any dimensionality and usually perform quite well. (Genetic algorithms, likewise, scale well to any dimensionality with minimal knowledge about the data itself; the simplest implementation is the microbial genetic algorithm.) However, random forest training can be slow (depending on the parameterization), and it is not possible to iteratively improve the generated models. Typical uses include the following (a feature-importance sketch follows the list):

  1. Predict high-risk patients
  2. Predict part failures in manufacturing
  3. Predict loan defaulters
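
A minimal sketch with scikit-learn's RandomForestClassifier on a bundled dataset, showing how feature importances surface the most relevant input variables:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(data.data, data.target)

# Feature importances rank the input variables by usefulness
ranked = sorted(zip(rf.feature_importances_, data.feature_names), reverse=True)
for score, name in ranked[:5]:
    print(f"{name}: {score:.3f}")
```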

Neural Networks

A neural network contains connection weights between neurons that are balanced and adjusted as it learns from data, one point at a time. Once all the weights are trained, the neural network can predict a class or, if regression is required, a specific value for a new data point. With neural networks, extremely complex models can be trained and used as a black-box approach, without having to perform unpredictable, complicated feature engineering before training. Combined with deep learning, even less transparent models can be employed to achieve new tasks. For example, object recognition results have recently been greatly improved by deep neural networks. Deep learning is also applied to unsupervised tasks such as feature extraction, where it can extract features from raw images or language data with much less human intervention.

On the other hand, neural networks are hard to interpret, and configuring them (choosing architectures and hyperparameters) is incredibly involved. They are also resource- and memory-intensive.
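
As a minimal sketch (using scikit-learn's small MLPClassifier rather than a deep learning framework, purely to keep the example self-contained):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers; the connection weights are adjusted during training
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))
```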

Conclusion

Generally speaking, the requirements above let you screen out some algorithms, but knowing which method is best up front is hard. You are better off iterating the selection process: feed your data to the promising machine learning algorithms you have identified, run them in parallel or one after another, and then evaluate their performance to choose the best one, roughly as in the sketch below.
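
A minimal sketch of that iteration, assuming scikit-learn and a synthetic classification dataset: run several candidates the same way and compare cross-validated accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive bayes": GaussianNB(),
    "SVM": SVC(),
    "random forest": RandomForestClassifier(random_state=0),
}

# Evaluate each candidate identically, then pick the best performer
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```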

Finding the right solution to a real-life problem is rarely just a matter of applying a mathematical method. It requires understanding the business needs, the rules and regulations, and the concerns of stakeholders, as well as considerable domain expertise. Being able to combine and balance all of these while solving a machine learning problem is crucial.

 
