Machine Learning in Practice--Overview of Machine Learning

Goals and methods

  • Scikit-Learn is very easy to use and implements many machine learning algorithms efficiently, so it makes a great entry point for learning machine learning.
  • TensorFlow is a more sophisticated library for distributed numerical computation. By distributing computation across hundreds of GPU (graphics processing unit) servers, it can efficiently train and run large neural networks. TensorFlow was created by Google and powers many large-scale machine learning applications.
  • Keras is a high-level deep learning API that makes training and running neural networks very simple. It can run on top of TensorFlow, Theano or Microsoft Cognitive Toolkit (CNTK). TensorFlow ships with its own implementation of this API, called tf.keras, which supports certain advanced TensorFlow features (such as the ability to load data efficiently).

Machine learning applies to

  • Problems whose existing solutions require a lot of manual fine-tuning or long lists of rules: a machine learning algorithm can often simplify the code and perform better than the traditional approach.
  • Complex problems that are difficult to solve with traditional methods: the best machine learning techniques may be able to find a solution.
  • Fluctuating environments: a machine learning system can adapt to new data.
  • Getting insights about complex problems and large amounts of data.

Machine learning application examples

  • Analyzing product images on the production line to automatically classify products
    This is an image classification problem, a typical example of using a convolutional neural network (CNN).
  • Finding tumors through brain scans
    This is semantic segmentation: every pixel in the image is classified (because we want to determine the exact location and shape of the tumor), typically with CNNs as well.
  • Automatically classifying news
    This is natural language processing (NLP), more specifically text classification, which can be tackled with recurrent neural networks (RNNs), CNNs, or Transformers.
  • Automatically flagging negative comments in forums
    This is also text classification, using the same NLP tools.
  • Automatically summarizing long articles
    This is a branch of NLP called text summarization, which again uses the same tools.
  • Creating a chatbot or personal assistant
    This involves many NLP components, including natural language understanding (NLU) and question-answering modules.
  • Predicting a company's revenue for the next year based on many performance indicators
    This is a regression task (that is, predicting values). It can be tackled with any regression model, such as linear regression, polynomial regression, SVM regression, random forest regression, or an artificial neural network. To take sequences of past performance indicators into account, RNNs, CNNs, or Transformers can be used.
  • Making apps react to voice commands
    This is speech recognition, which requires processing audio samples. Because audio samples are long and complex sequences, they are generally processed with RNNs, CNNs, or Transformers.
  • Detecting credit card fraud
    This is anomaly detection.
  • Segmenting customers based on their purchase records so that a different marketing strategy can be designed for each segment
    This is clustering.
  • Representing complex, high-dimensional data sets in clear and insightful diagrams
    This is data visualization, which often involves dimensionality reduction techniques.
  • Recommending products that may interest a customer based on previous purchase records
    This is a recommender system. One approach is to feed previous purchase records (and other customer information) into an artificial neural network that outputs the products the customer is most likely to purchase. Such a network is trained on the purchase records of all customers.
  • Building intelligent bots for games
    This is often tackled with reinforcement learning (RL), a branch of machine learning in which agents (such as bots) are trained to choose the actions that maximize their rewards over time (for example, a bot may be rewarded every time the player loses some life points). The famous AlphaGo program, which beat the world champion at the game of Go, was built using RL.

Types of Machine Learning Systems

  • Whether or not they are trained with human supervision (supervised, unsupervised, semi-supervised, and reinforcement learning)
  • Whether or not they can learn incrementally on the fly (online versus batch learning)
  • Whether they work by simply comparing new data points to known data points, or instead by detecting patterns in the training data and building a predictive model, much as scientists do (instance-based versus model-based learning)

Supervised learning and unsupervised learning

Machine learning systems can be divided into four main categories according to the amount and type of supervision they receive during training: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

  • Supervised Learning
    In supervised learning, the training set fed to the algorithm includes the desired solutions, called labels.
    Classification is a typical supervised learning task.
    A spam filter is a good example: it is trained on many example emails along with their class (spam or regular), and it must learn how to classify new emails.
    Another typical task is to predict a target numeric value (such as the price of a car) given a set of features called predictors (mileage, age, brand, etc.). This type of task is called regression. To train such a system, you need to give it many examples of cars, including both their predictors and their labels (i.e., their prices).
    Regression problem: predict a value given input features (there are often multiple input features, and sometimes multiple output values).
    In machine learning, an attribute is a data type (e.g., "mileage"), while a feature has several meanings depending on the context. Usually, though, a feature means an attribute plus its value (e.g., "mileage = 15000").
    Some regression algorithms can also be used for classification. Logistic regression is widely used for classification because it can output a value that corresponds to the probability of belonging to a given class (for example, a 20% chance of being spam).
Important supervised learning algorithms
  • k-nearest neighbor algorithm
  • linear regression
  • logistic regression
  • Support vector machine (SVM)
  • Decision trees and random forests
  • Neural Networks
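
As a small illustration of the point about logistic regression outputting class probabilities, here is a minimal sketch using Scikit-Learn. The toy data (one feature per email, a count of suspicious words) is made up for the example and is not from the original article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature per email (number of suspicious words)
# and a label (1 = spam, 0 = regular). These values are made up.
X = np.array([[0], [1], [2], [3], [8], [9], [10], [12]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns one column per class: [P(regular), P(spam)]
proba = clf.predict_proba([[2], [11]])
print(proba)

# predict picks the more probable class for each email
predicted = clf.predict([[2], [11]])
print(predicted)
```

The second column of `proba` is the estimated spam probability, which is the kind of value the text refers to when it mentions "a 20% chance of being spam".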

Unsupervised learning
The training data of unsupervised learning is unlabeled. The system will learn without a "teacher".

Important unsupervised learning algorithms
  • Clustering algorithm
    k-means algorithm
    DBSCAN algorithm
    Hierarchical cluster analysis (HCA)

  • Anomaly Detection and Novelty Detection
    One-class SVM
    Isolation Forest

  • Visualization and dimensionality reduction
    Principal Component Analysis (PCA)
    Kernel PCA
    Locally Linear Embedding (LLE)
    t-Distributed Stochastic Neighbor Embedding (t-SNE)

  • Association rule learning
    Apriori
    Eclat

Visualization algorithms are also good examples of unsupervised learning: you feed them a large amount of complex, unlabeled data, and they output a 2D or 3D representation of the data that can easily be plotted. These algorithms try to preserve as much structure as they can (for example, trying to keep separate clusters in the input from overlapping in the visualization) so that you can understand how the data is organized and perhaps identify unsuspected patterns.

A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car's mileage may be strongly correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car's wear and tear. This is called feature extraction.

It is often a good idea to reduce the dimensionality of the training data with a dimensionality reduction algorithm before feeding it to another machine learning algorithm (such as a supervised learning algorithm). It will run faster, the data will take up less disk and memory space, and in some cases it may also perform better.
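
The mileage and age example can be sketched with PCA, one of the dimensionality reduction algorithms listed earlier. The car data below is synthetic and the variable names are illustrative assumptions; the only point is that two strongly correlated features collapse into a single component that keeps most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical cars: age in years, and a mileage that grows with age plus noise
age = rng.uniform(1, 15, size=200)
mileage = 12_000 * age + rng.normal(0, 8_000, size=200)
X = np.column_stack([mileage, age])

# Standardize first so that mileage (large numbers) does not dominate age
X_scaled = StandardScaler().fit_transform(X)

# Project the two correlated features onto a single component
pca = PCA(n_components=1)
wear = pca.fit_transform(X_scaled)

print(wear.shape)                     # one "wear and tear" value per car
print(pca.explained_variance_ratio_)  # share of the variance kept by that value
```

The resulting `wear` column could then be fed to a supervised algorithm in place of the two original features.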

Another important unsupervised task is anomaly detection: for example, detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a data set before feeding it to another machine learning algorithm. The system is trained on normal instances, so when it sees a new instance it can tell whether that instance looks normal or abnormal.

A very similar task is novelty detection. Its purpose is to detect new instances that look different from all instances in the training set. For example, if you have many photos of dogs and 1% of them are Chihuahuas, then a novelty detection algorithm should not treat new pictures of Chihuahuas as novelties. An anomaly detection algorithm, on the other hand, may consider these dogs so rare and so different from other dogs that it would likely classify them as anomalies.
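
The anomaly detection workflow described above (train on mostly normal instances, then judge new ones) can be sketched with Isolation Forest, one of the algorithms listed earlier. The two-feature "transactions" are synthetic, and the choice of features is purely an assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical "normal" transactions: two features clustered around the origin
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

detector = IsolationForest(random_state=0)
detector.fit(X_train)

# predict returns +1 for instances that look normal and -1 for anomalies
X_new = np.array([[0.1, -0.2],   # close to the training data
                  [8.0, 8.0]])   # far away from everything seen so far
labels = detector.predict(X_new)
print(labels)
```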

Another common unsupervised task is association rule learning, whose purpose is to dig into large amounts of data and discover interesting relations between attributes. For example, suppose you own a supermarket: running an association rule algorithm on your sales logs may reveal that people who buy barbecue sauce and potato chips also tend to buy steak, so you might place those items close to one another.
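
To make the supermarket example concrete, here is a minimal pure-Python sketch of the two quantities that association rule algorithms such as Apriori are built on, support and confidence. The sales log and the helper function names are made up for illustration; this is not an implementation of Apriori itself.

```python
# Hypothetical sales log: each transaction is the set of items in one basket
transactions = [
    {"barbecue sauce", "potato chips", "steak"},
    {"barbecue sauce", "potato chips", "steak", "beer"},
    {"barbecue sauce", "potato chips"},
    {"potato chips", "beer"},
    {"steak", "salad"},
    {"barbecue sauce", "potato chips", "steak"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


rule_if = {"barbecue sauce", "potato chips"}
rule_then = {"steak"}

rule_support = support(rule_if | rule_then, transactions)       # 3 of 6 baskets
rule_confidence = confidence(rule_if, rule_then, transactions)  # 3 of the 4 baskets with sauce and chips
print(rule_support, rule_confidence)
```

A rule with high support and high confidence, like this one, is the kind of relation that would suggest placing the items near one another.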

Semi-supervised learning

Since labeling data is usually time-consuming and costly, you will often have plenty of unlabeled instances and few labeled ones. Some algorithms can deal with data that is only partially labeled. This is called semi-supervised learning. Some photo-hosting services, such as Google Photos, are good examples. Once you upload all your photos to the service, it automatically recognizes that person A appears in photos 1, 5, and 11, while person B appears in photos 2, 5, and 7. This is the unsupervised part of the algorithm (clustering). Now all the system needs is for you to tell it who these people are. Once you give each person a label, it can name everyone in every photo, which is useful for searching photos.
Most semi-supervised learning algorithms are combinations of unsupervised and supervised algorithms. For example, deep belief networks (DBNs) are based on stacked unsupervised components called restricted Boltzmann machines (RBMs). The RBMs are trained in an unsupervised manner, and the whole system is then fine-tuned using supervised learning techniques.

Reinforcement learning

Reinforcement learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must then learn by itself what the best strategy, called a policy, is to get the most reward over time. A policy defines what action the agent should choose in a given situation. For example, many robots learn how to walk using reinforcement learning algorithms, and DeepMind's AlphaGo program is also a good example of reinforcement learning.
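
The loop of acting, receiving rewards, and improving a policy can be sketched in a deliberately simplified setting. The example below is an epsilon-greedy agent on a three-action "bandit" problem (a single situation, no states), which is far simpler than what AlphaGo or walking robots use. The environment, its reward probabilities, and all names are assumptions made up for illustration.

```python
import random

rng = random.Random(0)

# Hypothetical environment: three actions with hidden reward probabilities
reward_probability = [0.2, 0.5, 0.8]


def step(action):
    """Return a reward of 1 with the action's hidden probability, else 0."""
    return 1 if rng.random() < reward_probability[action] else 0


n_actions = len(reward_probability)
value_estimate = [0.0] * n_actions  # running average reward per action
pull_count = [0] * n_actions
epsilon = 0.1                       # fraction of steps spent exploring

for _ in range(5000):
    if rng.random() < epsilon:
        action = rng.randrange(n_actions)  # explore: try a random action
    else:
        # exploit: follow the current policy (best estimated action)
        action = max(range(n_actions), key=lambda a: value_estimate[a])
    reward = step(action)
    pull_count[action] += 1
    # Incremental mean update of the estimated value of the chosen action
    value_estimate[action] += (reward - value_estimate[action]) / pull_count[action]

best_action = max(range(n_actions), key=lambda a: value_estimate[a])
print(value_estimate, best_action)
```

Here the learned policy is simply "pick the action with the highest estimated reward". In real reinforcement learning problems the policy also depends on the observed state of the environment.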

Batch learning and online learning

Another criterion for classifying machine learning systems is whether the system can learn incrementally from an incoming data stream.

Batch learning

In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This generally takes a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning.

If you want a batch learning system to learn new data (such as a new type of spam), you need to retrain a new version of the system on the complete data set (including new and old data), then decommission the old system and replace it with the new one.

Fortunately, the entire process of training, evaluating, and launching a machine learning system can be easily automated, so even batch learning systems can adapt to changes. The data just needs to be constantly updated and new versions of the system trained as frequently as needed.

This solution is simple and usually works fine, but training on the full data set can take several hours, so you would typically train a new system only every day or even every week. If your system needs to adapt to rapidly changing data (for example, to predict stock prices), then you need a more reactive solution. In addition, training on the full data set requires a lot of computing resources (CPU, memory space, disk space, disk I/O, network I/O, etc.). If you have a lot of data and you automate your system to train from scratch every day, it will end up costing you a lot of money. And if the amount of data is huge, it may even be impossible to use a batch learning algorithm.

Online Learning

In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.
In online learning, a model is trained and put into production, then continues to learn as new data comes in.
Online learning is a great fit for systems that receive data as a continuous flow (such as stock prices) and need to adapt to change rapidly or autonomously. Once an online learning system has learned about new data instances, it does not need them anymore, so they can be discarded. This can save a lot of space.
Online learning algorithms can also be used to train systems on huge data sets that cannot fit in one machine's main memory (this is called out-of-core learning). The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data.
Out-of-core learning is usually done offline (that is, not on the live system), so the name online learning can be misleading. Think of it as incremental learning.
One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate. If you set a high learning rate, the system will rapidly adapt to new data, but it will also tend to quickly forget the old data. Conversely, if you set a low learning rate, the system will have more inertia; that is, it will learn more slowly, but it will also be less sensitive to noise in the new data or to sequences of nonrepresentative data points (outliers).
A big challenge with online learning is that if bad data is fed to the system, its performance will gradually decline.
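
The incremental training loop and the role of the learning rate can be sketched with a hand-written mini-batch gradient step in NumPy. The data stream is simulated (an assumed relationship y = 3x + 2 plus noise), and the function name `data_stream` is hypothetical; in practice you might use a library estimator that supports incremental updates instead.

```python
import numpy as np

rng = np.random.default_rng(0)


def data_stream(n_batches, batch_size=10):
    """Hypothetical data stream: y = 3x + 2 plus a little noise, in mini-batches."""
    for _ in range(n_batches):
        x = rng.normal(size=batch_size)
        y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=batch_size)
        yield x, y


w, b = 0.0, 0.0      # model parameters, updated as data arrives
learning_rate = 0.1  # higher adapts faster but forgets old data sooner

for x, y in data_stream(n_batches=300):
    error = (w * x + b) - y
    # Gradient of the mean squared error on this mini-batch only
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    # The mini-batch is not stored anywhere: once used, it can be discarded

print(w, b)
```

Each pass through the loop touches only ten instances, which is why the same pattern also works for out-of-core learning: the mini-batches could just as well be chunks read from a file that is too large for memory.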

Instance-based learning vs. model-based learning

Another way to categorize machine learning systems is by how they generalize. Most machine learning tasks are about making predictions. This means that, given a number of training examples, the system needs to be able to make good predictions for (generalize to) examples it has never seen before. Achieving good performance on the training data is important, but not sufficient. The true goal is to perform well on new instances.
There are two main approaches to generalization: instance-based learning and model-based learning.

Instance-based learning

In instance-based learning, the system learns the examples by heart, then generalizes to new instances by using a similarity measure to compare them to the learned instances (or a subset of them).
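
A minimal from-scratch sketch of this idea is k-nearest neighbors regression: store the training instances, and for a new instance average the labels of the k most similar stored ones. The numbers below are invented (they are not the OECD data used later), and `knn_predict` is a hypothetical helper, not a library function.

```python
import numpy as np

# Hypothetical training instances (GDP per capita, life satisfaction); values are made up
X_train = np.array([9_000.0, 12_000.0, 20_000.0, 25_000.0, 37_000.0, 50_000.0])
y_train = np.array([5.0, 5.2, 5.7, 6.0, 6.5, 7.2])


def knn_predict(x_new, X_train, y_train, k=3):
    """Average the labels of the k stored instances most similar to x_new."""
    distances = np.abs(X_train - x_new)  # similarity measure: smaller = more similar
    nearest = np.argsort(distances)[:k]  # indices of the k closest stored instances
    return y_train[nearest].mean()


prediction = knn_predict(22_587.0, X_train, y_train, k=3)
print(prediction)
```

Nothing is "fitted" here: the training set itself is the model, and all the work happens at prediction time.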

Model-based learning

Another way to achieve generalization from a set of examples is to build a model of those examples and then use the model to make predictions. This is called model-based learning.

Specifying a performance measure for the model

To measure how well a model performs, you can either define a utility function (or fitness function) that measures how good the model is, or define a cost function that measures how bad it is. For linear regression problems, a common choice is a cost function that measures the distance between the linear model's predictions and the training examples; the objective is to minimize this distance.
This is where the linear regression algorithm comes in: you feed it your training examples, and it finds the parameters that make the linear model fit your data best. This is called training the model.
Model selection consists of choosing the type of model and fully specifying its architecture. Training a model means running an algorithm that finds the model parameters that best fit the training data (and that will hopefully make good predictions on new data).
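
The relationship between a cost function and training can be sketched in a few lines of NumPy. The mean squared error used here is one common choice of cost function for linear regression, assumed for this example since the text does not name a specific one. The data is synthetic (generated from y = 4 + 3x plus noise), and `mse_cost` is a hypothetical helper.

```python
import numpy as np

# Hypothetical training set generated from y = 4 + 3x plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=100)
y = 4.0 + 3.0 * x + rng.normal(scale=0.5, size=100)


def mse_cost(theta0, theta1, x, y):
    """Mean squared distance between the linear model's predictions and the targets."""
    predictions = theta0 + theta1 * x
    return np.mean((predictions - y) ** 2)


# Training = finding the parameters that minimize the cost.
# For linear regression, least squares gives them directly.
A = np.column_stack([np.ones_like(x), x])
(theta0, theta1), *_ = np.linalg.lstsq(A, y, rcond=None)

fitted_cost = mse_cost(theta0, theta1, x, y)  # cost at the fitted parameters
untrained_cost = mse_cost(0.0, 0.0, x, y)     # cost of an arbitrary, untrained model
print(theta0, theta1)
print(fitted_cost, untrained_cost)
```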

Using Scikit-Learn to train and run a linear model

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model


def prepare_country_stats(oecd_bli, gdp_per_capita):
    # Keep only the rows where INEQUALITY is "TOT"
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"] == "TOT"]
    # pandas.DataFrame.pivot returns a DataFrame reshaped by the given index/column values
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)
    gdp_per_capita.set_index("Country", inplace=True)
    # pandas.merge joins DataFrames (or named Series) with a database-style join
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    # Sort the rows by GDP per capita
    full_country_stats.sort_values(by="GDP per capita", inplace=True)
    # Drop a few samples, identified by their position in the sorted table
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = sorted(set(range(36)) - set(remove_indices))
    # pandas.DataFrame.iloc selects rows by integer position
    return full_country_stats[["GDP per capita", "Life satisfaction"]].iloc[keep_indices]


# Load the data
oecd_bli = pd.read_csv("./datasets/lifesat/oecd_bli_2015.csv", thousands=",")
gdp_per_capita = pd.read_csv("./datasets/lifesat/gdp_per_capita.csv", thousands=",",
                             delimiter="\t", encoding="latin_1", na_values="n/a")
# Prepare the data
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]
# Visualize the data
country_stats.plot(kind="scatter", x="GDP per capita", y="Life satisfaction")
plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()
# Train the model
model.fit(X, y)
# Intercept (t0) and slope (t1) of the fitted line
t0, t1 = model.intercept_[0], model.coef_[0][0]
t0, t1

(4.853052800266436, 4.911544589158483e-05)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus's GDP per capita
print(model.predict(X_new))  # predicted life satisfaction; outputs [[ 5.96242338]]

[[5.96242338]]

# Replace the linear regression model with a k-nearest neighbors regression model
import sklearn.neighbors
model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
model.fit(X, y)
# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus's GDP per capita
print(model.predict(X_new))  # predicted life satisfaction; outputs [[5.76666667]]

[[5.76666667]]

In short:

  • Study the data.
  • Select a model.
  • Train it on the training data (that is, the learning algorithm searches for the model parameter values that minimize a cost function).
  • Finally, apply the model to make predictions on new cases (this is called inference), hoping that the model will generalize well.

Main challenges of machine learning

Simply put, since your main task is to choose a learning algorithm and train it on some data, the two most likely problems are "bad algorithm" and "bad data".

Insufficient quantity of training data

Most machine learning algorithms require large amounts of data to work properly. Even the simplest problems will most likely require thousands of examples, and for complex problems such as image or speech recognition, millions of examples may be required.

The training data is not representative

In order to generalize well, it is crucial that the training data be very representative of the new examples to which it will generalize. This is true whether you are using instance-based learning or model-based learning.

Using a representative training set is harder than it sounds. If the sample is too small, you will have sampling noise (that is, nonrepresentative data as a result of chance), but even very large samples can be nonrepresentative if the sampling method is flawed. This is called sampling bias.

Low-quality data

If the training set is full of errors, outliers, and noise (data generated by low-quality measurements), the system will have a harder time detecting the underlying patterns and be less likely to perform well. So taking the time to clean the training data is well worth the investment.

  • If some instances are clearly anomalies, it can be helpful to simply discard them or try to fix the errors manually.
  • If some instances are missing a few features (e.g., 5% of customers did not specify their age), you must decide whether to ignore the feature altogether, ignore the instances with missing values, fill in the missing values (e.g., with the median age), or train one model with the feature and one model without it.
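
The first three options for handling missing values can be sketched with pandas. The customer table below is invented for the example, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table with some missing ages
customers = pd.DataFrame({
    "age": [25.0, np.nan, 41.0, 33.0, np.nan, 58.0],
    "purchases": [3, 7, 1, 4, 2, 9],
})

# Option 1: ignore the feature entirely
without_age = customers.drop(columns=["age"])

# Option 2: ignore the instances whose age is missing
complete_rows = customers.dropna(subset=["age"])

# Option 3: fill the gaps with the median of the known ages
median_age = customers["age"].median()
imputed = customers.fillna({"age": median_age})

print(median_age)
print(imputed)
```

If you choose option 3, it is usually wise to compute the median on the training set only and reuse that same value for any data the model sees later.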

Irrelevant features

The system can only learn if the training data contains enough relevant features and not too many irrelevant ones. A key part of a successful machine learning project is coming up with a good set of features to train on.
This process is called feature engineering, and it involves the following steps:

  • Feature selection (selecting the most useful features from existing features for training)
  • Feature extraction (integrating existing features to produce more useful features - dimensionality reduction algorithms can help)
  • Create new features by collecting new data.

Overfitting the training data

A model that performs well on the training data but does not generalize well is said to overfit the training data.
Complex models such as deep neural networks can detect subtle patterns in the data, but if the training set is noisy, or if it is too small (which introduces sampling noise), the model is likely to detect patterns in the noise itself. Obviously, these patterns will not generalize to new instances.
Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. Possible solutions are as follows:

  • Simplify the model: You can choose a model with fewer parameters (for example, choose a linear model instead of a higher-order polynomial model), you can reduce the number of attributes in the training data, or you can constrain the model.
  • Collect more training data.
  • Reduce noise in training data (e.g., fix data errors and eliminate outliers)
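
The effect of model complexity on a small, noisy training set can be seen by comparing a linear fit with a high-degree polynomial fit. This is a sketch on synthetic data (the underlying relationship is assumed to be a straight line), and `make_data` and `fit_and_score` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(3)


def make_data(n):
    """Hypothetical noisy samples from an underlying straight line."""
    x = rng.uniform(-1, 1, size=n)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=n)
    return x, y


x_train, y_train = make_data(15)  # small, noisy training set
x_test, y_test = make_data(200)   # fresh data from the same source


def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (training MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse


for degree in (1, 10):
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The degree-10 model always fits the training points at least as closely as the straight line, because it has more parameters. What matters is the test error, which for an overly complex model is typically much worse.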

Constraining a model to make it simpler and reduce the risk of overfitting is called regularization. You need to find the right balance between fitting the training data perfectly and keeping the model simple enough to ensure that it generalizes well.

The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of the learning algorithm (not of the model). As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training. If you set the regularization hyperparameter to a very large value, you will get an almost flat model (a slope close to zero). The learning algorithm will almost certainly not overfit the training data, but it will also be less likely to find a good solution. Tuning hyperparameters is an important part of building a machine learning system.
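
The flattening effect of a large regularization hyperparameter can be sketched with ridge regression, which is one regularized linear model among several; the text does not say which one it has in mind, so this choice is an assumption. The data is synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(7)

# Hypothetical data with a clear upward trend: y = 2x + noise
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=50)

# alpha is the regularization hyperparameter: it is chosen before training
# and is not modified by the learning algorithm
slopes = {}
for alpha in (0.01, 100.0, 1_000_000.0):
    model = Ridge(alpha=alpha)
    model.fit(X, y)
    slopes[alpha] = model.coef_[0]
    print(f"alpha = {alpha:>12}: slope = {slopes[alpha]:.4f}")
```

As `alpha` grows, the learned slope shrinks toward zero, so the model becomes almost flat regardless of the trend in the data.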

Underfitting the training data

Underfitting is the opposite of overfitting: it occurs when the model is too simple to learn the underlying structure of the data. For example, a linear model of life satisfaction is prone to underfit.
The main ways to solve underfitting:

  • Choose a more powerful model with more parameters.
  • Provide learning algorithms with better feature sets (feature engineering).
  • Reduce the constraints on the model (for example, by reducing the regularization hyperparameter).

About machine learning

  • Machine learning is about making machines get better at some task by learning from data, instead of having to explicitly code rules.
  • There are many types of machine learning systems: supervised and unsupervised, batch and online, instance-based and model-based, etc.
  • In a machine learning project, you gather data in a training set and feed that training set to a learning algorithm. If the algorithm is model-based, it tunes some parameters to fit the model to the training set (i.e., to make good predictions on the training set itself), and then hopefully it can make good predictions on new cases as well. If the algorithm is instance-based, it learns the examples by heart and generalizes to new instances by using a similarity measure to compare them to the learned instances.
  • The system will not perform well if the training set is too small, or if the data is not representative, is noisy, or is polluted with irrelevant features. Finally, the model should be neither too simple (which would lead to underfitting) nor too complex (which would lead to overfitting).

Testing and Validation

The only way to know how well a model will generalize to new cases is to actually try it out on new cases. One way to do this is to deploy the model in a production environment and monitor its output.
A better option is to split the data into two sets: the training set and the test set. You train the model using the training set and test it using the test set. The error rate on new cases is called the generalization error (or out-of-sample error), and by evaluating the model on the test set you get an estimate of this error. This estimate tells you how well the model will perform on instances it has never seen before.
If the training error is low (i.e., the model makes few mistakes on the training set) but the generalization error is high, the model is overfitting the training data.
It is common to use 80% of the data for training and hold out 20% for testing. However, this depends on the size of the data set.
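
An 80/20 split can be sketched with Scikit-Learn's `train_test_split`. The data set is synthetic, and the mean squared error is used here as the error measure, which is an assumption made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical data set of 100 instances
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(scale=2.0, size=100)

# Hold out 20% of the data; the model never sees it during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# Mean squared error on each set; the test value estimates the generalization error
train_mse = np.mean((model.predict(X_train) - y_train) ** 2)
test_mse = np.mean((model.predict(X_test) - y_test) ** 2)
print(len(X_train), len(X_test))
print(train_mse, test_mse)
```

A training error that is much lower than the test error would be a sign of overfitting.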

Hyperparameter tuning and model selection

Evaluating a model is easy: just use the test set.
To choose between two models, you can train both and compare how well they generalize using the test set.
Now suppose that the linear model generalizes better, but you want to apply some regularization to avoid overfitting. One option is to train 100 different models using 100 different values for the regularization hyperparameter. Suppose you find the hyperparameter value that produces the model with the lowest generalization error, say just 5%, and you launch this model into production.
Unfortunately, it may then produce much larger errors than expected. The problem is that you measured the generalization error multiple times on the test set, and you adapted the model and hyperparameters to produce the best model for that particular set. This means the model is unlikely to perform as well on new data.
A common solution to this problem is called holdout validation: you simply hold out part of the training set to evaluate several candidate models and select the best one. The new held-out set is called the validation set, or sometimes the development set. More specifically, you train multiple models with various hyperparameters on the reduced training set (i.e., the full training set minus the validation set), and you select the model that performs best on the validation set. After this holdout validation process, you train the best model on the full training set (including the validation set), and this gives you the final model. Lastly, you evaluate this final model on the test set to get an estimate of the generalization error.
An alternative is cross-validation, which uses many small validation sets. Each model is evaluated once per validation set after being trained on the rest of the data. Averaging all the evaluations of a model gives a more accurate measure of its performance. The drawback is that the training time is multiplied by the number of validation sets.
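
The holdout validation procedure, followed by the cross-validation alternative, can be sketched as below. The data is synthetic, ridge regression stands in for "a regularized linear model" (an assumption), and the candidate hyperparameter values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Hypothetical data set: 5 features, only some of which matter
X = rng.normal(size=(300, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=300)

# 1. Set the test set aside, then carve a validation set out of the training set
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=0)

# 2. Train one candidate per hyperparameter value on the reduced training set
#    and keep the value that scores best (R^2) on the validation set
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
val_scores = {alpha: Ridge(alpha=alpha).fit(X_train, y_train).score(X_val, y_val)
              for alpha in candidate_alphas}
best_alpha = max(val_scores, key=val_scores.get)

# 3. Retrain the winner on the full training set, then evaluate once on the test set
final_model = Ridge(alpha=best_alpha).fit(X_train_full, y_train_full)
test_score = final_model.score(X_test, y_test)
print(best_alpha, test_score)

# Cross-validation alternative: 5 validation folds instead of a single holdout set
cv_scores = cross_val_score(Ridge(alpha=best_alpha), X_train_full, y_train_full, cv=5)
print(cv_scores.mean())
```

The test set is touched exactly once, at the very end, so its score remains an honest estimate of the generalization performance.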

Data mismatch

In some cases, it is easy to obtain large amounts of training data, but this data may not fully represent the data that will be used in a production environment.
The validation and test sets must be equally representative of the data used in the production environment.

Reposted from: blog.csdn.net/weixin_45867259/article/details/132417610