"Machine Learning in Practice: Based on Scikit-Learn, Keras and TensorFlow Version 2" - Study Notes (1)

Chapter 1 Overview of Machine Learning

· Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron (O'Reilly). Copyright 2019 Aurélien Géron, 978-1-492-03264-9. · Learning time:
2022.03.28 ~2022.03.29

Note: Scikit-Learn is one of the SciKits. "SciKit" stands for SciPy Toolkit; the name comes from the SciPy library, on which the SciKits are built. Besides Scikit-Learn there are many other SciKit modules; Scikit-Learn is the one focused on machine learning and data mining. There is basically no code involved in this chapter, and its structure is as follows:

19 questions

Before everything starts, you can look at the following nineteen questions. If you can answer them, you can skip this chapter: (see the answer at the end)

1. How to define machine learning?

2. In which types of problems does machine learning shine? Can you name four of them?

3. What is a labeled training dataset?

4. What are the two most common supervised learning tasks?

5. Can you name four common unsupervised learning tasks?

6. To get a robot to walk on various unknown terrains, what type of machine learning algorithm would you use?

7. To divide customers into groups, what type of algorithm would you use?

8. Would you frame the problem of spam detection as supervised or unsupervised?

9. What is an online learning system?

10. What is out-of-core learning?

11. What types of learning algorithms rely on similarity to make predictions?

12. What is the difference between model parameters and hyperparameters of a learning algorithm?

13. What do model-based learning algorithms search for? What are the strategies they use most often? How do they make predictions?

14. Can you give four major challenges in machine learning?

15. What if the model performs well on the training data, but generalizes poorly when applied to new instances? Can you give three possible solutions?

16. What is a test set and why is it used?

17. What is the purpose of the validation set?

18. What is a train-dev set, when do you need it, and how do you use it?

19. What will go wrong if you use the test set to tune hyperparameters?

The answer is at the end.

1.1 What is machine learning

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.

A more engineering-oriented definition: a computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.

1.2 Why use machine learning

Another strength of machine learning is its ability to deal with problems that are too complex for traditional methods or for which there are no known algorithms.

Machine learning applies to:

  • Problems that have solutions (but the solutions require a lot of human fine-tuning or follow a lot of rules): Machine learning algorithms can often simplify the code and have better performance than traditional methods.

  • Complex problems difficult to solve with traditional methods: The best machine learning techniques may have a solution.

  • The environment fluctuates: Machine learning algorithms can adapt to new data.

  • Gain insight into complex problems and large volumes of data.

1.4 Types of Machine Learning Systems

Whether or not they are trained with human supervision (supervised, unsupervised, semi-supervised, and reinforcement learning).

Whether or not they can learn incrementally on the fly (online learning versus batch learning).

Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much as scientists do (instance-based learning versus model-based learning).

These criteria are not mutually exclusive; you can combine them in any way you like.

For example, today's state-of-the-art spam filters may use deep neural network models trained on spam and regular emails to complete dynamic learning. This makes it an online, model-based supervised learning system.

supervised learning

In supervised learning, the training set you feed to the algorithm includes the desired solutions, called labels.

Classification task is a typical supervised learning task.

A good example of this is a spam filter: train it on many example emails along with their class (spam or ham), and it learns how to classify new emails.

Another typical task is to predict a target value (such as the price of a car) given a set of features called predictors (mileage, age, make, etc.). This type of task is called regression. To train such a system, a large number of examples of cars is required, including their predictors and labels (i.e. prices).

In machine learning, an attribute is a data type (e.g., "mileage"), whereas a feature can have several meanings depending on the context, but typically means an attribute plus its value (e.g., "mileage = 15,000"). Even so, many people use the terms attribute and feature interchangeably.

Here are some of the most important supervised learning algorithms (covered in this book):

  • k-nearest neighbor algorithm

  • linear regression

  • logistic regression

  • Support Vector Machines (SVMs)

  • Decision Trees and Random Forests

  • Neural Networks
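These estimators are all used the same way in Scikit-Learn, through the common fit/predict interface. Below is a minimal sketch (not from the book, on a tiny made-up dataset) that trains two of the listed algorithms side by side:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two features per instance, binary labels
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 1, 1])

for model in (LogisticRegression(), KNeighborsClassifier(n_neighbors=3)):
    model.fit(X, y)                     # train on the labeled examples
    print(model.predict([[2.5, 2.5]]))  # predict the class of a new instance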

unsupervised learning

The training data for unsupervised learning is unlabeled. The system learns without a "teacher"

Here are some of the most important unsupervised learning algorithms (mostly covered in Chapters 8 and 9):

  • Clustering
    • k-means
    • DBSCAN
    • Hierarchical Cluster Analysis (HCA)

  • Anomaly detection and novelty detection
    • One-class SVM
    • Isolation Forest

  • Visualization and dimensionality reduction
    • Principal Component Analysis (PCA)
    • Kernel PCA
    • Locally Linear Embedding (LLE)
    • t-Distributed Stochastic Neighbor Embedding (t-SNE)

  • Association rule learning
    • Apriori
    • Eclat

A related task is dimensionality reduction, which aims to simplify data without losing too much information. One approach is to combine multiple related features into one.

For example, the mileage of a car has a strong correlation with its age, so a dimensionality reduction algorithm will combine them into one feature representing the wear and tear of the car. This process is called feature extraction

It is usually good practice to use a dimensionality reduction algorithm to reduce the dimensions of the training data before feeding it to another machine learning algorithm (such as a supervised learning algorithm). This will make it run faster, take up less disk space and memory for the data, and in some cases, perform better
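As a minimal sketch (random data chosen here purely for illustration), a dimensionality reduction step such as PCA can be chained in front of a supervised learner:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 50))            # 100 instances with 50 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # labels that depend on only two of them

pipe = make_pipeline(PCA(n_components=10), LogisticRegression())  # 50 features -> 10 components
pipe.fit(X, y)
print(pipe.score(X, y))                   # training accuracy on the reduced representation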

Another important unsupervised task is anomaly detection

For example, detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a data set before feeding it to another machine learning algorithm.

The system is trained with normal examples, and then when it sees a new instance, it can tell whether the new instance looks normal or abnormal

A very similar task is novelty detection. Its purpose is to detect new instances that look different from all instances in the training set.

This requires a very "clean" training set, without any instances that you want the algorithm to detect.

For example, if you have thousands of pictures of dogs and 1% of them are chihuahuas, then a novelty detection algorithm should not consider new pictures of chihuahuas as novel.

On the other hand, an anomaly detection algorithm might consider these dogs to be very rare, unlike other dogs, and might classify them as anomalous (no disrespect to chihuahuas)
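Here is a minimal sketch (with made-up numbers) of anomaly detection using Isolation Forest, one of the algorithms listed above: the detector is fitted on mostly normal data and then labels new instances as normal (+1) or anomalous (-1):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))          # mostly "normal" training points

detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(X_train)

X_new = np.array([[0.1, -0.2],   # close to the training distribution
                  [8.0, 9.0]])   # far away from it
print(detector.predict(X_new))   # e.g. [ 1 -1 ]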

Finally, another common unsupervised task is association rule learning, which aims to mine large amounts of data and discover interesting connections between attributes.

For example, suppose you run a supermarket and, after running an association rule on the sales log, find that people who buy BBQ sauce and chips also tend to buy steak. Then, you might place these items closer together

semi-supervised learning

Since labeling data is usually very time-consuming and expensive, you tend to have a lot of unlabeled data and very little labeled data.

Some algorithms can work with partially labeled data. This is called semi-supervised learning

Some photo hosting services, such as Google Photos, are good examples. Once you've uploaded all your family photos to the server, it will automatically recognize that person A is in photos 1, 5, and 11, and person B is in photos 2, 5, and 7. This is the unsupervised part of the algorithm (clustering). Now all the system needs of you is to tell it who these people are. After giving each person a tag, it can name each person in each photo, which is very important for searching images.

Most semi-supervised learning algorithms are a combination of unsupervised and supervised algorithms.

For example, Deep Belief Networks (DBNs) are based on a stack of unsupervised components called Restricted Boltzmann Machines (RBMs). Restricted Boltzmann machines are trained in an unsupervised manner, and then the entire system is fine-tuned using supervised learning techniques.

reinforcement learning

Reinforcement learning is a very different beast.

Its learning system (called an agent in its context) observes the environment, makes choices, performs actions, and is rewarded (or punished in the form of negative rewards).

So it has to learn on its own what is the best policy to get the biggest reward over time. A policy represents the action an agent should choose in a particular situation.

For example, many robots learn how to walk through reinforcement learning algorithms. DeepMind's AlphaGo program is also a good example of reinforcement learning. AlphaGo rose to fame in May 2017 when it defeated world champion Ke Jie at the game of Go. It learned its winning policy by analyzing millions of games and then playing many games against itself. Note that learning was turned off during the match against the champion; AlphaGo was simply applying the policy it had learned.

Another criterion for classifying machine learning systems is whether they can learn incrementally from incoming data streams.

batch learning

In batch learning, the system cannot learn incrementally—that is, it must use all available data for training. This requires a lot of time and computing resources, so it is usually done offline.

Offline learning is when the system is first trained and then put into production, at which point the learning process stops and it just applies what it has learned.

If you want a batch learning system to learn about new data (e.g., a new type of spam), you need to train a new version of the system from scratch on the full dataset (new data as well as old), then stop the old system and replace it with the new one. Fortunately, the entire process of training, evaluating, and launching a machine learning system can be automated fairly easily, so even a batch learning system can adapt to change.

The data just needs to be constantly updated and new versions of the system trained as often as needed.

Additionally, training with the full dataset is computationally expensive (CPU, memory space, disk space, disk I/O, network I/O, etc.).

If you have a lot of data and you automate your system to train from scratch every day, it will end up costing you a lot of money. And if the amount of data is truly huge, it may even be impossible to use a batch learning algorithm at all.

Online Learning

In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly as it arrives.

For such systems - which need to receive a continuous stream of data (such as stock prices) while reacting quickly or autonomously to changes in the data stream - using online learning is a very good way to go.

Online learning is also a good option if you have limited computing resources: once a new data instance has been learned by the online learning system, it is no longer needed and you can discard it (unless you want to roll back to a previous state, and "relearn" the data), which can save a lot of space.

For very large datasets—data beyond the main memory of a single computer—online learning algorithms are also suitable (this is called out-of-core learning). The algorithm only loads part of the data at a time, trains on this part of the data, and then repeats this process until all the data is trained.

Out-of-core learning is usually done offline (that is, not on a live system), so the name online learning is misleading. We can think of it as incremental learning.

An important parameter of an online learning system is how quickly it adapts to changing data, this is known as the learning rate.

If you set a high learning rate, the system will rapidly adapt to new data, but it will also quickly forget the old data (you don't want a spam filter to flag only the latest kinds of spam it was shown). Conversely, if you set a low learning rate, the system will have more inertia: it will learn more slowly, but it will also be less sensitive to noise in the new data or to sequences of non-representative data points (outliers).
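As a minimal sketch (on synthetic data, not from the book), Scikit-Learn's SGDClassifier supports this kind of incremental learning through partial_fit, which is also how out-of-core learning is typically done: each mini-batch could just as well be a chunk read from disk.

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()            # the learning_rate / eta0 parameters control the learning rate
classes = np.array([0, 1])         # all classes must be declared for partial_fit

for _ in range(100):               # pretend each iteration is a mini-batch streamed in
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.predict(rng.normal(size=(3, 5))))   # predictions for three new instances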

A major challenge with online learning is that the performance of the system will gradually degrade if it is fed with bad data.

Now some customers of real-time systems may have noticed this phenomenon. The source of bad data could be a malfunctioning sensor on a machine, or someone spamming a search engine maliciously to improve search results rankings, etc. To reduce this risk, you need to monitor the system closely and interrupt learning (and possibly revert to previous working state) in a timely manner as soon as performance degradation is detected.

Of course, at the same time you also need to monitor the input data and respond to abnormal data (for example, using an anomaly detection algorithm).

Another way to classify machine learning systems is by how well they generalize.

Most machine learning tasks are about making predictions. This means that given the training examples, the system needs to make predictions (generalize) on examples it has not seen before.

Achieving a good performance measure on the training data is necessary but not sufficient; the true goal is to perform well on new instances.

There are two main approaches to generalization: instance-based learning and model-based learning.

instance-based learning

Possibly the most trivial form of learning is simply to learn by heart (rote memorization).

If you create a spam filter this way, it would only flag emails that are identical to emails already marked as spam by users.

It's not the worst solution, but it's certainly not the best either.

Instead of requiring an exact match, you could program the system to also flag emails that are very similar to known spam.

A similarity measure between two emails is needed here. A (basic) measure of similarity is to count the number of words that are the same between them.

If a new message has many words in common with a known spam message, the system can mark it as spam.

This is known as instance-based learning: the system learns these examples by heart, and then generalizes to new instances by using a similarity measure to compare them with already learned instances (or a subset of them).
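A minimal sketch of that word-overlap similarity measure (toy strings rather than a real spam corpus, and a hypothetical threshold of 2 shared words):

known_spam = ["win a free prize now", "cheap meds for sale now"]

def similarity(a: str, b: str) -> int:
    """Count the words the two messages have in common."""
    return len(set(a.split()) & set(b.split()))

def looks_like_spam(email: str, threshold: int = 2) -> bool:
    # Instance-based: compare the new email with every stored spam instance
    return max(similarity(email, s) for s in known_spam) >= threshold

print(looks_like_spam("claim your free prize now"))  # True  (shares "free", "prize", "now")
print(looks_like_spam("see you at lunch tomorrow"))  # False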

model-based learning

Another way to achieve generalization from a set of examples is to build a model of those examples and then use that model to make predictions. This is called model-based learning

For example, suppose you want to know whether money makes people happy. You can download the "happiness index" (Better Life Index) data from the OECD website and per-capita GDP statistics from the International Monetary Fund (IMF) website, join the two tables, and sort by GDP per capita; you will get a summary like Table 1-1.

There does seem to be a trend here! Although the data is noisy (i.e., partly random), life satisfaction appears to go up more or less linearly as a country's GDP per capita increases. So you can model life satisfaction as a linear function of GDP per capita. This step is called model selection: you select a linear model of life satisfaction with just one attribute, GDP per capita (Equation 1-1):

life_satisfaction = θ0 + θ1 × GDP_per_capita

This model has two model parameters, θ0 and θ1. By tweaking these parameters, the model can represent any linear function, as shown in Figure 1-18.

Before you can use the model, you need to define the values of θ0 and θ1. How can you know which values will make the model perform best? To answer that, you first need to decide how to measure the model's performance: either define a utility function (or fitness function) that measures how good the model is, or define a cost function that measures how bad it is. For linear regression problems, the typical choice is a cost function that measures the distance between the linear model's predictions and the training examples, with the objective of minimizing this distance.

This is where the linear regression algorithm comes in: you feed it your training examples, and it finds the parameters that make the linear model fit your data best. This is called training the model. In this case, the algorithm finds the optimal parameter values θ0 = 4.85 and θ1 = 4.91 × 10⁻⁵.

Confusingly, the word "model" can refer to a type of model (e.g., linear regression), to a fully specified model architecture (e.g., linear regression with one input and one output), or to the final trained model ready to be used for predictions (e.g., linear regression with one input and one output, with θ0 = 4.85 and θ1 = 4.91 × 10⁻⁵).

Model selection involves choosing the type of model and fully specifying its architecture. Training a model means running an algorithm that finds the parameters of the model that best fit the training data (and hopefully make good predictions on new data).

The following is my attempt at this part. It originally failed with KeyError: 'INEQUALITY'; the likely cause is that the two read_csv calls loaded each other's file (the OECD BLI data was read from gdp_per_capita.csv and vice versa). The version below swaps the paths back, which should resolve the error.

# Dataset files (local copy of the handson-ml2 repository):
# D:\Py-project\Python Learning\Hands-On Machine Learning\handson-ml2-master\datasets\lifesat\oecd_bli_2015.csv
# D:\Py-project\Python Learning\Hands-On Machine Learning\handson-ml2-master\datasets\lifesat\gdp_per_capita.csv

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model


def prepare_country_stats(oecd_bli, gdp_per_capita):
    # Keep only the aggregate ("TOT") rows and pivot to one row per country
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"] == "TOT"]
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)
    gdp_per_capita.set_index("Country", inplace=True)
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    full_country_stats.sort_values(by="GDP per capita", inplace=True)
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = list(set(range(36)) - set(remove_indices))
    return full_country_stats[["GDP per capita", 'Life satisfaction']].iloc[keep_indices]

# Load the data (raw strings so the backslashes in the Windows paths stay literal;
# note that each file now goes to the matching variable)
oecd_bli = pd.read_csv(r"D:\Py-project\Python Learning\Hands-On Machine Learning\handson-ml2-master\datasets\lifesat\oecd_bli_2015.csv",
                       thousands=',')
gdp_per_capita = pd.read_csv(r"D:\Py-project\Python Learning\Hands-On Machine Learning\handson-ml2-master\datasets\lifesat\gdp_per_capita.csv",
                             thousands=',', delimiter='\t', encoding='latin1', na_values="n/a")
# Prepare the data
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
x = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]
# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()
# Select a linear model
model = sklearn.linear_model.LinearRegression()
# Train the model
model.fit(x, y)
# Make a prediction for Cyprus
x_new = [[22587]]  # Cyprus's GDP per capita
print(model.predict(x_new))  # outputs [[ 5.96242338]]

In short, a typical machine learning project looks like this:

  • Study the data.

  • Select a model.

  • Train it on the training data (i.e., the learning algorithm searches for the model parameter values that minimize the cost function).

  • Finally, apply the model to make predictions on new cases (this is called inference), hoping that the model generalizes well.

1.5 Major Challenges of Machine Learning

Since your main task is to choose a learning algorithm and train it on some data, the two most likely problems are "bad algorithm" and "bad data". Let's start with bad data.

1.5.1 Insufficient amount of training data

To teach a toddler what an apple is, all you need to do is point to an apple and say "apple" (possibly repeating the procedure a few times), and the child will be able to recognize apples of all sorts of colors and shapes. Genius!

Machine learning is not quite there yet; most machine learning algorithms need a lot of data to work properly. Even for very simple problems you will typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples (unless you can reuse parts of an existing model).

For complex problems, data is more important than algorithms

This idea was further promoted by Peter Norvig et al., who published the paper "The Unreasonable Effectiveness of Data" in 2009.

However, it should be pointed out that small and medium-sized datasets are still very common, and obtaining additional training data is not always easy or cheap, so don't abandon algorithms just yet.

1.5.2 Training data is not representative

Whether you use instance-based learning or model-based learning, in order to generalize well it is crucial that your training data be representative of the new cases you want to generalize to. That is easier said than done:

If the sample is too small, you will have sampling noise (i.e., non-representative data selected by chance);

And even a very large sample can be non-representative if the sampling method is flawed, which is called sampling bias.

1.5.3 Low quality data

Clearly, if the training set is full of errors, outliers, and noise (e.g., data produced by low-quality measurements), the system will have a harder time detecting underlying patterns and be less likely to perform well.

So taking the time to clean the training data is well worth the investment. In fact, most data scientists spend a significant portion of their time doing this work.

If some instances are clearly anomalous, it can be helpful to simply discard them, or try to fix the error manually.

If some instances are missing a few features (for example, 5% of your customers did not specify their age), you must decide whether to ignore this attribute altogether, ignore these instances, fill in the missing values (for example, with the median age), or train one model with the feature and one model without it.
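Below is a minimal sketch (with made-up ages) of the "fill in the median" option, using Scikit-Learn's SimpleImputer:

import numpy as np
from sklearn.impute import SimpleImputer

ages = np.array([[25.0], [32.0], [np.nan], [47.0], [np.nan], [51.0]])  # two customers did not give an age

imputer = SimpleImputer(strategy="median")
print(imputer.fit_transform(ages))   # the NaNs are replaced by the median age (39.5)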

1.5.4 Irrelevant features

As we always say: garbage in, garbage out. Only when the training data contains enough relevant features and less irrelevant features can the system be able to complete the learning.

A key part of a successful machine learning project is extracting a good set of features for training. This process is called feature engineering and includes the following:

  • Feature selection (choosing the most useful features for training from existing features).

  • Feature extraction (combining existing features to produce more useful features - as mentioned earlier, dimensionality reduction algorithms can help).

  • Create new features by collecting new data. Now that we've seen quite a few examples of "bad data", let's look at a few examples of "bad algorithms".

1.5.5 Overfitting training data

Suppose you are visiting a foreign country and a taxi driver rips you off; you might be tempted to say that all taxi drivers in that country are thieves.

Overgeneralizing is something we humans do all too often, and unfortunately machines can fall into the same trap if we're not careful.

In machine learning, this is called overfitting, when the model performs well on the training data but fails to generalize well.

Complex models such as deep neural networks can detect subtle patterns in the data, but if the training set is noisy, or if it is too small (which introduces sampling noise), the model is likely to detect patterns in the noise itself. Obviously, these patterns will not generalize to new instances.

For example, say you feed your life satisfaction model many more attributes, including uninformative ones such as the country's name. In that case, a complex model might detect patterns like the fact that, in the training data, all countries with a "w" in their name have a life satisfaction greater than 7: New Zealand (7.3), Norway (7.4), Sweden (7.2), and Switzerland (7.5). How confident would you be that this "w-satisfaction" rule generalizes to Rwanda or Zimbabwe?

Clearly, this pattern in the training data is just by chance, but the model cannot tell whether the pattern is real or the result of noise. Overfitting occurs when the model is too complex for the amount and noise of the training data. Possible solutions are as follows.

  • Simplify the model: You can choose a model with fewer parameters (for example, choose a linear model instead of a high-order polynomial model), reduce the number of attributes in the training data, or constrain the model.

    • Constraining a model to make it simpler and reduce the risk of overfitting is called regularization (a minimal sketch follows this list).
      • The amount of regularization to apply can be controlled by a hyperparameter. A hyperparameter is a parameter of the learning algorithm (not of the model), so it is not affected by the learning algorithm itself; it must be set before training and remains constant during training. If you set the regularization hyperparameter to a very large value, you will get an almost flat model (a slope close to zero).
      • The learning algorithm will then almost certainly not overfit the training data, but it is less likely to find a good solution. Tuning hyperparameters is an important part of building a machine learning system.
  • Collect more training data.

  • Reduce noise in training data (for example, fix data errors and remove outliers).
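Here is the promised sketch of regularization (synthetic noisy data, not the book's example): a Ridge model with a large alpha, the regularization hyperparameter, is constrained to be much "flatter" than an unregularized linear model.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(20, 1))
y = 0.5 * X[:, 0] + rng.normal(scale=2.0, size=20)   # a noisy linear relationship

plain = LinearRegression().fit(X, y)
regularized = Ridge(alpha=100.0).fit(X, y)           # strong regularization

print(plain.coef_, regularized.coef_)                # the regularized slope is pulled toward zero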

1.5.6 Underfitting training data

Underfitting is the opposite of overfitting. It usually occurs because your model is too simple to learn the underlying structure of the data. For example, a linear model of life satisfaction is prone to underfitting.

The reality is far more complex than the model, so the predictions produced by the model are bound to be inaccurate even for the examples used for training. The main ways to solve this problem are:

  • Choose a more powerful model with more parameters.

  • Feed learning algorithms with better feature sets (feature engineering).

  • Reduce the constraints on the model (e.g., reduce the regularization hyperparameter).

There is one last important topic to cover: once a model is trained, you can't just "hope" it will generalize correctly to new scenarios, you also need to evaluate it and make some adjustments if necessary.

Now let's see how to do this.

1.6 Testing and verification

The only way to know how well a model will generalize to new scenarios is to let the model actually handle new scenarios. One way to do this is to deploy it in production and monitor its output.

This is fine, but if the model is really bad, your users will complain, so it's clearly not the best approach. A better option is to split the data into two parts: training set and test set.

You can train the model with data from the training set and test the model with data from the test set. The error rate for new scenarios is called generalization error (or out-of-sample error), and you can get an estimate of this error by evaluating your model on the test set. This estimate can tell you how well the model handles new scenarios.

If the training error is low (the model rarely makes mistakes on the training set), but the generalization error is high, then your model is overfitting the training data.

Usually 80% of the data is used for training and 20% is kept for testing. However, this depends on the size of the dataset.

If your dataset contains 10 million instances, holding out 1% means your test set will contain 100,000 instances, probably more than enough to get a good estimate of the generalization error.
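A minimal sketch (with placeholder data) of the 80/20 split, using Scikit-Learn's train_test_split:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)       # 1,000 placeholder instances
y = (X[:, 0] % 2 == 0).astype(int)       # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))         # 800 200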

In a famous paper in 1996, David Wolpert showed that if you make absolutely no assumptions about the data, then there is no reason to prefer one model over another, known as the No Free Lunch (NFL) theorem. For some datasets, the best model is a linear model, while for others, the best model may be a neural network model. There is no a priori model that is guaranteed to work better (hence the name of the theorem).

19 Questions - Answers

1. Machine learning is about building systems that can learn from data. Learning means getting better and better at certain tasks under a certain performance metric.

2. Machine learning is very suitable for complex problems that do not have an algorithmic answer. It can replace a series of rules that need to be manually adjusted to build systems that adapt to changing environments and ultimately help humans (for example, data mining).

3. A labeled training set is a training set that contains the desired solution (also known as a label) for each instance.

4. The two most common supervised tasks are regression and classification.

5. Common unsupervised tasks include clustering, visualization, dimensionality reduction, and association rule learning.

6. If we want the robot to learn to walk in various unknown terrains, reinforcement learning may perform best, because this is usually a typical problem to be solved by reinforcement learning.

It is also possible to formulate reinforcement learning problems as supervised or semi-supervised learning problems, but this is not a very natural idea.

7. If you don't know how to define groups, you can use a clustering algorithm (unsupervised learning) to divide customers into clusters of similar customers.

However, if you know which groups you want to have, you can feed many instances of each group to a classification algorithm (supervised learning) and classify all customers into those groups.

8. Spam detection is a typical supervised learning problem: feed the algorithm many emails and their labels (spam or not spam).

9. In contrast to batch learning systems, online learning systems can learn incrementally. This makes them capable of adapting rapidly both to changing data and to autonomous systems, and of training on very large quantities of data.

10. Out-of-core algorithms can handle large amounts of data that cannot fit in the computer's main memory. Out-of-core learning algorithms divide data into mini-batches and learn from these mini-batches of data using online learning techniques.

11. Instance-based learning systems strive to learn the training data by rote memorization. Then, when given a new instance, it uses the similarity measure to find the most similar instances and utilizes them to make predictions.

12. A model has one or more model parameters that determine what the model will predict given a new instance (e.g., the slope of a linear model).

A learning algorithm tries to find optimal values for these parameters so that the model generalizes well to new instances. Hyperparameters are parameters of the learning algorithm itself, not of the model (e.g., the amount of regularization to apply).

13. Model-based learning algorithms search for optimal values of the model parameters such that the model generalizes well to new instances. We usually train such systems by minimizing a cost function that measures how bad the system is at making predictions on the training data, plus a penalty for model complexity if the model is regularized.

To make predictions, we feed the new instance's features into the model's prediction function, using the parameter values found by the learning algorithm.

14. Some of the main challenges in machine learning are lack of data, poor data quality, non-representative data, uninformative features, excessively simple models that underfit the training data, and excessively complex models that overfit the data.

15. If a model performs well on the training data, but generalizes poorly on new instances, the model may be overfitting the training data (or we got very lucky on the training data).

Possible solutions to overfitting are getting more data, simplifying the model (choosing a simpler algorithm, reducing the number of parameters or features used, or regularizing the model) or reducing the noise in the training data.

16. The test data set is used to estimate the generalization error of the model on new instances before launching the production environment.

17. The validation set is used to compare the models. This allows selection of the best model and tuning of hyperparameters.

18. The train-dev set can be used when there is a mismatch between the training dataset and the data used in the validation and test datasets (this dataset should always be as close as possible to what the model will use after it goes into production)

The train-dev set is a part of the training set that is held out (the model is not trained on it). The model is trained on the rest of the training set and evaluated on both the train-dev set and the validation set.

If the model performs well on the training set but poorly on the train-dev set, it is probably overfitting the training set. If it performs well on both the training set and the train-dev set but poorly on the validation set, then there is probably a significant data mismatch between the training data and the validation/test data, and you should try to improve the training data to make it look more like the validation and test data.

19. If you use the test set to tune hyperparameters, you may overfit the test set and the measured generalization error will be too optimistic (you may get a model that performs worse than expected).
