Python Machine Learning Guide from Scratch (1) - Introduction to Basic Concepts

Column overview

This column shares what the blogger learned while getting started with machine learning. I hope you can take something away from my experience. There may be inaccuracies and omissions; please bear with me!

In preparing this column, the blogger assumes that readers have some Python/programming background. The column also assumes some basic knowledge of machine learning, such as the definitions of common terms.

What is Machine Learning (ML)?

Here is the definition given by Tom Mitchell, one of the pioneers of the field, in his textbook Machine Learning:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Simply put, machine learning is concerned with designing and developing algorithms that can continuously improve themselves with experience. Every machine learning (hereinafter referred to as ML) algorithm consists of at least three parts:

  1. Task. The task is the goal the algorithm is meant to achieve. For example, the task of a face recognition algorithm is to identify and classify human facial features from visual signals accurately and quickly.
  2. Performance. Performance is some measure of how well the algorithm carries out its task. For classification tasks, accuracy is a common choice, and it is typically tracked during training to judge how well the model has learned.
  3. Experience. Experience is whatever the algorithm can learn from. For simple algorithms, an externally provided static dataset can serve as the experience. For more complex algorithms (such as reinforcement learning), the experience is generated and collected by the agent itself.
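
To make T, P, and E concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made up for illustration: the toy task (deciding whether a number is "large"), the hand-picked examples, and the helper names `learn_threshold` and `accuracy`. It only shows that the performance measure P can go up as the learner sees more experience E.

```python
# Task T: decide whether a number is "large" (label 1) or "small" (label 0).
# The true rule (x >= 5.0) is hidden from the learner.

def learn_threshold(experience):
    """Learn a decision threshold from labeled (x, label) pairs."""
    largest_small = max(x for x, label in experience if label == 0)
    smallest_large = min(x for x, label in experience if label == 1)
    # Place the boundary halfway between the two classes seen so far.
    return (largest_small + smallest_large) / 2


def accuracy(threshold, test_set):
    """Performance P: fraction of test examples classified correctly."""
    correct = sum((x >= threshold) == bool(label) for x, label in test_set)
    return correct / len(test_set)


# Experience E: hand-picked labeled examples (hypothetical data).
experience = [(1.0, 0), (6.0, 1), (3.0, 0), (8.0, 1), (4.5, 0), (5.5, 1)]

# A fixed test set: 0.25, 0.75, ..., 9.75, labeled with the hidden rule.
test_set = [(0.25 + 0.5 * i, int(0.25 + 0.5 * i >= 5.0)) for i in range(20)]

# Measure P after seeing the first 2, 4, and 6 examples.
scores = [accuracy(learn_threshold(experience[:n]), test_set) for n in (2, 4, 6)]
print(scores)  # [0.85, 0.95, 1.0]
```

With this particular data the score rises each time more examples are added. That is not guaranteed in general; more experience usually helps, but the improvement is rarely this tidy.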

What sets ML apart from traditional areas of computing is that an ML algorithm produces a black-box model. When we write code ourselves, we usually write a white-box model: we apply a model/algorithm that we wrote to the input data to produce the output. With ML, by contrast, we supply both input and output data, and the algorithm generates a model that fits that data. In this way, tasks whose rules humans cannot easily spell out (such as face recognition) can be learned by machines from the experience contained in the data.
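
The contrast between writing the rule yourself and letting the program derive it from data can be sketched as follows. This is only an illustration under stated assumptions: the temperature-conversion example, the data, and the helper `fit_line` are not from this column, and the learned model here is a simple straight line, so it is far easier to inspect than the black-box models used for problems like face recognition.

```python
# White-box approach: a human writes the rule that maps input to output.
def celsius_to_fahrenheit(c):
    return 1.8 * c + 32.0


# Learning approach: only input/output pairs are given, and the program
# works out the rule itself (here: a least-squares straight line).
def fit_line(inputs, outputs):
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(outputs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, outputs))
    variance = sum((x - mean_x) ** 2 for x in inputs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements: the same temperatures on both scales.
celsius = [0.0, 10.0, 20.0, 30.0]
fahrenheit = [32.0, 50.0, 68.0, 86.0]

slope, intercept = fit_line(celsius, fahrenheit)


def learned_model(c):
    """The model generated from data, never told the conversion formula."""
    return slope * c + intercept


print(slope, intercept)
print(celsius_to_fahrenheit(100.0), learned_model(100.0))
```

Both functions give essentially the same answer for 100 degrees Celsius, but only the first one had the formula written into it by hand.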

Machine learning process

Developing a complete ML application generally follows this process:

  1. Problem Definition. In this step we clarify what the task of the ML algorithm is and decide which model, performance metric, and data type are likely to be most effective. This step is critical because we as humans bring inductive bias: we subjectively make choices and assumptions about the nature of the problem. For complex problems, such as self-driving cars, a poor inductive bias (such as assuming that a random forest of depth 3 is the best model) can have serious consequences.
  2. Data Ingestion. In this step we obtain the data used to train the model. There are usually two ways to do this: collect it yourself (such as measuring humidity and temperature every day) or use other people's data (the most common and simplest source is Kaggle). The data generally includes two parts: features and labels. A feature is an input the model takes, such as an image. A label is the expected output of the model, such as the name of the animal in that image.
  3. Data Preparation. We will inevitably encounter some meaningless or incomplete data, and we have to decide whether to delete it or fill it in some other way. We can also do some feature engineering to make the data more informative.
  4. Data Segregation. We have to decide which data is used for training and which is used to evaluate model performance. If a model achieves high accuracy on the training set but low accuracy on the test set, we call it overfitting: the algorithm has learned the peculiarities of the training data too closely, so it performs poorly on unseen data. Conversely, if the model has low accuracy even on the training set (and, as a result, on the test set as well), we call it underfitting: the model has not yet captured the characteristics of the training data.
  5. Model Training. Training usually has to be repeated many times with different hyperparameters to compare the resulting model performance. This step is also relatively time-consuming and computationally intensive.
  6. Candidate Model Evaluation. We need to keep monitoring the performance of the candidate models in order to choose the most appropriate one.
  7. Model Deployment. Package the trained model, embed it in an application, and release it for public testing.
  8. Performance Monitoring. The model can be retrained and improved with data received after release.
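
Steps 4 to 6 above can be sketched in a few lines of plain Python. The dataset, the 75/25 split, and the deliberately bad "memorizer" model are all assumptions invented for this illustration; the point is only to show how comparing training and test accuracy exposes overfitting.

```python
import random
from collections import Counter

# Hypothetical toy dataset: feature x in 0..19, label 1 if x >= 10 else 0.
data = [(x, int(x >= 10)) for x in range(20)]

# Step 4 (data segregation): shuffle, then hold out 25% as the test set.
rng = random.Random(0)  # fixed seed so the split is reproducible
shuffled = data[:]
rng.shuffle(shuffled)
train_set, test_set = shuffled[:15], shuffled[15:]


# Step 5 (training): a deliberately bad "model" that memorizes the training
# set and falls back to the most common training label for unseen inputs.
def train_memorizer(examples):
    table = dict(examples)
    fallback = Counter(label for _, label in examples).most_common(1)[0][0]
    return lambda x: table.get(x, fallback)


# Step 6 (evaluation): accuracy on a set of (x, label) pairs.
def accuracy(model, examples):
    return sum(model(x) == label for x, label in examples) / len(examples)


model = train_memorizer(train_set)
train_acc = accuracy(model, train_set)
test_acc = accuracy(model, test_set)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The memorizer is always perfect on the data it has seen and does much worse on the held-out data, which is the overfitting pattern described in step 4. If only the training accuracy were reported, the problem would be invisible.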

[Figure: ML process]

Often, steps three to six need to be repeated many times to find a more suitable model. Since these steps depend largely on inductive bias, there is no rigorous way to determine the most appropriate model, hyperparameters, and data in advance. The process is therefore nicknamed "alchemy": the trained model is the elixir taken out of the alchemy furnace. You can only know the quality of the elixir (the model's performance) by taking it out and examining it (evaluating performance), and there is no guarantee that the elixir you refine the second time will be better than the first. Some students have even used the seeds of their random number generators as lottery numbers.
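
One reason results differ from run to run is random initialization. The sketch below uses only Python's standard `random` module; `init_weights` is a hypothetical helper, not part of any ML library. It shows that fixing the seed makes the starting point reproducible, while a different seed gives a different starting point.

```python
import random


def init_weights(seed, n=3):
    """Draw n hypothetical initial model weights from a seeded generator."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


run_a = init_weights(seed=42)
run_b = init_weights(seed=42)  # same seed: identical starting point
run_c = init_weights(seed=7)   # different seed: different starting point

print(run_a == run_b, run_a == run_c)  # True False
```

Fixing the seed makes an experiment repeatable; it does not make the resulting model any better.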

Main categories of machine learning

Broadly speaking, ML algorithms can be divided into three categories based on the data and learning methods they require:

  1. Supervised Learning (SL). This type of algorithm requires both features and labels in the data. It is like a teacher helping students solve a problem: after the students answer, the teacher clearly tells them what the correct answer is.
  2. Unsupervised Learning (UL). The data for these algorithms needs only features, not labels. It is like a child playing with building blocks, who places and combines the blocks according to patterns he observes.
  3. Reinforcement Learning (RL). This type of algorithm does not require a pre-collected dataset, but it does require an agent that can explore on its own, an environment, some actions the agent can take, and a reward function. For example, if an earthworm is placed in a maze containing sugar and electric shocks, it will learn through trial and error how to obtain rewards and get through the maze.
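
The three categories above differ mainly in what the learner is given. The following sketch, in plain Python, uses made-up data and helper names (`predict_nearest`, `two_means`, `run_agent`) chosen purely for illustration: a nearest-neighbour lookup for supervised learning, a tiny two-cluster grouping for unsupervised learning, and a greedy agent with optimistic starting estimates on a two-action problem as a very reduced stand-in for reinforcement learning.

```python
# --- Supervised learning: features AND labels are given. ---
# Hypothetical toy data: hours studied -> passed the exam (1) or not (0).
labeled = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]


def predict_nearest(x, examples):
    """1-nearest-neighbour: copy the label of the closest known example."""
    return min(examples, key=lambda ex: abs(ex[0] - x))[1]


sl_prediction = predict_nearest(7.5, labeled)  # closest example is (8.0, 1)

# --- Unsupervised learning: features only, no labels. ---
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]


def two_means(values, rounds=10):
    """Group 1-D values into two clusters (a tiny k-means with k=2)."""
    low, high = min(values), max(values)  # initial cluster centres
    for _ in range(rounds):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return low, high


centres = two_means(points)

# --- Reinforcement learning: agent, environment, actions, reward. ---
ACTIONS = ["left", "right"]


def reward(action):
    """Environment: 'right' leads to sugar (+1), 'left' to a shock (-1)."""
    return 1.0 if action == "right" else -1.0


def run_agent(steps=10, learning_rate=0.5):
    # Optimistic initial estimates make the agent try every action once.
    value = {action: 1.0 for action in ACTIONS}
    total = 0.0
    for _ in range(steps):
        action = max(ACTIONS, key=lambda a: value[a])  # greedy choice
        r = reward(action)
        value[action] += learning_rate * (r - value[action])  # learn from r
        total += r
    return value, total


values, total_reward = run_agent()
print(sl_prediction, centres, values, total_reward)
# 1 (1.5, 8.5) {'left': 0.0, 'right': 1.0} 8.0
```

Note that the agent is never given a dataset: it tries "left" once, receives a shock, lowers its estimate for that action, and chooses "right" from then on.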

Preparation before starting

Having said so much, let's briefly introduce the environment and software this column uses for machine learning.

Anaconda is a convenient Python distribution that bundles a package and environment manager with many commonly used tools. What we mainly use is the Jupyter Notebook that comes with it. Because its code can be run and debugged cell by cell, we can easily debug models and add explanatory notes. For instructions on using Jupyter Notebook, please refer to the tutorial on the official website. The blogger will also attach the corresponding ipynb file to each blog post, so stay tuned!

Conclusion

In the next blog post, the blogger will introduce how to implement linear regression with an ML method, Ordinary Least Squares (OLS). If you have any questions or suggestions, please feel free to comment or send a private message. Coding is not easy. If you like the blogger's content, please like and support!

Origin blog.csdn.net/EricFrenzy/article/details/131297961