Article 1: The Beginning of Machine Learning

Foreword

I have recently been learning machine learning, and I am writing down my experience and study notes here so that I can review them later. I also hope these articles help others who want to learn machine learning. Having studied it for a while, I think the most important thing is to map out a learning route; without one, you will feel lost in a fog while studying. I know this from experience, so in the articles that follow I will do my best to lay out that route.

Preparation before study

In machine learning, whether the input is an image, audio, or text, it is converted into a matrix of numbers, a series of mathematical operations is applied to it, and a result is produced. (Mathematics really is amazing.)

Mathematics | Application in machine learning
Advanced mathematics | If you have not learned advanced mathematics well, you will not understand the formulas used in machine learning. The core of machine learning is turning these formulas into code and applying them to the collected data to compute results.
Linear algebra | Machine learning converts image, audio, and text data into numbers and stores them as matrices.
Probability theory and mathematical statistics | Anyone who knows machine learning knows that its results are probabilistic. With deeper study you will find that most machine learning algorithms are actually predictions based on probability.

Most machine learning study is done in Python, so Python is a language you need to know.

Third-party Python library | Use
NumPy | Package for operations on matrices
pandas | Data is sometimes stored in Excel files; this package reads the Excel data and operates on it
matplotlib | Presents the evaluation results of a machine learning model as graphs
scikit-learn | Package of machine learning algorithms
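
As a rough illustration of how data becomes a matrix of numbers, the sketch below treats a tiny 3x3 grayscale "image" as a NumPy array and applies a few matrix operations to it. The pixel values, the variable names, and the weight vector are all made up for this example.

```python
import numpy as np

# A made-up 3x3 grayscale "image": each entry is a pixel brightness (0-255).
image = np.array([
    [0, 128, 255],
    [64, 128, 192],
    [255, 128, 0],
], dtype=np.float64)

# Scale the pixel values into the range [0, 1].
scaled = image / 255.0

# Flatten the matrix into one feature vector, the shape most models expect.
features = scaled.reshape(-1)

# A simple matrix operation: the dot product with a (made-up) uniform
# weight vector gives the average brightness of the image.
weights = np.full(features.shape, 1.0 / features.size)
mean_brightness = features @ weights

print(image.shape)     # (3, 3)
print(features.shape)  # (9,)
print(mean_brightness)
```

Real images work the same way, only with far more pixels (and usually three color channels).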

The theory of machine learning rests on mathematics, while Python and the third-party libraries above are how machine learning is put into practice. Without the mathematics you can still practice machine learning in Python, but you will not get far down the road.

The general process of machine learning

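Data collection
Data cleaning
Feature engineering
Data modeling
Model selection
Model training
Model evaluation
Parameter tuning
Practical application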

Data collection

Machine learning requires a lot of data, and data is actually hard to obtain. Typically we can use web crawlers to collect the data we need. For everyday practice we can use the MNIST dataset, which I will explain in a dedicated article.
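
For a quick start without downloading anything, a minimal sketch can use the small handwritten-digit dataset bundled with scikit-learn. Note the assumption: `load_digits` is a stand-in chosen for this example (8x8 images), not the MNIST dataset itself (28x28 images).

```python
from sklearn.datasets import load_digits

# load_digits ships with scikit-learn: 1,797 grayscale images of handwritten
# digits, each 8x8 pixels. It is a small stand-in here, NOT the MNIST dataset.
digits = load_digits()

X = digits.data    # shape (1797, 64): one flattened image per row
y = digits.target  # shape (1797,): the digit (0-9) that each image shows

print(X.shape, y.shape)
print(digits.images[0].shape)  # the first image as an 8x8 matrix
```

Each row of `X` is one image flattened into 64 numbers, which is the matrix form described earlier.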

Data cleaning

The data we obtain usually has missing values, duplicate values, and noise. If it cannot be used for training directly, it has to be preprocessed first.
The general process of data cleaning:

Step | Meaning
Remove unique attributes | Attributes such as ID numbers are unique to each sample and do not describe the distribution of the samples themselves.
Impute missing values | When values are missing, use methods such as mean imputation or mode imputation. Other methods are left for you to look up.
Feature encoding | Attributes such as sex or nationality are not numeric and need to be encoded as numbers. A common method is one-of-K encoding, also called one-hot encoding. For reasons of space I will not introduce it here; you can look up the details yourself.
Feature binarization | Convert a numeric attribute into a Boolean one.
Standardization and normalization | Different attributes measure different things, such as length, mass, or time, and their values differ in magnitude. Attributes with large values carry more weight in the model and affect its parameters, correctness, and accuracy. Common methods include min-max (range) standardization and z-score standardization. Normalization transforms the data into the [0,1] interval and can be regarded as a special case of standardization.
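
The cleaning steps above can be sketched with pandas. This is one possible way to do each step, not the only one; the table, the column names, and the income threshold of 5000 are made up for the example.

```python
import pandas as pd

# A made-up table with an ID column, a duplicate row, a missing value,
# and a non-numeric column.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 4],
    "sex": ["male", "female", "female", "male", "male"],
    "height_cm": [180.0, 165.0, None, 170.0, 170.0],
    "income": [3000.0, 5200.0, 4100.0, 8000.0, 8000.0],
})

df = df.drop_duplicates()     # remove repeated rows
df = df.drop(columns=["id"])  # remove the unique ID attribute

# Mean imputation: fill the missing height with the column mean.
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].mean())

# One-hot encoding: one 0/1 column per category.
df = pd.get_dummies(df, columns=["sex"], dtype=int)

# Binarization: turn a numeric attribute into a Boolean one
# (the threshold is arbitrary).
df["high_income"] = df["income"] > 5000

# Min-max scaling into [0, 1], and z-score standardization.
income_range = df["income"].max() - df["income"].min()
df["income_minmax"] = (df["income"] - df["income"].min()) / income_range
df["height_z"] = (df["height_cm"] - df["height_cm"].mean()) / df["height_cm"].std(ddof=0)

print(df)
```

scikit-learn offers equivalent tools (for example in `sklearn.preprocessing`) that are more convenient when the same transformation must later be applied to new data.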

Feature engineering

Feature engineering means extracting more information from the raw data; its purpose is to obtain better training data.
The better the features you choose, the more flexibility you have, the simpler the model you can build, and the better the model performs. There is a saying circulating on the Internet: data and features determine the upper limit of machine learning, and models and algorithms only approach that upper limit. Feature engineering is therefore very important for building a good model.
Feature engineering is mainly divided into three parts:

Part | Meaning
Feature construction | Manually construct new features from the raw data.
Feature extraction | Automatically construct new features, converting the original feature data into a set of features with clear physical or statistical meaning.
Feature selection | Select the most statistically meaningful subset of features from the feature set, so as to reduce dimensionality.
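
The three parts can be illustrated on the iris dataset bundled with scikit-learn. The techniques below (a hand-made "petal area" feature, PCA for extraction, and an ANOVA F-test for selection) are choices made for this sketch; many other methods exist for each part.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 samples, 4 numeric features

# Feature construction: hand-craft a new feature from existing ones.
# Columns 2 and 3 are petal length and petal width; their product is
# a rough "petal area".
petal_area = (X[:, 2] * X[:, 3]).reshape(-1, 1)
X_constructed = np.hstack([X, petal_area])  # now 5 features

# Feature extraction: PCA automatically derives 2 new features
# (linear combinations of the original 4).
X_extracted = PCA(n_components=2).fit_transform(X)

# Feature selection: keep the 2 original features that score highest
# on an ANOVA F-test against the labels.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X_constructed.shape, X_extracted.shape, X_selected.shape)
print(selector.get_support(indices=True))  # indices of the kept columns
```

Construction adds a column, while extraction and selection both reduce the number of columns: extraction by building new ones, selection by keeping a subset of the originals.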

Model selection

Machine learning models fall into the following categories:

Category | Models
Supervised learning | K-Nearest Neighbors, SVM, decision trees, naive Bayes, linear regression, logistic regression, etc.
Unsupervised learning | K-means clustering, dimensionality reduction algorithms, EM, etc.
Reinforcement learning | Markov decision processes, etc.

Model training

There is not much to say about model training: you feed the data into the model algorithm you chose and let the computer train it.
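
In scikit-learn, "feeding the data into the model" usually comes down to a single call to `fit`. A minimal sketch, assuming the bundled digits dataset and a K-Nearest Neighbors classifier with an arbitrarily chosen k of 3:

```python
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Choose a model (K-Nearest Neighbors; k=3 is an arbitrary choice) ...
model = KNeighborsClassifier(n_neighbors=3)
# ... and let it learn from the data.
model.fit(X, y)

# The trained model can now predict labels for samples.
print(model.predict(X[:5]))
print(y[:5])
```

This sketch predicts on samples the model has already seen, which only shows that training worked; judging the model properly requires data it has not seen.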

Model evaluation

Before talking about evaluation, let's talk about how the dataset is divided. Generally, we divide the data into a training set and a test set. As the names imply, the training set is used to train the model, and the test set is used to check the quality of the trained model. Note that the test set must not be drawn from the training set. It is like an exam: the exam paper will not be the same as the practice papers you have already done, because if it were, it would not reflect your true level.
Common ways to divide a dataset are the hold-out method and K-fold cross-validation.
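
Both ways of dividing a dataset are available in scikit-learn. The sketch below uses ten made-up samples; the 30% test size, the five folds, and the random seed are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(20).reshape(10, 2)  # 10 made-up samples with 2 features each
y = np.arange(10)

# Hold-out method: set aside 30% of the samples as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(len(X_train), len(X_test))

# K-fold cross-validation: split the data into 5 folds; each fold serves
# as the test set exactly once while the other 4 are used for training.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    print(fold, train_idx, test_idx)
```

In both cases no sample is in the training part and the test part of the same split.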

For model evaluation, we first need to understand overfitting and underfitting:

Category | Meaning
Overfitting | The trained model has a small error on the training set but a large error on the test set. The model has learned the training data so well that it has also picked up quirks of individual samples that do not hold for samples in general.
Underfitting | The trained model has a large error on both the training set and the test set. It is like practicing so few problems that you do poorly both in everyday practice and in the exam.
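
A small experiment can make the two cases concrete. The sketch below fits polynomials of different degrees to noisy samples of a made-up relationship, y = sin(2πx), and compares training and test error. The data, the noise level, and the degrees 1, 3, and 9 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Noisy samples of a made-up "true" relationship y = sin(2*pi*x).
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with these coefficients.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, errors[degree])  # (training error, test error)
```

The straight line (degree 1) cannot follow the curve, so its error is large on both sets, which is underfitting. Raising the degree always lowers the training error; when the test error stops falling with it or starts to rise, the model is beginning to overfit.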

Next, a term that comes up often in machine learning: generalization.
Generalization: whether the performance observed on the training set carries over to unseen data. If the model performs equally well on both, it can be considered to generalize well.
Put simply: if you get good performance on the training set, does that mean you get good performance overall? That is the question of generalization.

Now we can talk about the metrics used for model evaluation:

Problem type | Evaluation metrics
Regression | Evaluation metrics for regression problems include the squared-error loss function, mean absolute error, etc.
Classification | Classification problems are generally evaluated with a confusion matrix.

Graphical evaluation tools for models: the ROC curve and the PR curve.
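
To make these metrics concrete, here is a minimal NumPy sketch that computes mean squared error, mean absolute error, and a binary confusion matrix by hand. All values and labels are made up; precision and recall are included because they are the two quantities plotted on a PR curve.

```python
import numpy as np

# Regression: made-up true values and predictions.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)   # mean squared error
mae = np.mean(np.abs(y_true - y_pred))  # mean absolute error
print(mse, mae)  # 1.5 1.0

# Binary classification: made-up labels (1 = positive, 0 = negative).
labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])
preds = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Confusion matrix: rows are true classes, columns are predicted classes.
confusion = np.zeros((2, 2), dtype=int)
for t, p in zip(labels, preds):
    confusion[t, p] += 1
print(confusion)  # [[3 1]
                  #  [1 3]]

tn, fp = confusion[0]
fn, tp = confusion[1]
accuracy = (tp + tn) / confusion.sum()
precision = tp / (tp + fp)  # of the predicted positives, how many are right
recall = tp / (tp + fn)     # of the actual positives, how many were found
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```

scikit-learn provides the same calculations in `sklearn.metrics`, for example `mean_squared_error` and `confusion_matrix`.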

Parameter tuning

When training a model, some parameters are generally set in advance. Based on the evaluation of the model, these parameters are then adjusted so that the model performs as well as possible.
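
One common way to automate this adjustment is a grid search: try several parameter values and keep the one that evaluates best. A minimal sketch with scikit-learn, assuming the iris dataset, a K-Nearest Neighbors model, and an arbitrary list of candidate values for k:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Try several values of k; each one is scored by 5-fold cross-validation
# on the training set only.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy of the best model on the test set
```

The test set is kept out of the search, so the final score is measured on data that played no part in choosing the parameter.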

Practical application

Once the model is trained, it can be applied to real-world problems.

Conclusion

I think the data cleaning and feature engineering steps above should be grouped into a single category. For example, after obtaining the data, it makes more sense to handle missing values, remove duplicates, normalize/standardize, one-hot encode, and finally extract features and select the more representative ones for training.

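When I started, I was confident that with what I had learned I could explain the basics in this first chapter on machine learning completely and in detail. But the more I wrote, the more I realized that I could not cover the fundamentals in detail. There is too much of it and it is too complex; like a novel with a grand story, it grew so large that it slipped out of the author's control. The clearest example is the section on model evaluation: it involves a lot of complex material, so I could only give the main points, and you will have to look up the details yourselves.

Although I am not fully satisfied with this opening chapter, I feel I have written out the complete machine learning workflow, and having a clear line of thought is important. Going forward, data collection, data cleaning, and feature engineering will not be our focus; we will concentrate on model selection and training.
I will explain the theory behind these model algorithms, along with worked examples.

This opening article was not written as well as I had hoped. I will learn from that and try to explain the theory of the model algorithms in as much detail as I can in later articles, so that everyone gets something out of them. Finally, writing these articles takes effort, so I would appreciate your support.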

Origin blog.csdn.net/m0_59151709/article/details/129636976