Machine learning practice project

Introduction

For more articles and code details, please visit the blogger's personal website: https://www.iwtmbtly.com/

The dataset and code used below can be downloaded from the "Data Set" link.


Machine learning is a technology that generates rules and discovers models from data to help us predict, judge, analyze and solve problems.

A machine learning project goes through roughly 5 steps from start to finish: defining the problem, collecting and preprocessing data, selecting an algorithm and determining the model, training and fitting the model, and evaluating and optimizing model performance. These 5 steps form a cyclic, iterative process, as the following picture shows:

[Figure: the 5 steps of a machine learning project]

All projects follow these 5 steps, which I refer to as the 5 hands-on steps. To give you a deeper understanding of them and help you get started faster later, I will walk you through a project and explain the purpose and reasoning behind each step. First, let's tackle the first two steps together: defining the problem and preprocessing the data.

Step 1: Define the problem

Let's look at the first step, defining the problem. In the process of defining the problem, we need to analyze the business scenario, set clear goals, and at the same time clarify which type of machine learning the current problem belongs to. If we don't understand these, we won't be able to choose a model later.

So first of all, we have to understand the business scenario of our project. Suppose you have joined the operations department of "Easy Flowers" and are analyzing how well the promotional copy on its WeChat official account performs. You have collected a lot of data on these soft articles (advertorials), including the number of likes, reposts, page views and so on, like this:

[Figure: a sample of the WeChat soft-article dataset]

Once an article on a WeChat official account passes 100,000 views, its exact view count is no longer displayed. In response to this problem, the goal of our project is to build a machine learning model that estimates how many views an article can achieve based on indicators such as the number of likes and reposts.

Since we need to estimate the number of views, in this dataset the number of likes, the number of reposts, the popularity index, and the article rating are the four features, and the number of views is the label. Because the data already contains the labels we want to estimate, this is a supervised learning problem. And because the label is a continuous value, it is a regression problem.

It is not hard to see that in this dataset there is a clear correlation between the features and the label: when an article has many likes and reposts, it usually has more page views. But which specific function describes this correlation? We don't know yet, so our task in this project is to find that function.

Step 2: Collect data and preprocess it

"Data collection and preprocessing" will appear in all machine learning projects, and its role is to provide good fuel for machine learning models. The better the data, the better the model will run. This step seems to be only one sentence, but it actually contains several small steps. In a complete way, there are 6 steps:

  • Data collection;
  • Data visualization;
  • Data cleaning;
  • Feature engineering;
  • Building the feature set and label set;
  • Splitting the training set, validation set and test set.

You may not grasp what these 6 steps mean at first glance. Don't worry, I will explain them one by one using the "Easy Flowers" project.

(1) Data collection

The first step is data collection. For our project I have already done it, and you can download the ready-made dataset here.

In reality, however, collecting data is usually hard work. You often need to add a lot of event tracking in the operational workflow to capture behavioral information such as user purchases and interest preferences, and sometimes you also need to crawl the Internet for data.

With the dataset in hand, the next thing to do is data visualization, that is, observing the data visually to get a feel for which machine learning model might fit.

(2) Data visualization

Data visualization is a versatile skill that can do a lot. For example, it lets you look at possible relationships between features and labels, and it also helps you spot "dirty data" and "outliers". If you want to study data visualization in depth, I recommend the article "Learn Python Data Visualization at a Time".

However, before we start visualizing, we need to import the collected data into our working environment. For data import we use the Pandas data processing toolkit. This package is a powerful tool for manipulating data, and we will use it in every future project. We import it with an import statement, then read this project's dataset into the Python runtime environment with the following code and present it as a DataFrame:

import pandas as pd	# import the Pandas data processing toolkit

df_ads = pd.read_csv('./data/易速鲜花微信软文.csv')	# read in the data
df_ads.head()	# show the first few rows

A DataFrame is a common two-dimensional, tabular data structure in machine learning. In the code above I use the read_csv() API to read the CSV-format data file into a Pandas DataFrame and name it df_ads. This code outputs the following:

[Figure: the first few rows of df_ads]

This completes the data import, and we can now officially start the visualization. Based on experience, we guess that there is most likely a linear relationship between "likes" and "views". Is that really the case? We can draw a plot to check.

In this "verification" link, we need to use two packages: one is the Python drawing tool library "Matplotlib", and the other is the statistical data visualization tool library "Seaborn". These two packages are essential toolkits for Python data visualization. They are part of the default Anaconda installation package and do not require pip installrepeated installation of statements.

We import these two packages with import statements as well. Note that to save code and speed things up, I did not import the complete matplotlib package, only its plotting module pyplot, because a linear relationship can easily be checked with a scatter plot. So let's use matplotlib's plot() API to draw a scatter diagram of "likes" against "views" and see how they are distributed:

# Import the libraries needed for data visualization
import matplotlib.pyplot as plt	# Matplotlib - Python plotting library
import seaborn as sns	# Seaborn - statistical data visualization library

# Method 1
plt.figure(dpi=500)	# set the figure resolution
plt.plot(df_ads['点赞数'], df_ads['浏览量'], 'r.', label='Training data')	# plot with matplotlib
plt.xlabel('点赞数')	# x-axis label (number of likes)
plt.ylabel('浏览量')	# y-axis label (number of views)
plt.legend()	# show the legend
plt.show()	# show the plot

# Method 2
plt.figure(dpi=500)	# set the figure resolution
plt.scatter(df_ads['点赞数'], df_ads['浏览量'], marker='o')
plt.xlabel('点赞数')	# x-axis label (number of likes)
plt.ylabel('浏览量')	# y-axis label (number of views)
plt.show()

The output result is shown in the figure below:

[Figure: scatter plot of likes versus views]

From this plot we can see that the data points are basically concentrated around a line, so there does seem to be a linear relationship between this feature and the label, which gives us a useful reference for choosing a model later.

Next, I'm going to draw a boxplot with Seaborn's boxplot tool to see whether there are any "outliers" in this dataset. I picked the popularity index feature somewhat arbitrarily here; you can also try drawing boxplots for other features.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Draw a boxplot with Seaborn
plt.figure(dpi=500)	# set the figure resolution
data = pd.concat([df_ads['浏览量'], df_ads['热度指数']], axis=1)	# views and popularity index
fig = sns.boxplot(x='热度指数', y='浏览量', data=data)	# draw the boxplot with Seaborn
fig.axis(ymin=0, ymax=800000)	# set the y-axis range

The following figure is the output boxplot:

[Figure: boxplot of views grouped by popularity index]

A boxplot is built from five numbers: the minimum (min), the lower quartile (Q1), the median, the upper quartile (Q3) and the maximum (max). In statistics this is called the five-number summary. These five values clearly show us the distribution and dispersion of the data.
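If you want to check these five numbers for our own data directly, pandas can compute them; a minimal sketch, assuming df_ads has already been loaded as above:

print(df_ads['浏览量'].quantile([0, 0.25, 0.5, 0.75, 1]))	# min, Q1, median, Q3, max of the views column
print(df_ads['浏览量'].describe())	# the same quartiles plus count, mean and standard deviation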

In this figure, the lower quartile, median and upper quartile form a box with an internal divider, the so-called "box"; the lines extending from the box toward the minimum and maximum values are the so-called "whiskers"; the two ends of the whiskers mark the minimum and maximum; in addition, the boxplot draws outlier data points separately.

In the boxplot above it is not hard to see that the higher the popularity index, the larger the median number of page views. We can also see some outlier data points with far more views than the other articles. These "outliers" are what we call "hot articles".

At this point, the data visualization work is basically completed. After data visualization, the next step is data cleaning.

(3) Data cleaning

Many people compare data cleaning to "washing the vegetables" before "cooking them": the cleaner the data, the better the model performs. Data cleaning generally covers 4 situations (a short pandas sketch of these operations follows the list):

  • The first is handling missing data: if the missing values exist in a backup system, we try to recover them; if not, we can drop the incomplete records, or fill them in with the mean of the other records, a random value, or 0. This filling process is called data repair (imputation).

  • The second is handling duplicate data: if two rows are exactly identical, simply delete one. But if two different rows share the same primary key, for example the same ID number appears with two different addresses, we need to check whether any auxiliary information (such as a timestamp) can help us decide which one to keep; if we cannot decide, we can only delete one at random or keep both.

  • The third is handling erroneous data: for example, if a product's sales volume or sales amount is negative, the record needs to be deleted or converted into a meaningful positive value. Another example is a field representing a percentage or probability: if its value is greater than 1, it is also logically erroneous data.

  • The fourth is handling unusable data: this is about reorganizing the data format. For example, if some prices are recorded in RMB and others in US dollars, they need to be unified first. Another common example is converting "yes" and "no" into 1 and 0 so the values can be fed into a machine learning model.

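Here is a minimal pandas sketch of how these four situations are typically handled, using a small hypothetical DataFrame (not code we need for this project):

import pandas as pd

# Hypothetical data containing the four kinds of problems
df = pd.DataFrame({'price': [10.0, None, -3.0, 10.0], 'on_sale': ['yes', 'no', 'yes', 'yes']})
df['price'] = df['price'].fillna(df['price'].mean())	# 1. missing data: fill with the column mean
df = df.drop_duplicates()	# 2. duplicate data: drop identical rows
df = df[df['price'] >= 0]	# 3. erroneous data: drop negative prices
df['on_sale'] = df['on_sale'].map({'yes': 1, 'no': 0})	# 4. unusable format: map yes/no to 1/0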
So how do we check whether there is dirty data in our dataset?

As far as our project's dataset is concerned, if you look carefully you may have noticed in the DataFrame above that the value of the "number of reposts" field in the row with index 6 is NaN, which stands for Not a Number. In Python it represents a value that cannot be represented or operated on. This is typical dirty data.

[Figure: the DataFrame row with a NaN value in the reposts column]

We can count all the NaNs with the DataFrame's isna().sum() method. This tells us both whether there are any NaNs and how many times they occur. If there are too many NaNs, the quality of the dataset is poor and we need to find out what is wrong with the data source.

df_ads.isna().sum()	# count how many NaNs appear in each column

The output is as follows:

Number of Likes      0
Reposts             37
Popularity Index     0
Article Rating       0
Views                0
dtype: int64

The output shows that the "number of reposts" field has 37 NaN values in our dataset. For a dataset of several thousand records this is not a lot. So how do we deal with it? It is also very simple: we can use the dropna() API to delete the data rows that contain NaN.

df_ads = df_ads.dropna()	# drop the rows that contain NaN
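If we preferred to keep those 37 rows instead of dropping them, the mean-filling approach mentioned earlier is a one-liner; a minimal sketch of that alternative (we stick with dropna() in this project):

df_ads = df_ads.fillna(df_ads.mean(numeric_only=True))	# alternative: fill NaNs in numeric columns with the column mean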

You may think that we just found outliers (popular articles) through boxplots. Are these dirty data? That's a good question, and it doesn't have a solid answer.

If we remove the outliers, the model will fit the ordinary data more neatly. But real life does contain such outliers, and a model trained without them will not handle them as well when they appear. So this is a matter of balance and trade-offs.

We can train a model that includes these outliers, and a model that does not, and compare them. Here, I propose to keep these "outliers".

Now, we have completed the simple cleaning of this data.

(4) Feature engineering

Feature engineering is a specialized subfield of machine learning, and I think it is the most creative part of data processing. How well the feature engineering is done greatly affects the performance of a machine learning model.

Let me give an example of what feature engineering is. Do you know the BMI index? It equals weight divided by the square of height, and computing it is itself a small piece of feature engineering.

What does that mean? It means that after this transformation, the BMI index replaces the original two features, weight and height, and on its own gives a reasonably objective description of our body shape.

Therefore, after this feature engineering, we can feed the BMI index into a machine learning model as a new feature for health assessment.
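Here is a minimal pandas sketch of that BMI example, using hypothetical weight and height columns:

import pandas as pd

df_health = pd.DataFrame({'weight_kg': [70, 55, 90], 'height_m': [1.75, 1.62, 1.80]})	# hypothetical health data
df_health['BMI'] = df_health['weight_kg'] / df_health['height_m'] ** 2	# the engineered feature
df_health = df_health.drop(['weight_kg', 'height_m'], axis=1)	# the two original features are no longer needed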

You may ask what the benefit is. Taking the BMI example, it reduces the dimensionality of the feature dataset. Dimensionality is the number of features in a dataset. Keep in mind that every additional feature enlarges the feature space the model has to fit and increases the amount of computation. Therefore, discarding redundant features and reducing dimensionality makes machine learning models train faster.

This is just one of the many benefits of feature engineering. In addition, feature engineering can better represent business logic and improve the performance of machine learning models.

Since the problem of our project is relatively simple and the requirements for feature engineering are not high, we will not do feature engineering here for the time being.

(5) Construct feature set and label set

We said that features are the individual attributes that are collected, the variables fed into the machine learning model, while the label is the thing we want to predict, judge or classify. For every supervised learning algorithm we need to feed two sets of data into the model: a feature set and a label set. Therefore, before building a machine learning model, we need to construct the feature dataset and the label dataset.

The construction process is very simple: we only need to drop the data we don't need from the original dataset. In this project the features are the number of likes, the number of reposts, the popularity index and the article rating, so we only need to drop "views" from the original dataset.

X = df_ads.drop(['浏览量'], axis=1)	# feature set: drop the label column

And the label is the page views we want to predict, so we keep only the "views" field in the label dataset:

y = df_ads.浏览量	# label set

Let's take a look at what data is in the feature set and label set.

X.head()	# show the first few rows of the feature set
y.head()	# show the first few rows of the label set

Because a Notebook cell displays only one output, in practice you need to put these two display statements in different cells. Their outputs are shown in the figures below:

[Figure: the first few rows of the feature set X]

[Figure: the first few rows of the label set y]

You can see that all fields except page views remain in the feature dataset, while only page views is kept in the label dataset; in other words, the original dataset has been split into a machine learning feature set and label set.

Unsupervised learning algorithms do not need this step, because unsupervised algorithms do not have labels.

However, after the original data set is split vertically from the column dimension into feature sets and label sets, it needs to be further split horizontally from the row dimension.

Because machine learning does not end once a model is found from the training dataset: we need a validation dataset to check whether the model is good, and then a test dataset to check whether the model works on new data.

(6) Split the training set, validation set and test set

Before splitting, let me explain that in learning projects the validation step is often omitted to simplify the process. Today's project is relatively simple, so we also skip validation and split only a training set and a test set; the test set then plays the dual role of validation and testing.

When splitting, the proportion of data reserved for testing is generally 20% or 30%. However, if your data volume is very large, say more than 1 million records, you don't necessarily need to keep that much; generally, tens of thousands of test records are enough. Here I will split the data 80/20. For the actual split, we use train_test_split, the dataset splitting tool in the machine learning toolkit scikit-learn:

# Split the dataset into 80% (training set) and 20% (test set)
from sklearn.model_selection import train_test_split	# import the train_test_split tool
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Note that although the split is random, we need to specify a random_state value so that the program produces the same training set and test set every time it runs. If the training and test sets were split differently each run, comparing the model before and after parameter tuning would have no fixed baseline.

Now, the split of training set and test set is also completed, and you will find that the original data has now become four data sets, namely:

  • Feature training set (X_train)
  • Feature test set (X_test)
  • Label training set (y_train)
  • Label test set (y_test)

Now we have finished defining the problem and collecting and preprocessing the data. Without a clearly defined problem, we cannot choose a model in a targeted way.

Although collecting and preprocessing data may not look as "eye-catching" as selecting and optimizing models, it is actually the key to the success or failure of a machine learning project. This step breaks down into the 6 sub-steps shown in the following figure:

[Figure: the 6 sub-steps of data collection and preprocessing]

Among these 6 sub-steps, data visualization and feature engineering in particular have no fixed recipe to follow and are a real test of experience. They are not only a way to get a feel for the data at hand, but also the necessary preparation before "feeding" the data to the model.

In addition, there are two more points to note:

  • The first point is that the order of these sub-steps is not fixed. Take data visualization and feature engineering: in many cases we visualize first, get some ideas for feature engineering, carry out the feature engineering, and then visualize again. And some feature engineering, such as feature scaling, must be done after splitting the data;

  • Second, not all of these sub-steps are needed in every machine learning project. For example, unsupervised learning projects do not need to create label sets, and generally do not use validation sets and test sets.

Step 3: Choose an algorithm and build a model

In this step, we need to select an appropriate algorithm based on the relationship between features and labels, find an appropriate algorithm package corresponding to it, and then build a model by calling this algorithm package.

The process of selecting an algorithm is a test of the experience of data scientists. Specifically in our project, we can know from the above that there is an approximately linear relationship between some features and labels in our data set. Moreover, the labels of this data set are continuous variables, so it is suitable to use regression analysis to find the predictive function from features to labels.

Regression analysis is a statistical method for determining the quantitative relationship between two or more interdependent variables. Put plainly, it studies how the dependent variable changes when the independent variables change, and it is used to predict things like passenger flow, rainfall and sales.

However, there are many algorithms for regression analysis, such as linear regression, polynomial regression, Bayesian regression, etc., so which one should you choose? In fact, this is determined according to the relationship between features and labels.

In the previous visualization process, we speculated that there may be a linear relationship between features and labels, and simply verified it with a scatter plot. So here, we choose to use the linear regression algorithm to model.

The linear regression algorithm is one of the simplest and most basic machine learning algorithms. It is essentially the process of finding a parameter for each feature variable. You are probably familiar with the formula for simple (one-variable) linear regression:

y = a·x + b

For simple linear regression, the internal parameters are the unknown slope and intercept. In machine learning we call the slope a the weight, written as w, and the intercept b the bias, written as b. So in machine learning the simple linear regression formula is also written as:

y = w·x + b

In our project, there are 4 features in the data set, so it is:

y = w1·x1 + w2·x2 + w3·x3 + w4·x4 + b

Therefore, our model has 5 internal parameters to determine: the weights of the 4 features and one bias (intercept). However, we do not need to implement these formulas ourselves; they are all encapsulated in the toolkit. You just need a general impression of how the algorithm works.
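Just for intuition, the prediction itself is nothing more than this weighted sum plus the bias; a minimal NumPy sketch with made-up numbers (the real weights are learned by the model later):

import numpy as np

w = np.array([50.0, 35.0, 30000.0, 3000.0])	# hypothetical weights for likes, reposts, popularity index, rating
b = -120000.0	# hypothetical bias (intercept)
x = np.array([1500, 60, 5.5, 3.0])	# feature values of one hypothetical article
print(np.dot(w, x) + b)	# its predicted number of views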

After determining the algorithm, let's take a look at what kind of algorithm package to call to build a model.

For machine learning, the most commonly used algorithm toolkit is scikit-learn, sklearn for short. It is the most widely used open-source Python machine learning library and can fairly be called a machine learning powerhouse. sklearn provides a large number of tools for data mining, covering data preprocessing, visualization, cross-validation and a wide range of machine learning algorithms.

Although we have chosen to use the linear regression algorithm, there are many linear regression algorithm packages in sklearn, such as the basic linear regression algorithm LinearRegression, and the Lasso regression and Ridge regression derived from it.

So which one is the algorithm package suitable for our project? In fact, our general method of selecting an algorithm package is to start with the simplest algorithm that can solve the problem until a satisfactory result is obtained. For this project, we choose LinearRegression, which is also the most common and fundamental regression algorithm package in machine learning. I will introduce you to other regression algorithm packages slowly in the future.

Calling LinearRegression to build a model is very simple, just two lines of code:

from sklearn.linear_model import LinearRegression	# import the linear regression algorithm/model
# Similar to instantiating a Python class
linereg_model = LinearRegression()	# create a model with the linear regression algorithm

As you can see, I named this linear regression model "linereg_model". So, is the model built now? Yes, the model is created and we can start training it. However, one thing worth pointing out is that when building a model, you usually need to know what external parameters it has and, if necessary, specify their values.

A model has two kinds of parameters: internal parameters and external parameters. Internal parameters are part of the algorithm itself and we do not set them manually; the weight w and intercept b just mentioned are internal parameters of the linear regression model. External parameters, also called hyperparameters, are values we set ourselves when creating the model.

For the LinearRegression model, its external parameters mainly include two Boolean values:

  • fit_intercept, the default value is True, which represents whether to calculate the intercept of the model.
  • normalize, the default value is False, which represents whether to normalize the feature X before regression.

However, for relatively simple models, the default external parameter settings are also good choices, so it is also possible for us to directly call the model without explicitly specifying external parameters. In the above code, I just used the default value of the external parameter directly when creating the model.
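If you did want to set an external parameter explicitly, it is just a keyword argument at construction time; a minimal sketch (note that recent scikit-learn releases have removed the normalize argument from LinearRegression, so only fit_intercept is shown here):

linereg_model = LinearRegression(fit_intercept=True)	# explicitly request an intercept (this is also the default)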

Step 4: Train the model

Training the model means using the feature variables and known labels in the training set to gradually fit the function based on the loss on the current samples, determine the optimal internal parameters, and finally complete the model. Although this sounds complicated, we accomplish all of it by calling the fit method.

The fit method is the core step of machine learning; it encapsulates many of the concrete learning algorithms. We only need to pass the feature training set and the label training set into the fit method together as parameters.

linereg_model.fit(X_train, y_train)	# train on the training set: fit the function and determine the internal parameters

The output after running this statement is as follows:

LinearRegression()

With that, we have completed the training of the model. You may find it strange: since training the model is the core of machine learning, why is it just one line of code? In fact, this is what I have emphasized repeatedly: thanks to excellent machine learning libraries, we can achieve very powerful functionality with one or two lines of code. So don't underestimate that simple fit statement above; it is the key process by which the model learns on its own.

During this process, the core of fit is reducing the loss so that the function's mapping from features to labels becomes more and more accurate. So how exactly does it reduce the loss? Here I drew a picture showing how the model goes from very unreliable to more reliable.

[Figure: the model improving as the loss decreases during fitting]

This fitting process is also a process in which the machine learning algorithm optimizes its internal parameters. The key to optimizing parameters is to reduce the loss.

So what is loss? It's actually a penalty for bad predictions and a measure of how good the model is. The loss is the error of the model, also known as cost or penalty. Although there are many names, they all have the same meaning, which is the embodiment of the gap between the current predicted value and the actual value. It is a numeric value indicating how accurate the model's predictions are for a single sample. If the model's predictions are perfectly accurate, the loss is 0; if not, there is a loss.

In machine learning, what we want is of course a relatively small loss. However, to judge whether a model is good we cannot look at a single sample; we need to find a function whose average loss over all data samples is small. The loss on a sample can roughly be understood as the geometric distance between the predicted value and the true value: the larger the average distance, the larger the error and the worse the model. In the figure below, the model on the left has a larger average loss than the model on the right; its average loss over all data points is clearly greater.

[Figure: a model with larger average loss (left) versus one with smaller average loss (right)]

Therefore, for each set of candidate parameters, the machine uses the loss function to compute the average loss over the sample dataset. The optimization process of machine learning is the process of gradually reducing the loss on the training set. For the fitting of today's regression model, the key step is to optimize the model's parameters through gradient descent to minimize the error on the training set. This is the optimization carried out by the fit statement we just called.


Here, the way linear regression computes its error is easy to understand: it is the sum of squared residuals between the true values and the predicted values over the dataset. What about gradient descent? To give you an intuition, I use a picture to show how gradient descent walks step by step down the loss curve to the point of minimum loss.

[Figure: gradient descent stepping down the loss curve toward the minimum]

Just as in the picture, gradient descent is like walking down a mountain. Imagine you are standing high up, and your goal is to find the set of parameters that minimizes the loss on the training dataset. So which way should you go to reach the minimum loss? The key is to use derivatives to find the direction of each step and make sure you always move in the direction where the loss decreases.

So, you can see how important direction is. The reason why machine learning optimization can fit the best model is because it can find the way forward. You see, not only we humans need direction, but even AI needs the right direction.
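To make the idea concrete, here is a minimal NumPy sketch of gradient descent for a one-feature linear regression. It only illustrates the loss-reducing loop on toy data; it is not how scikit-learn's fit is implemented internally:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])	# toy feature values
y = np.array([3.1, 5.0, 7.2, 8.9, 11.1])	# toy labels, roughly y = 2x + 1

w, b = 0.0, 0.0	# start from an arbitrary weight and bias
lr = 0.01	# learning rate: the size of each downhill step

for _ in range(2000):
    error = (w * x + b) - y	# residuals of the current predictions
    grad_w = 2 * (error * x).mean()	# derivative of the mean squared loss w.r.t. w
    grad_b = 2 * error.mean()	# derivative of the mean squared loss w.r.t. b
    w -= lr * grad_w	# step in the direction that lowers the loss
    b -= lr * grad_b

print(w, b)	# should end up close to 2 and 1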

So far, we have completed the establishment and training of the model. Next, let’s take a look at how to evaluate and optimize the trained model so that it can estimate the number of article views as accurately as possible.

Step 5: Evaluate and optimize the model

We just said that gradient descent minimizes the error while fitting the model to the training set; at that stage, the algorithm adjusts the model's internal parameters. When evaluating the model on the validation set or test set, we instead optimize the hyperparameters (the model's external parameters) to reduce the error.

For this, machine learning toolkits such as scikit-learn provide commonly used tools and metrics for evaluating on the validation set and test set and computing the current error. For example, the R-squared score or the MSE (mean squared error) metric can be used to judge how good a regression model is.

Before we start evaluating the model, though, I'd like you to think about something: among our 5 hands-on steps there is no separate step for "using the model to predict page views". Why is that? In fact, that step is included in step 5, "evaluating and optimizing model performance", and it is the first thing we do in step 5.

Specifically, in the "model evaluation and optimization" step, after predicting the page views of the test set, we compare the predicted results with the test set's known true values to measure the model's performance. This whole process is itself cyclic and iterative. I have summarized the loop in the following figure:

[Figure: the cyclic process of model evaluation and optimization]

For this project, to predict the page views of the test set, you only need to use the predict method in the trained model linereg_model to make predictions on X_test (feature test set), and this method will return the prediction result of the test set.

y_pred = linereg_model.predict(X_test)	# predict the y values of the test set

In almost every machine learning project you can use the predict method to make predictions. It applies the model to any dataset of the same format: validation sets, test sets, and of course the training set itself.

Here I should clarify that to simplify the process, we don't really have multiple loops of validation and testing. Therefore, in this project, X_test acts as both a test set and a validation set.

After getting the prediction results, we use the following code to put together the test set's original feature data, the true label values and the model's predicted label values so we can display and compare them.

df_ads_pred = X_test.copy()	# test set feature data
df_ads_pred['浏览量真值'] = y_test	# true label values of the test set
df_ads_pred['浏览量预测值'] = y_pred	# predicted label values of the test set
df_ads_pred	# display the data

The output is as follows:

[Figure: test-set features alongside true and predicted view counts]

You can see that the predicted view counts are fairly close to the true values, and for some articles the model's prediction is very accurate. For example, the row numbered 145 has an actual view count of 119,501 and a predicted view count of 110,710. That's a pretty good result.

What if you want to see what the current model looks like? You can print each feature's weight and the model's bias through LinearRegression's coef_ and intercept_ attributes; these are exactly the model's internal parameters.

print("当前模型的4个特征的权重分别是:",linereg_model.coef_)
print("'当前模型的截距(偏置)是:",linereg_model.intercept_)

The output is as follows:

The weights of the current model's 4 features are: [48.08395224 34.73062229 29730.13312489 2949.62196343]
The intercept (bias) of the current model is: -127493.90606857173

That is to say, the linear regression formula of our current model is:

y = 48.08·x1 (likes) + 34.73·x2 (reposts) + 29730.13·x3 (popularity index) + 2949.62·x4 (rating) − 127493.91
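As a quick sanity check (a small sketch, not part of the original workflow), we can reproduce the model's prediction for one test row by hand from coef_ and intercept_ and compare it with predict():

import numpy as np

row = X_test.iloc[0]	# one article from the test set
manual = np.dot(linereg_model.coef_, row.values) + linereg_model.intercept_	# w·x + b computed by hand
print(manual, linereg_model.predict(X_test.iloc[[0]])[0])	# the two numbers should match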

At this point, however, the machine learning project is not over yet. Finally, we give the current model an evaluation score:

print("线性回归预测评分",linereg_model.score(X_test,y_test))#评估模型

In machine learning, two metrics are commonly used to evaluate regression models: the R-squared score and the MSE (mean squared error), and most machine learning toolkits provide tools for both. You don't need to dig deeper into this for now; just know that the score API here uses the R-squared score to evaluate the model.
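If you want to compute both metrics explicitly, scikit-learn's metrics module provides them; a minimal sketch using the y_test and y_pred from above:

from sklearn.metrics import r2_score, mean_squared_error

print("R-squared:", r2_score(y_test, y_pred))	# same value as linereg_model.score(X_test, y_test)
print("MSE:", mean_squared_error(y_test, y_pred))	# mean squared error of the predictions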

In the end we get this result:

Linear regression prediction score: 0.740552064611524

It can be seen that the R-squared value is about 0.74. So what does this mean?

Generally speaking, the R-squared value lies between 0 and 1, and the larger it is, the better the fitted regression model. We now get an R-squared value of about 0.74, and we cannot really tell whether that is satisfactory without comparing it against other models.
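For reference, the R-squared score is one minus the ratio of the model's squared prediction error to the total variance of the true values:

R² = 1 − Σ(y_true − y_pred)² / Σ(y_true − mean(y_true))²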

How high the score is depends on how hard the dataset is to predict, and on the model type and its parameters. Moreover, the R-squared score is not the only evaluation criterion for linear regression models.

What you do need to know is that if the model's evaluation score is not ideal, we go back to step 3, adjust the model's external parameters, and retrain it. If the results are still unsatisfactory, we consider choosing a different algorithm and creating a new model. If, unfortunately, the new model still performs poorly, we go back to step 2 and check whether there is a problem with the data.

[Figure: iterating back through the earlier steps when the evaluation score is not ideal]

This is why, I have always emphasized that machine learning projects are an iterative process, and excellent models are the product of iterations.

When the model passes the evaluation, you can solve practical problems, and the machine learning project is basically over.

Summary

Through a hands-on project predicting the view counts of soft articles, we saw that a machine learning project goes through 5 steps. The first step is to clarify the project goal by defining the problem; the second is data collection and preprocessing, whose focus is converting the data into a format machine learning can work with, so that in the third step we can select an algorithm suited to the problem and build the model.

Once we have the model, the fourth step is to train it and fit the function. Finally, we evaluate and optimize the trained model. In this last step, step 5, the focus is on iterating the evaluation, finding the best hyperparameters, and settling on the final model.


Origin: blog.csdn.net/qq_43300880/article/details/125441756