Machine learning notes (5): regression models

1. Linear regression model

1. The generalized linear model
A linear model predicts the target y as a linear combination of the input features x:
y = w1x1 + w2x2 + … + wnxn + b
coef_ is the coefficient vector w = [w1, w2, …, wn], and intercept_ is the intercept b.
2. Ordinary least squares
Ordinary least squares fits a linear model with coefficients w = (w_1, …, w_p) that minimizes the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear model.
Formally, it solves: min_w ||Xw − y||²₂
Ordinary least squares relies on the features being independent of one another. When the features are correlated, the columns of the design matrix become approximately linearly dependent and the matrix approaches singularity. This makes the least-squares estimate highly sensitive to random errors in the observed target and can produce large variance. In practice, collected data is often highly correlated, which can seriously distort the results.
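
To make this sensitivity concrete, here is a minimal NumPy sketch (my addition, not part of the original post): the two feature columns are nearly identical, and tiny random errors in y move the least-squares coefficients dramatically.

import numpy as np

rng = np.random.default_rng(0)
x1 = np.linspace(0, 1, 50)
X = np.column_stack([x1, x1 + 1e-6 * rng.standard_normal(50)])  # two nearly collinear columns
y_clean = X @ np.array([1.0, 2.0])
for _ in range(3):
    y = y_clean + 1e-4 * rng.standard_normal(50)  # tiny random error in the targets
    w, *_ = np.linalg.lstsq(X, y, rcond=None)     # solve min_w ||Xw - y||^2
    print(w)  # the coefficient estimates swing widely between runs
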
2.1. Function description
The function description below is adapted from a Zhihu post (credited in the original as "Miss Apple ye").

Least squares linear regression: sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)
Parameters:
1. fit_intercept: boolean, optional, default True. Whether to calculate the intercept; it is calculated by default. If you work with centered data you can consider setting it to False and ignoring the intercept, but in general the intercept should be kept.
2. normalize: boolean, optional, default False. Normalization switch, off by default; this parameter is ignored when fit_intercept is set to False. If True, the input features are normalized as (X − X_mean)/||X|| before regression; it is generally better to do the standardization before training the model. If False, you can standardize beforehand with sklearn.preprocessing.StandardScaler.
3. copy_X: boolean, optional, default True. If True (the default), X is copied; otherwise X may be overwritten.
4. n_jobs: int, optional, default 1. The number of jobs used for the computation; -1 means all CPUs are used.
Attributes:
coef_: array, shape (n_features,) or (n_targets, n_features). Regression coefficients (slopes).
intercept_: the intercept b.
Methods:
1. fit(X, y, sample_weight=None): fits the model.
   X: array or sparse matrix, shape [n_samples, n_features]
   y: array, shape [n_samples, n_targets]
   sample_weight: array, shape [n_samples]; per-sample weights, passed in as an array (sample_weight was added in version 0.17).
2. predict(X): prediction method; returns the predicted values y_pred.
3. get_params(deep=True): returns the parameter settings of the regressor.
4. score(X, y, sample_weight=None): scoring function; returns the R² score, which is at most 1 and may be negative.

2.2. Code

from sklearn import linear_model
import numpy as np

# Fit ordinary least squares on a tiny toy dataset
reg = linear_model.LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
x = np.array([[1, 2], [2, 3], [2, 4]])
y = np.array([1, 2, 3])
reg.fit(x, y)
print(reg.coef_, reg.intercept_)   # fitted coefficients w and intercept b
print(reg.predict([[4, 6]]))       # predict for a new sample
print(reg.score(x, y))             # R² on the training data

2. Random forest

1. Random forest
A random forest is, simply put, bagging + decision trees: a forest composed of many decision trees.
2. Building the decision trees
Common decision tree algorithms:

  • ID3, which builds the tree using information gain.
  • C4.5, which builds the tree using the information gain ratio.
  • CART, which builds the tree using the Gini index.

The greater the information gain, the stronger the feature's ability to reduce the entropy of the samples (entropy measures the degree of disorder).
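
As a small illustration (my addition, not from the original post; the helper names are hypothetical), entropy and information gain can be computed directly:

import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, split_mask):
    # Entropy of the parent minus the weighted entropy of the two children
    n = len(labels)
    left, right = labels[split_mask], labels[~split_mask]
    return entropy(labels) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

labels = np.array([0, 0, 0, 1, 1, 1])
split = np.array([True, True, True, False, False, True])
print(information_gain(labels, split))  # higher values mean the split reduces entropy more
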
3. The use of bagging
Bagging draws data from the original dataset with replacement (bootstrapping): randomly draw n samples per round, repeat for k rounds to obtain k sample sets, and build k models (decision trees, in the case of random forest). For classification problems the final result is decided by voting; for regression problems the final result is obtained by (weighted) averaging.
P.S. A random forest also samples the features: the features selected differ each round, so an individual tree does not necessarily see all of the features.
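
A minimal sketch of the bootstrap step (plain NumPy; the details here are my assumptions, not from the original post): each round draws n indices with replacement, so each tree sees a different resample.

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)          # stand-in for the original dataset
k, n = 3, len(data)
for round_ in range(k):
    idx = rng.integers(0, n, size=n)   # draw n indices with replacement
    sample = data[idx]                 # bootstrap sample for one tree
    print(f"round {round_}: {sample}")
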
4. The parameters we need to tune when calling the API are (a tuning sketch follows this list):

  • Number of decision trees
  • Number of feature attributes considered
  • Maximum depth of the decision trees
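
Here is a hedged sketch of tuning these three knobs with scikit-learn's GridSearchCV (my addition; the grid values and the synthetic dataset are illustrative assumptions, not from the original post):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
param_grid = {
    "n_estimators": [10, 50, 100],   # number of decision trees
    "max_features": ["sqrt", None],  # number of features tried at each split
    "max_depth": [None, 5, 10],      # maximum depth of each tree
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)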

5. Code demonstration (Python)
(1) Implemented with scikit-learn:

# 1. Read the data
import pandas as pd
import numpy as np

data = pd.read_csv(r'C:\Users\13056\Documents\Tencent Files\1305638814\FileRecv\123.csv', sep=',', encoding="gbk")

X = np.array(data)[:748, -3:-1]   # feature columns
y = np.array(data)[:748, -1]      # target column (1-D, to avoid a shape warning in fit)
x_train = X[:-200]   # all but the last 200 samples for training
x_test = X[-200:]    # last 200 samples for testing
y_train = y[:-200]
y_test = y[-200:]
title = ['1', '2', "3"]   # feature names (unused below)
# 750 samples in total, each with 3 features

# 2. Random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn import metrics
import matplotlib.pyplot as plt

bt = RandomForestClassifier(n_estimators=2, max_depth=None)
'''
class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None,
min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False,
n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
n_estimators: the number of decision trees in the forest
criterion: the function used to build the trees, 'gini' by default
max_depth: the maximum depth of the trees
min_samples_split: the minimum number of samples required to split an internal node
min_samples_leaf: the minimum number of samples required at a leaf node
min_weight_fraction_leaf: the minimum weighted fraction of the total sample weights required at a leaf node
random_state: the seed used by the random number generator when growing the trees
'''
# The target is continuous, but RandomForestClassifier expects integer class labels,
# so the original workaround scales y by 1000 and casts to int
# (see the regressor sketch below for the idiomatic alternative)
y_train *= 1000
bt.fit(x_train, y_train.astype('int'))
preds = bt.predict(x_test) / 1000   # undo the scaling on the predictions
plt.scatter(y_test, preds, alpha=0.5, s=6)   # predicted vs. actual values
plt.show()
print(metrics.r2_score(y_test, preds))       # R² on the test set
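
Since the target here is continuous, the idiomatic scikit-learn estimator is RandomForestRegressor, which averages the trees' predictions instead of voting. A minimal sketch (my addition, not from the original post; it reuses the train/test split and metrics import from the code above):

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, max_depth=None, random_state=0)
rf.fit(x_train, y_train / 1000)   # undo the x1000 scaling applied for the classifier
preds = rf.predict(x_test)
print(metrics.r2_score(y_test, preds))   # R² on the test set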

Source: blog.csdn.net/weixin_45743162/article/details/114250563