Regression Analysis and Forecasting: A Technical Overview

Regression analysis is a method for predicting one variable from a group of other variables. Put informally, it predicts how likely something is to occur based on how strongly it correlates with several other things. The goal of regression analysis is to find the model that best relates the input variables to the output variable.

Regression methods can be classified in three ways: by the number of independent variables, by the type of the dependent variable, and by the shape of the regression line.

1) By the number of independent variables, regression is divided into simple (single-variable) regression analysis and multiple regression analysis. Simple regression has only one independent variable; multiple regression has two or more.

2) By the type of the dependent variable, regression is divided into linear regression analysis and nonlinear regression analysis.

3) By the shape of the regression line: if the analysis includes only one independent variable and one dependent variable, and their relationship can be approximated by a straight line, it is called simple linear regression analysis; if the analysis includes two or more independent variables and the relationship between the dependent and independent variables is nonlinear, it is called multiple nonlinear regression analysis.

1. Linear regression

Linear regression is one of the best-known modeling techniques. In linear regression, the data are modeled with a linear prediction function, and the unknown model parameters are estimated from the data. Such models are called linear models. In a linear model the dependent variable is continuous, the independent variables may be continuous or discrete, and the regression line is linear.

1) Simple linear regression

The goal of regression analysis is to find the model that best relates the input variables to the output variable. More precisely, regression analysis is the process of determining the relationship between a variable Y and one or more variables X.

Y is usually called the response, the output, or the dependent variable; X is called the input, the regressor, the explanatory variable, or the independent variable. Linear regression uses a best-fit straight line (the regression line) to establish the relationship between the dependent variable Y and one or more independent variables X, as shown in Figure 1. It can be expressed as follows.

Y = a + b × X + e

Here a is the intercept, b is the slope of the regression line, and e is the error term.

Finding the regression line means determining the regression coefficients a and b. Assuming that the variance of Y is constant, these coefficients can be computed with the least squares method, which minimizes the error between the actual data points and the estimated regression line; the parameters that yield the minimum error are the ones we want. This residual sum of squares is often called the sum of squared errors of the regression line, denoted SSE, as follows.

$$ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \hat{y}_i = a + b x_i $$

Figure 1. Simple linear regression

As shown in Figure 2, the sum of squared errors of the regression line is the sum, over all samples, of the squared difference between each sample's observed value and the corresponding point on the regression line.

Figure 2. Schematic of the sum of squared errors of the regression line
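The least squares fit described above has a well-known closed form for the single-variable case. The following is a minimal sketch in plain Scala, assuming the model Y = a + b × X + e; the object and method names (`SimpleLeastSquares`, `fit`, `sse`) are illustrative and do not come from the original article.

```scala
// Closed-form least squares for the simple model Y = a + b * X + e.
object SimpleLeastSquares {
  /** Returns (a, b): the intercept and slope that minimize the SSE. */
  def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    require(xs.length == ys.length && xs.length >= 2, "need at least two paired samples")
    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // b = sum((x - meanX) * (y - meanY)) / sum((x - meanX)^2)
    val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
    require(sxx > 0.0, "x values must not all be equal")
    val b = sxy / sxx
    val a = meanY - b * meanX
    (a, b)
  }

  /** Sum of squared errors between the observed ys and the line a + b * x. */
  def sse(xs: Seq[Double], ys: Seq[Double], a: Double, b: Double): Double =
    xs.zip(ys).map { case (x, y) =>
      val residual = y - (a + b * x)
      residual * residual
    }.sum
}
```

For example, `SimpleLeastSquares.fit(Seq(1.0, 2.0, 3.0, 4.0), Seq(3.0, 5.0, 7.0, 9.0))` recovers the line through those points, and `sse` evaluated at the fitted coefficients is the quantity the fit minimizes.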

2) Multiple linear regression

Multiple linear regression extends simple linear regression to several predictor variables. The response variable Y is modeled as a linear function of several predictors, that is, it is predicted by a linear combination of the attributes. The basic form is as follows.

$$ f(x) = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b $$

Here w = (w_1, ..., w_d) is the weight vector and b is the intercept.

Linear regression models are highly interpretable: the weight vector expresses intuitively how important each attribute is to the prediction. For example, to predict whether it will rain today, a model whose weight vector and intercept have been learned from historical data can weigh each attribute to decide whether it will rain today.

Here X1 represents wind, X2 represents humidity, and X3 represents air quality.

When training the model, we want the predicted values to be as close as possible to the true values, that is, to make the error as small as possible. This error is expressed as the mean squared error, so solving a multiple linear regression model means finding the parameters that minimize the mean squared error.
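To make the training objective concrete, the sketch below evaluates the mean squared error of a linear combination of attributes plus an intercept. It is a plain Scala illustration, not code from the original article; the names `MultipleLinearModel`, `predict`, and `meanSquaredError` are assumptions chosen for this example.

```scala
object MultipleLinearModel {
  /** Prediction f(x) = w1*x1 + ... + wd*xd + b. */
  def predict(weights: Array[Double], intercept: Double, x: Array[Double]): Double = {
    require(weights.length == x.length, "weights and features must have the same length")
    weights.zip(x).map { case (w, xi) => w * xi }.sum + intercept
  }

  /** Mean squared error over (features, actual value) samples. */
  def meanSquaredError(weights: Array[Double], intercept: Double,
                       samples: Seq[(Array[Double], Double)]): Double = {
    require(samples.nonEmpty, "need at least one sample")
    samples.map { case (x, y) =>
      val error = predict(weights, intercept, x) - y
      error * error
    }.sum / samples.length
  }
}
```

Training then amounts to searching for the `weights` and `intercept` that make `meanSquaredError` as small as possible on the training samples.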

3) Advantages and disadvantages of linear regression

Linear regression is one of the most commonly used algorithms for regression tasks. In its simplest form it fits the data with a continuous hyperplane; for example, with only two variables it fits a straight line. If the variables in the data set are linearly related, the fit is quite good.

Linear regression is intuitive to understand and interpret, and it can be regularized to avoid overfitting. In addition, a linear model is easy to update with new data using stochastic gradient descent. However, linear regression performs poorly on nonlinear relationships, is not flexible enough to capture complex patterns, and adding the right interaction or polynomial terms is difficult and time-consuming.
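As an illustration of why a linear model is cheap to update with new data, here is a sketch of a single stochastic gradient descent step for one new observation, with an optional L2 (ridge) penalty as one common form of regularization. The object name `OnlineUpdate`, the parameter `l2`, and the choice of squared-error loss are assumptions for this example.

```scala
object OnlineUpdate {
  /**
   * One SGD step on the loss 0.5 * error^2 + 0.5 * l2 * ||w||^2
   * for a single observation (x, y). Returns (new weights, new intercept).
   */
  def step(weights: Array[Double], intercept: Double,
           x: Array[Double], y: Double,
           stepSize: Double, l2: Double): (Array[Double], Double) = {
    require(weights.length == x.length, "weights and features must have the same length")
    val prediction = weights.zip(x).map { case (w, xi) => w * xi }.sum + intercept
    val error = prediction - y
    // Gradient w.r.t. each weight is error * x_j + l2 * w_j; the intercept is not penalized.
    val newWeights = weights.zip(x).map { case (w, xi) => w - stepSize * (error * xi + l2 * w) }
    val newIntercept = intercept - stepSize * error
    (newWeights, newIntercept)
  }
}
```

Each new observation costs one pass over the feature vector, which is why the model can be refreshed incrementally instead of being refitted from scratch.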

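2. The SGD linear regression algorithm in Spark MLlib

The SGD linear regression algorithm in Spark MLlib is implemented by the LinearRegressionWithSGD class. The class is based on stochastic gradient descent without regularization and trains a linear regression model on an RDD of (label, feature vector) pairs.

Each (label, feature vector) pair describes a set of features and the label that corresponds to them. The algorithm iterates with a specified step size; the number of iterations is given by a parameter, and the fraction of samples used to compute the gradient in each iteration is also given by a parameter.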
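The roles of the three parameters can be sketched in plain Scala without Spark. This is not MLlib's implementation: it assumes a constant step size, no intercept, squared-error loss, and a mini-batch drawn by keeping each sample with probability `miniBatchFraction`; the `seed` parameter and the object name `MiniBatchSgd` are hypothetical additions for reproducibility and illustration.

```scala
import scala.util.Random

object MiniBatchSgd {
  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  /** samples are (label, features) pairs; returns the learned weights. */
  def train(samples: Seq[(Double, Array[Double])],
            stepSize: Double,
            numIterations: Int,
            miniBatchFraction: Double,
            seed: Long = 42L): Array[Double] = {
    require(samples.nonEmpty, "need at least one sample")
    val rng = new Random(seed)
    val dim = samples.head._2.length
    var weights = Array.fill(dim)(0.0)
    for (_ <- 1 to numIterations) {
      // Keep each sample with probability miniBatchFraction (1.0 keeps all of them).
      val batch = samples.filter(_ => rng.nextDouble() < miniBatchFraction)
      if (batch.nonEmpty) {
        val gradient = Array.fill(dim)(0.0)
        for ((label, features) <- batch) {
          val error = dot(weights, features) - label
          for (j <- 0 until dim) gradient(j) += error * features(j)
        }
        // Move against the average gradient of the batch.
        weights = weights.zip(gradient).map { case (w, g) => w - stepSize * g / batch.length }
      }
    }
    weights
  }
}
```

With `miniBatchFraction = 1.0` every iteration uses the full data set, so the loop reduces to ordinary batch gradient descent; smaller fractions trade gradient accuracy for cheaper iterations.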

The implementing class, LinearRegressionWithSGD, has the following fields.

class LinearRegressionWithSGD private (
    private var stepSize: Double,
    private var numIterations: Int,
    private var miniBatchFraction: Double
)

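1) The LinearRegressionWithSGD constructor

Constructing a LinearRegressionWithSGD instance with default values uses the following settings.

{stepSize: 1.0, numIterations: 100, miniBatchFraction: 1.0}

The parameters have the following meanings.

  • stepSize is the step size of each iteration.
  • numIterations is the number of iterations performed in a single run.
  • miniBatchFraction is the fraction of the samples used to compute the gradient in each iteration.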

2) The LinearRegressionWithSGD training function

The training function LinearRegressionWithSGD.train has many overloads; the one with the most complete parameter list is shown here. Its signature is as follows.

def train(
    input: RDD[LabeledPoint],
    numIterations: Int,
    stepSize: Double,
    miniBatchFraction: Double,
    initialWeights: Vector): LinearRegressionModel

The parameters numIterations, stepSize, and miniBatchFraction have the same meanings as in the constructor. The other two parameters are as follows.

  • input is the RDD of training data; each element consists of a feature vector and its corresponding label.
  • initialWeights is a set of initial weights, one per feature.

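3. An example of the SGD linear regression algorithm in Spark MLlib

This example trains a model on a data set: it builds a simple linear model to predict the label values and computes the mean squared error to evaluate how well the predicted values match the actual values.

The example builds the prediction model with the LinearRegressionWithSGD algorithm in the following steps.
① Load the data. The data is stored as a text file.
② Build the prediction model. Set the number of iterations to 100 (the listing below additionally passes a step size of 0.1), leave the remaining parameters at their defaults, and train to produce the model.
③ Print the coefficients of the prediction model.
④ Evaluate the model on the training samples and compute the training error.

The data for this example is stored in the file lrws_data.txt, which provides 67 data points, one per line. Each line consists of 1 label value and 8 feature values, in the following format.

label,feature1 feature2 feature3 feature4 feature5 feature6 feature7 feature8

The first value is the label, separated from the feature values by a comma (","); the feature values are separated from one another by spaces. The first 5 lines of data are as follows.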

-0.4307829,-1.63735562648104 -2.00621178480549 -1.86242597251066 -1.02470580167082 -0.522940888712441 -0.863171185425945 -1.04215728919298 -0.864466507337306

-0.1625189,-1.98898046126935 -0.722008756122123 -0.787896192088153 -1.02470580167082 -0.522940888712441 -0.863171185425945 -1.04215728919298 -0.864466507337306

-0.1625189,-1.57881887548545 -2.1887840293994 1.36116336875686 -1.02470580167082 -0.522940888712441 -0.863171185425945 0.342627053981254 -0.155348103855541

-0.1625189,-2.16691708463163 -0.807993896938655 -0.787896192088153 -1.02470580167082 -0.522940888712441 -0.863171185425945 -1.04215728919298 -0.864466507337306

0.3715636,-0.507874475300631 -0.458834049396776 -0.250631301876899 -1.02470580167082 -0.522940888712441 -0.863171185425945 -1.04215728919298 -0.864466507337306
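The line format above can be parsed without Spark. The following sketch splits a line into its label and feature array; `LineParser` and `parse` are illustrative names, and tolerating extra whitespace around the values is an assumption of this example rather than something the original article specifies.

```scala
object LineParser {
  /** Parses "label,f1 f2 ... fn" into (label, features). */
  def parse(line: String): (Double, Array[Double]) = {
    val parts = line.split(",")
    require(parts.length == 2, s"expected exactly one comma in: $line")
    val label = parts(0).trim.toDouble
    // Feature values are separated by one or more whitespace characters.
    val features = parts(1).trim.split("\\s+").map(_.toDouble)
    (label, features)
  }
}
```

Applied to any of the five sample lines, `LineParser.parse` yields one label and an array of 8 feature values.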

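In this example, each column of the data is treated as a feature, and the data set is used to build the prediction model. The code is as follows.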

    import java.text.SimpleDateFormat
    import java.util.Date
    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
    import org.apache.spark.{SparkConf, SparkContext}

    /**
     * Compute the MSE of the regression fit.
     * Train a model on multiple data points, then use the model to predict values.
     * Formula: f(x) = a1*x1 + a2*x2 + a3*x3 + ...
     */
    object LinearRegression2 {

      // Suppress unnecessary log output
      Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
      Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

      // Program entry point: configure Spark to run locally on one thread
      val conf = new SparkConf().setAppName("LinearRegression2").setMaster("local[1]")
      val sc = new SparkContext(conf)

      def main(args: Array[String]): Unit = {
        // Load the data set
        val data = sc.textFile("/home/hadoop/exercise/lpsa2.data", 1)
        // Parse each line into a LabeledPoint(label, features)
        val parsedData = data.map { line =>
          val parts = line.split(",")
          LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" ").map(_.toDouble)))
        }
        // Build the model: 100 iterations, step size 0.1
        val numIterations = 100
        val model = LinearRegressionWithSGD.train(parsedData, numIterations, 0.1)
        // Pair each actual value with its predicted value
        val valuesAndPreds = parsedData.map { point =>
          val prediction = model.predict(point.features)
          (point.label, prediction) // (actual value, predicted value)
        }

        // Print the weights
        val weights = model.weights
        println("model.weights" + weights)
        // Save the (actual, predicted) pairs to a timestamped directory
        val isString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date())
        val path = "/home/hadoop/exercise/" + isString + "/results"
        valuesAndPreds.saveAsTextFile(path)
        // Compute the mean squared error
        val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.reduce(_ + _) / valuesAndPreds.count
        println("Mean squared error on the training data set: " + MSE)
        sc.stop()
      }
    }

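Running the program prints the coefficients of the regression formula and the mean squared error on the training data set. The predicted value for each data point is stored in the results file, with each entry in the format (actual value, predicted value).

4. Logistic regression

Logistic regression is used to find the probability that an event succeeds or fails. To be clear from the start, logistic regression should be considered only when the target variable is categorical, and it is mainly used for binary classification problems.

1) An example of logistic regression

A doctor wants to use features such as a tumor's size X1, length X2, and type X3 to decide whether a patient's tumor is malignant or benign. Here the target variable y is categorical (0 means benign, 1 means malignant). Linear regression predicts through a linear relationship between x and y, but because y is now categorical it can only take values such as 0 and 1, or 0, 1, and 2, rather than any value from negative infinity to positive infinity. A sigmoid function is therefore introduced, namely $y = \frac{1}{1 + e^{-x}}$. Its input x can range from negative infinity to positive infinity, its output y always lies between 0 and 1, and y = 0.5 when x = 0, as shown in Figure 3(a).

At x = 0, y = 0.5; this is the decision boundary. Determining whether a tumor is benign or malignant really means finding the boundary that separates the two classes of samples, that is, the decision boundary, as shown in Figure 3(b).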

Figure 3. Sigmoid function curve and schematic of the decision boundary
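The sigmoid behavior described here is easy to check numerically. The sketch below assumes the standard logistic sigmoid 1 / (1 + e^(-x)); the object name `Sigmoid` is illustrative.

```scala
object Sigmoid {
  /** Standard logistic sigmoid: maps any real x to a value between 0 and 1. */
  def apply(x: Double): Double = 1.0 / (1.0 + math.exp(-x))
}
```

`Sigmoid(0.0)` sits exactly on the decision boundary, large positive inputs approach 1, and large negative inputs approach 0.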

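2) The logistic regression function

In the classification setting, a trained logistic regression classifier is really just a set of weights. When a test sample arrives, the weights are combined linearly with the sample's features to produce a value z, that is, $z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m$, where $x_1, x_2, \ldots, x_m$ are the features of the sample and m is the number of dimensions. The sigmoid function is then applied to obtain $\sigma(z) = \frac{1}{1 + e^{-z}}$. The meaning of the logistic regression function is shown in Figure 4.

Figure 4. Schematic of the meaning of the logistic regression function

Because the sigmoid function has domain (-inf, inf) and range (0, 1), the most basic logistic regression classifier is suited to classifying binary targets.

The method uses the mathematical properties of the sigmoid function to map the result into (0, 1) and sets a probability threshold (not necessarily 0.5): a result above the threshold is classified as 1, and a result below it is classified as 0.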
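The scoring and thresholding steps can be sketched as follows. This is an illustration in plain Scala, not code from the original article; `LogisticClassifier`, the separate `intercept` argument, and the default threshold of 0.5 are assumptions for this example.

```scala
object LogisticClassifier {
  /** Probability of class 1: the sigmoid of the weighted sum z. */
  def probability(weights: Array[Double], intercept: Double, features: Array[Double]): Double = {
    require(weights.length == features.length, "weights and features must have the same length")
    val z = intercept + weights.zip(features).map { case (w, x) => w * x }.sum
    1.0 / (1.0 + math.exp(-z))
  }

  /** Returns 1 if the probability is above the threshold, otherwise 0. */
  def classify(weights: Array[Double], intercept: Double, features: Array[Double],
               threshold: Double = 0.5): Int =
    if (probability(weights, intercept, features) > threshold) 1 else 0
}
```

Raising the threshold makes the classifier more conservative about predicting class 1, which is one reason the threshold need not be fixed at 0.5.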

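One common way to solve for the parameters of a logistic regression model is to build the objective function from the logarithm of the maximum likelihood estimate and then solve it with gradient descent.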
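A minimal sketch of that procedure: minimize the average negative log-likelihood with batch gradient descent. It assumes labels coded as 0 or 1 and folds the intercept into the weights by having the caller supply a constant 1.0 as the first feature; the names `LogisticGradientDescent`, `logLoss`, and `train` are illustrative.

```scala
object LogisticGradientDescent {
  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  private def probability(weights: Array[Double], x: Array[Double]): Double =
    sigmoid(weights.zip(x).map { case (w, xi) => w * xi }.sum)

  /** Average negative log-likelihood of the 0/1 labels under the given weights. */
  def logLoss(weights: Array[Double], samples: Seq[(Array[Double], Int)]): Double = {
    val eps = 1e-12 // keeps log() away from zero
    samples.map { case (x, y) =>
      val p = math.min(math.max(probability(weights, x), eps), 1.0 - eps)
      if (y == 1) -math.log(p) else -math.log(1.0 - p)
    }.sum / samples.length
  }

  /** Batch gradient descent on logLoss, starting from all-zero weights. */
  def train(samples: Seq[(Array[Double], Int)], stepSize: Double, numIterations: Int): Array[Double] = {
    require(samples.nonEmpty, "need at least one sample")
    val dim = samples.head._1.length
    var weights = Array.fill(dim)(0.0)
    for (_ <- 1 to numIterations) {
      val gradient = Array.fill(dim)(0.0)
      for ((x, y) <- samples) {
        // The gradient of the log-loss for one sample is (p - y) * x.
        val error = probability(weights, x) - y
        for (j <- 0 until dim) gradient(j) += error * x(j)
      }
      weights = weights.zip(gradient).map { case (w, g) => w - stepSize * g / samples.length }
    }
    weights
  }
}
```

Maximizing the log-likelihood and minimizing `logLoss` are the same problem with the sign flipped, which is why gradient descent applies directly.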

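3) Advantages and disadvantages of logistic regression

Logistic regression is especially well suited to classification, particularly when the dependent variable is binary, for example deciding whether an email is spam, whether a patient has a certain disease, or whether an ad will be clicked. Its advantages are that the model is simple, easy to understand, and convenient to implement, especially for large-scale linear classification.

Its disadvantage is that it needs a large sample size, because maximum likelihood estimation is less effective than least squares when the sample is small. Logistic regression is also sensitive to multicollinearity among the independent variables, so a correlation analysis of the independent variables is needed to remove linearly correlated variables and prevent overfitting and underfitting.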
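One simple form of the correlation analysis mentioned above is to compute the Pearson correlation coefficient between pairs of feature columns and inspect pairs whose coefficient is close to 1 or -1. The sketch below is an illustration in plain Scala; `FeatureCorrelation` and `pearson` are names chosen for this example, and choosing a cutoff for "too correlated" is left to the analyst.

```scala
object FeatureCorrelation {
  /** Pearson correlation coefficient between two equally long feature columns. */
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length && xs.length >= 2, "need at least two paired values")
    val meanX = xs.sum / xs.length
    val meanY = ys.sum / ys.length
    val covariance = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val spreadX = math.sqrt(xs.map(x => (x - meanX) * (x - meanX)).sum)
    val spreadY = math.sqrt(ys.map(y => (y - meanY) * (y - meanY)).sum)
    require(spreadX > 0.0 && spreadY > 0.0, "a constant column has no defined correlation")
    covariance / (spreadX * spreadY)
  }
}
```

Two columns that are exact linear functions of each other give a coefficient of 1 or -1, which signals that one of them carries no additional information for the model.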



Origin blog.csdn.net/yuidsd/article/details/92418232