4 Logistic Regression
The hypothesis of logistic regression is a sigmoid function, which squashes inputs over an arbitrarily large range into the interval (0, 1); because of this squashing it is also called the logistic function
\[ h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}} \tag{4.1} \]
\(h_\theta(x)\) represents the probability that y = 1 for the input x
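A minimal NumPy sketch of the hypothesis in Eq. (4.1); the parameter values below are only illustrative:

```python
import numpy as np

def sigmoid(z):
    """Squash any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis of Eq. (4.1): the estimated P(y = 1 | x)."""
    return sigmoid(theta @ x)

theta = np.array([0.0, 1.0])   # hypothetical parameters
x = np.array([1.0, 0.0])       # leading 1 is the bias feature
p = h(theta, x)                # theta^T x = 0, so p = 0.5
```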
4.1 Decision Boundary
If we predict y = 1 when \(h_\theta(x)\ge0.5\) and y = 0 when \(h_\theta(x)<0.5\), it follows that y = 1 when \(\theta^Tx\ge0\) and y = 0 when \(\theta^Tx<0\)
After the parameters \(\theta\) have been fitted, the equation \(\theta^Tx=0\) defines the decision boundary
- The decision boundary is a property of the hypothesis, not of the training set: once the parameters \(\theta\) are given, the boundary is determined
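A small sketch of this equivalence, using hypothetical fitted parameters; the boundary below is the line \(x_1+x_2=3\):

```python
import numpy as np

def predict(theta, x):
    """y = 1 exactly when theta^T x >= 0, i.e. h_theta(x) >= 0.5."""
    return 1 if theta @ x >= 0 else 0

# Hypothetical fitted parameters: decision boundary is x1 + x2 = 3
theta = np.array([-3.0, 1.0, 1.0])
p1 = predict(theta, np.array([1.0, 2.0, 2.0]))  # 2 + 2 >= 3 -> class 1
p0 = predict(theta, np.array([1.0, 1.0, 1.0]))  # 1 + 1 < 3  -> class 0
```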
4.2 Cost Function for a Single Sample
If the cost function of linear regression were reused, the sigmoid function would make it non-convex, and gradient descent could get stuck in a local optimum.
\[ \text{Cost}(h_\theta(x),y)=\begin{cases} -\log(h_\theta(x)), & \text{if } y=1 \\ -\log(1-h_\theta(x)), & \text{if } y=0 \end{cases} \tag{4.2} \]
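Equation (4.2) can be sketched directly; the sample values below are only illustrative:

```python
import numpy as np

def cost_single(h_x, y):
    """Eq. (4.2): cost of a single example, given h_theta(x) and label y."""
    return -np.log(h_x) if y == 1 else -np.log(1.0 - h_x)

c_match = cost_single(0.9, 1)   # confident and correct -> small cost
c_miss = cost_single(0.9, 0)    # confident and wrong   -> large cost
```

The cost approaches 0 when the prediction agrees with the label and grows without bound as a confident prediction turns out wrong.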
4.3 Logistic Regression Cost Function
\[ \begin{aligned} J(\theta) &=\frac{1}{m} \sum_{i=1}^{m} \operatorname{cost}(h_{\theta}(x^{(i)}), y^{(i)}) \\ &=-\frac{1}{m}[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}(x^{(i)})+(1-y^{(i)}) \log (1-h_{\theta}(x^{(i)}))] \end{aligned}\tag{4.3} \]
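A vectorized sketch of Eq. (4.3); the tiny data set is hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J(theta, X, y):
    """Eq. (4.3): average cross-entropy cost over m examples.
    X is (m, n) with a leading column of ones; y is (m,) of 0/1 labels."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

# Tiny hypothetical data set
X = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0])
cost_at_zero = J(np.zeros(2), X, y)   # h = 0.5 everywhere -> cost = ln 2
```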
Different algorithms can then be used to minimize this cost function
4.3.1 Gradient Descent
\[ \begin{aligned} \theta_j&=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\\ &=\theta_j-\frac{\alpha}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \end{aligned}\tag{4.4} \]
- The update rule has the same form as in multivariate linear regression; the only difference is the hypothesis function \(h_\theta(x)\)
- When features span very different ranges, feature scaling can likewise be used to make gradient descent converge faster
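The update (4.4) can be sketched as batch gradient descent; the data set and hyperparameters below are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Batch gradient descent for Eq. (4.4); X has a leading ones column."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m   # dJ/dtheta
        theta -= alpha * grad
    return theta

# Hypothetical 1-D data: y flips to 1 once the feature exceeds ~2
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(int)
```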
4.3.2 Other Advanced Algorithms
- Conjugate Gradient Method
- BFGS
- L-BFGS
These require no manual choice of the learning rate and usually converge faster than gradient descent, but they are more complex algorithms
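In practice these optimizers come off the shelf; for example, SciPy's `scipy.optimize.minimize` with `method='BFGS'` only needs the cost and its gradient (the small non-separable data set below is hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J(theta, X, y):
    """Cost of Eq. (4.3)."""
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / len(y)

def grad(theta, X, y):
    """Gradient of J, the summation term of Eq. (4.4)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# Hypothetical (non-separable) data set
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])
res = minimize(J, np.zeros(2), args=(X, y), jac=grad, method='BFGS')
theta = res.x   # fitted parameters; no learning rate was chosen by hand
```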
4.4 Multiclass Classification
Take each class in turn as the positive class and treat all remaining classes as negative; repeating this for every class yields one hypothesis function, i.e. one classifier, per class
To predict on a new sample, run every classifier on it and choose the class whose classifier reports the highest probability as the prediction
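A sketch of one-vs-all prediction, with hypothetical already-fitted parameter vectors for K = 3 classes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(thetas, x):
    """thetas is (K, n), one parameter vector per class; return the class
    whose classifier reports the highest probability h_theta(x)."""
    probs = sigmoid(thetas @ x)
    return int(np.argmax(probs))

# Hypothetical fitted parameters for K = 3 classes
thetas = np.array([[ 2.0, -1.0],
                   [ 0.0,  0.5],
                   [-2.0,  1.0]])
label = predict_one_vs_all(thetas, np.array([1.0, 5.0]))  # class 2 wins
```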