Kernel Methods

1.  Feature maps

To fit a non-linear function, we need a more expressive family of models. Consider fitting a cubic function $y=\theta_3x^3+\theta_2x^2+\theta_1x+\theta_0$, where $\theta$ is the parameter vector to be estimated. Concretely, let the function $ \phi: \mathbb{R}\rightarrow \mathbb{R}^4 $ be defined as

\begin{equation} \phi(x) = \left [  \begin{array}{c} 1 \\ x \\ x^2 \\ x^3 \end{array} \right ] \in \mathbb{R}^4 \label{xd4}\end{equation}

Let $\theta \in \mathbb{R}^4$ be the vector containing $\theta_0, \theta_1, \theta_2, \theta_3$ as entries, in the same order as the components of $\phi(x)$. Then we can rewrite the cubic function in $x$ as: $$\theta_3x^3+\theta_2x^2+\theta_1x+\theta_0=\theta^T\phi(x)$$

Thus, a cubic function of the variable $x$ can be viewed as a linear function over the variables $\phi(x)$. To distinguish between the two sets of variables, in the context of kernel methods, we call the "original" input value the input attributes of the problem (in this case, $x$). When the original input is mapped into some new set of quantities $\phi(x)$, we call those new quantities the features. We call $\phi$ a feature map, which maps the input attributes to the features.
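
As a concrete illustration, here is a minimal Python sketch of this feature map (the particular $\theta$ and $x$ values are arbitrary, chosen only to check the identity $\theta^T\phi(x)$ against the cubic polynomial):

```python
import numpy as np

def phi(x):
    """Cubic feature map: maps a scalar x to (1, x, x^2, x^3) in R^4."""
    return np.array([1.0, x, x**2, x**3])

# theta holds (theta_0, theta_1, theta_2, theta_3); values are arbitrary.
theta = np.array([0.5, -1.0, 2.0, 3.0])
x = 1.7

# The cubic polynomial evaluated directly ...
direct = theta[0] + theta[1] * x + theta[2] * x**2 + theta[3] * x**3
# ... equals a linear function of the features phi(x).
linear_in_features = theta @ phi(x)

assert np.isclose(direct, linear_in_features)
```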

2. Least mean square regression

Let $S=\{(x^{(i)}, y^{(i)})\}_{i=1}^n$ be a training set. Generally, we use a function $h_\theta(x)$ to approximate $y$. To measure the estimation error, we define the cost function:

$$ J(\theta) = \frac{1}{2}\sum_{i=1}^{n}(h_\theta(x^{(i)}) - y^{(i)})^2 $$

To minimize $J(\theta)$, let us consider the gradient descent algorithm. The batch gradient descent algorithm gives the update rule

$$\theta := \theta + \alpha(-\frac{\partial J(\theta)}{\partial \theta}) = \theta + \alpha \sum_{i=1}^{n}(y^{(i)} - h_\theta(x^{(i)}) )\frac{\partial h_\theta(x^{(i)})}{\partial \theta} $$

The batch gradient descent algorithm has to scan through the entire training set before taking a single step, which is a costly operation when $n$ is large. In practice, most parameter values near the minimum are reasonably good approximations of the true minimum, which motivates a cheaper, noisier alternative.

The stochastic gradient descent algorithm updates $\theta$ using a single training example at a time:

$$ \theta := \theta + \alpha(y^{(i)} - h_\theta(x^{(i)}) )\frac{\partial h_\theta(x^{(i)})}{\partial \theta} $$

Therefore, when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.

When $h_\theta(x) = \theta^T\phi(x)$, the stochastic gradient descent update rule is

\begin{equation} \theta := \theta + \alpha(y^{(i)} - h_\theta(x^{(i)})) \phi(x^{(i)}) \label{theta}\end{equation}
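
A minimal Python sketch of this update with the cubic feature map above; the synthetic data, number of passes, and learning rate are illustrative choices, not taken from the text:

```python
import numpy as np

def phi(x):
    # Cubic feature map: x -> (1, x, x^2, x^3)
    return np.array([1.0, x, x**2, x**3])

rng = np.random.default_rng(0)
n = 50
xs = rng.uniform(-1, 1, size=n)
ys = 1.0 + 2.0 * xs - 0.5 * xs**3 + 0.1 * rng.standard_normal(n)  # synthetic targets

theta = np.zeros(4)
alpha = 0.1  # learning rate (illustrative)

for epoch in range(100):
    for i in range(n):
        feats = phi(xs[i])
        pred = theta @ feats                              # h_theta(x^(i)) = theta^T phi(x^(i))
        theta = theta + alpha * (ys[i] - pred) * feats    # stochastic update in theta-space
```

Each update touches every component of $\theta$, i.e. it costs $O(d)$ time and memory per step, which is what makes this rule expensive when the feature dimension $d$ is very large.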

With the feature map in (\ref{xd4}), at each step we need to store $ \theta \in \mathbb{R}^4 $ and the features $\phi(x^{(i)})$ of the training set, an $n\times 4$ array. More generally, when $\phi(x) \in \mathbb{R}^d$ and $d$ is very large ($d \gg n$), the update rule in (\ref{theta}) is not efficient. However, (\ref{theta}) shows that, if $\theta$ is initialized to $0$, it remains a linear combination of the vectors $\phi(x^{(1)}), \dots, \phi(x^{(n)})$ throughout training. So, assume that at some point $\theta$ can be represented as

\begin{equation} \theta = \sum_{i=1}^{n} \beta_i \phi(x^{(i)}) \label{thetac} \end{equation}

From the batch gradient descent update rule, we have

\begin{eqnarray} \theta &:=& \theta + \alpha \sum_{i=1}^{n}(y^{(i)} - h_\theta(x^{(i)})) \phi(x^{(i)}) \nonumber \\ & = & \sum_{i=1}^{n} \beta_i \phi(x^{(i)}) + \alpha \sum_{i=1}^{n}(y^{(i)} - h_\theta(x^{(i)})) \phi(x^{(i)}) \nonumber \\ & = & \sum_{i=1}^{n} \left( \beta_i + \alpha (y^{(i)} - h_\theta(x^{(i)})) \right) \phi(x^{(i)}) =  \sum_{i=1}^{n} \beta_i' \phi(x^{(i)})  \end{eqnarray}

Using the equation above, we see that the new $ \beta_i $ depends on the old one via:

\begin{equation} \beta_i := \beta_i + \alpha (y^{(i)} - h_\theta(x^{(i)}))  \end{equation}

Substituting $\theta = \sum_{j=1}^{n} \beta_j \phi(x^{(j)})$ into $h_\theta(x^{(i)}) = \theta^T\phi(x^{(i)})$ gives

\begin{eqnarray} \forall i \in \{1,\dots,n\} \quad \beta_i &:=& \beta_i + \alpha (y^{(i)} -\sum_{j=1}^{n} \beta_j \phi(x^{(j)})^T\phi(x^{(i)})) \nonumber \\ &=& \beta_i + \alpha (y^{(i)} -\sum_{j=1}^{n} \beta_j \langle \phi(x^{(j)}), \phi(x^{(i)}) \rangle ) \label{betan}  \end{eqnarray}

With the representation in (\ref{betan}), at each step we only need to store $ \beta \in \mathbb{R}^n $ and the $n \times n$ matrix of inner products $\langle\phi(x^{(i)}), \phi(x^{(j)})\rangle$, which can be precomputed once before training.
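
In code, this leads to the following minimal sketch (same illustrative synthetic data as above): the $n \times n$ Gram matrix is built once, and the training loop then touches only $\beta$ and that matrix. Here the Gram matrix is still built from explicit features; the next section shows it can be computed directly through a kernel function.

```python
import numpy as np

def phi(x):
    return np.array([1.0, x, x**2, x**3])

rng = np.random.default_rng(0)
n = 50
xs = rng.uniform(-1, 1, size=n)
ys = 1.0 + 2.0 * xs - 0.5 * xs**3 + 0.1 * rng.standard_normal(n)

# Precompute the n x n matrix of inner products <phi(x^(i)), phi(x^(j))>.
Phi = np.stack([phi(x) for x in xs])    # shape (n, d); only used to build the Gram matrix
gram = Phi @ Phi.T                      # gram[i, j] = <phi(x^(i)), phi(x^(j))>

beta = np.zeros(n)
alpha = 0.1  # learning rate (illustrative)

for epoch in range(100):
    # Simultaneous update of all beta_i using only the Gram matrix
    preds = gram @ beta                 # preds[i] = sum_j beta_j <phi(x^(j)), phi(x^(i))>
    beta = beta + alpha * (ys - preds)
```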

3. Kernel

We define the kernel corresponding to the feature map $\phi$ as the function $ K: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R} $ satisfying:

$$ K(x, z) \triangleq \langle \phi(x), \phi(z) \rangle $$

Using the kernel, (\ref{betan}) can be expressed as

$$ \beta_i := \beta_i + \alpha (y^{(i)} -\sum_{j=1}^{n} \beta_j K(x^{(j)}, x^{(i)})) $$

Likewise, the prediction made with the parameter $\theta$ at a point $x$ requires only kernel evaluations:

$$ \theta^T \phi(x) = \sum_{i=1}^{n} \beta_i K(x^{(i)}, x)$$

You may realize that, fundamentally, all we need to know about the feature map $ \phi(\cdot)$ is encapsulated in the corresponding kernel function $K(\cdot,\cdot)$. Therefore, we only need to ensure the existence of the feature map, but do not necessarily need to be able to write $\phi$ down explicitly.
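
For the cubic map in (\ref{xd4}) this is easy to see concretely: $\langle\phi(x),\phi(z)\rangle = 1 + xz + (xz)^2 + (xz)^3$, so the kernel can be evaluated without ever forming $\phi$. A small sketch (the training inputs and coefficients $\beta$ used in the prediction are arbitrary, only to show the computation):

```python
import numpy as np

def phi(x):
    return np.array([1.0, x, x**2, x**3])

def K(x, z):
    # Kernel of the cubic feature map, evaluated without forming phi:
    # <phi(x), phi(z)> = 1 + xz + (xz)^2 + (xz)^3
    xz = x * z
    return 1.0 + xz + xz**2 + xz**3

x, z = 0.3, -1.2
assert np.isclose(K(x, z), phi(x) @ phi(z))

# Prediction at a new point using only kernel evaluations:
# theta^T phi(x) = sum_i beta_i K(x^(i), x)
xs = np.array([-0.5, 0.1, 0.8])        # training inputs (illustrative)
beta = np.array([0.2, -0.7, 1.1])      # coefficients (illustrative)
x_new = 0.4
prediction = sum(b * K(xi, x_new) for b, xi in zip(beta, xs))
```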

From another point of view, the entries of (\ref{xd4}) can be seen as basis functions of the feature space, and the inner product as a projection onto them. We can therefore think of $K(x, z)$ as a measure of how similar $\phi(x)$ and $\phi(z)$ are, or of how similar $x$ and $z$ are. When the function values at $x$ and $z$ are treated as random variables, as in the next section, the kernel plays the role of their covariance.

4. Kernels and random variables

Consider the standard linear regression model with Gaussian noise:

\begin{equation} y = x^T w + \varepsilon \end{equation}

where $x$ is the input vector and $w$ is the vector of weights (the parameters $\theta$ above) of the linear model.

In feature space, the model is

\begin{equation} y = f(x) + \varepsilon  =\phi(x)^T w + \varepsilon \end{equation}

The observed values $y$ differ from the function values $f(x)$ by additive noise $\varepsilon$

$$ \varepsilon \sim \mathcal N(0, \sigma_n^2) $$

In the Bayesian formalism, we need to specify a prior over the parameters.

$$ w \sim \mathcal N(0, \Sigma_p) $$

Let the matrix $\Phi(X)$ be the aggregation of the columns $\phi(x^{(i)})$ for all cases in the training set. The analysis for this model is analogous to that of the standard linear model, so the predictive distribution at a test input $x_*$ becomes

$$ f_* | x_*, X,y \sim \mathcal N (\frac{1}{\sigma_n^2}\phi(x_*)^TA^{-1}\Phi(X)y, \phi(x_*)^TA^{-1}\phi(x_*)) $$

with $A=\sigma_n^{-2} \Phi(X) \Phi(X)^T + \Sigma_p^{-1}$

To make predictions with this equation we need to invert the matrix $A \in \mathbb{R}^{d\times d}$, which may not be convenient if $d$, the dimension of the feature space, is very large. However, we can rewrite the equation in the following way:

\begin{eqnarray} f_* | x_*, X,y \sim \mathcal N (& \phi(x_*)^T\Sigma_p\Phi(K+\sigma_n^2 I)^{-1}y, \nonumber \\ & \phi(x_*)^T \Sigma_p \phi(x_*) - \phi(x_*)^T\Sigma_p\Phi(K+\sigma_n^2 I)^{-1}\Phi^T\Sigma_p\phi(x_*)) \label{feature_space_random}\end{eqnarray} 

with $K = \Phi^T \Sigma_p \Phi $.

In this case, we need to invert the matrix $K+\sigma_n^2 I \in \mathbb{R}^{n \times n}$, which is more convenient when $n < d$. Notice that in (\ref{feature_space_random}) the feature space always enters in the form of $\Phi^T \Sigma_p \Phi$, $\phi_*^T \Sigma_p \Phi$, or $\phi_*^T \Sigma_p \phi_*$. Let us define

$$ k(x,x') = \phi(x)^T \Sigma_p \phi(x') $$

we call $k(\cdot,\cdot)$ a covariance function or kernel.
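
A minimal sketch of the kernelized predictive equations in (\ref{feature_space_random}); the feature map, prior covariance $\Sigma_p$, noise level $\sigma_n$, and data below are illustrative assumptions:

```python
import numpy as np

def phi(x):
    return np.array([1.0, x, x**2, x**3])   # cubic feature map (d = 4), used only inside k

def k(x, xp, Sigma_p):
    # Covariance function k(x, x') = phi(x)^T Sigma_p phi(x')
    return phi(x) @ Sigma_p @ phi(xp)

rng = np.random.default_rng(0)
n = 20
xs = rng.uniform(-1, 1, size=n)
ys = 1.0 + 2.0 * xs - 0.5 * xs**3 + 0.1 * rng.standard_normal(n)

Sigma_p = np.eye(4)    # prior covariance of the weights (illustrative)
sigma_n = 0.1          # noise standard deviation (illustrative)

# The feature space enters only through k(.,.):
K = np.array([[k(a, b, Sigma_p) for b in xs] for a in xs])    # n x n, equals Phi^T Sigma_p Phi
x_star = 0.3
k_star = np.array([k(x_star, xi, Sigma_p) for xi in xs])      # phi(x_*)^T Sigma_p Phi

inv_term = np.linalg.inv(K + sigma_n**2 * np.eye(n))          # only an n x n inverse is needed
mean = k_star @ inv_term @ ys                                 # predictive mean
var = k(x_star, x_star, Sigma_p) - k_star @ inv_term @ k_star # predictive variance
```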

Theorem (Mercer). Let $K: \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R} $ be given. Then for $K$ to be a valid (Mercer) kernel, it is necessary and sufficient that for any finite set $\{x^{(1)}, \dots, x^{(n)}\}$ $(n<\infty)$, the corresponding kernel matrix is symmetric positive semi-definite (PSD).
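
As a quick numerical sanity check of the Mercer condition, one can form the kernel matrix on an arbitrary finite point set and inspect its eigenvalues (a sketch; the tolerance accounts for floating-point round-off):

```python
import numpy as np

def K(x, z):
    # Kernel of the cubic feature map; any kernel built from an explicit
    # feature map satisfies the Mercer condition by construction.
    xz = x * z
    return 1.0 + xz + xz**2 + xz**3

xs = np.array([-1.3, -0.2, 0.5, 2.0])                  # arbitrary finite point set
gram = np.array([[K(a, b) for b in xs] for a in xs])   # kernel matrix

assert np.allclose(gram, gram.T)                       # symmetric
eigvals = np.linalg.eigvalsh(gram)
assert np.all(eigvals >= -1e-10)                       # positive semi-definite (up to round-off)
```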





Reposted from www.cnblogs.com/eliker/p/11311584.html