Definition of VC-dimension
In machine learning, we would like a quantitative measure of a learner's maximum expressive power; this measure is the VC-dimension.
Intuitive definition
Picture multi-dimensional data as points in space and the classifier (the learner) as a surface.
Given a number of points m, players A and B play an adversarial game:
- A: chooses the positions of the m points
- B: picks any subset of the m points (0 to m of them) as one class (in other words, assigns a 0/1 label to each of the m points)
- A: produces a parameter setting under which the learner classifies all m points correctly
B tries to make A's task as hard as possible.
If A can always succeed, the learner can handle every labeling of some arrangement of m points, so VC-dimension $\ge m$; otherwise VC-dimension $< m$.
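To make the game concrete, here is a minimal brute-force sketch (my own illustration, not from the notes) using the class of closed intervals on the real line, whose VC-dimension is 2. Player A fixes the points; the code plays B's role by enumerating every 0/1 labeling and checking whether some interval realizes it:

```python
from itertools import product

def interval_classify(x, a, b):
    # Label 1 iff x lies in the interval [a, b].
    return 1 if a <= x <= b else 0

def can_shatter(points):
    # B tries every 0/1 labeling; A must answer with an interval that
    # realizes it. For this class it suffices to try intervals whose
    # endpoints are among the points, plus one empty interval (a > b).
    candidates = [(a, b) for a in points for b in points] + [(1.0, 0.0)]
    for labels in product([0, 1], repeat=len(points)):
        if not any(all(interval_classify(x, a, b) == lab
                       for x, lab in zip(points, labels))
                   for a, b in candidates):
            return False  # B found a labeling A cannot realize
    return True
```

Any two distinct points are shattered, but three collinear points are not (the labeling 1, 0, 1 has no realizing interval), so the game certifies VC-dimension 2 for intervals.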
Logical definition
For a given m, if
$$\exists x_1,\cdots,x_m,\ \forall l_1,\cdots,l_m,\ \exists \theta,\ f(x_i;\theta)=l_i,\quad i=1,\cdots,m$$
then $\mathrm{VC}(f) \ge m$;
otherwise $\mathrm{VC}(f) < m$.
A more general mathematical definition
- A set system $(X, \mathcal{H})$ consists of a set $X$ and a class $\mathcal{H}$ of subsets of $X$, i.e. $\mathcal{H} \subseteq P(X)$. ($X$ is an instance space, $\mathcal{H}$ is a class of classifiers.)
- A set system $(X, \mathcal{H})$ shatters a set $A \subseteq X$ iff $\forall A' \subseteq A,\ \exists h \in \mathcal{H},\ A' = A \cap h$.
- The VC-dimension of $\mathcal{H}$ is $\mathrm{VC}(\mathcal{H}) = \max_{A \text{ is shattered by } \mathcal{H}} |A|$.
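For a finite set system, this definition can be evaluated literally. The following sketch (an illustration under the assumption that $X$ and $\mathcal{H}$ are small finite collections) checks shattering by counting distinct traces $A \cap h$ and takes the maximum over all subsets:

```python
from itertools import combinations

def shatters(H, A):
    # (X, H) shatters A iff every subset A' of A equals A ∩ h for some h,
    # i.e. the traces {A ∩ h : h in H} cover all 2^|A| subsets of A.
    traces = {frozenset(A & h) for h in H}
    return len(traces) == 2 ** len(A)

def vc_dimension(X, H):
    # VC(H) = size of the largest subset of X shattered by H.
    d = 0
    for k in range(1, len(X) + 1):
        if any(shatters(H, set(A)) for A in combinations(X, k)):
            d = k
    return d
```

For example, on $X = \{1, 2, 3\}$ the class of singletons (plus the empty set) has VC-dimension 1, while the full power set has VC-dimension 3.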
Applications of VC-dimension
With the amount of data fixed, as a model gains parameters its expressive power grows and the training error keeps shrinking, but the VC term in the generalization bound grows, so the test error first decreases and then increases. The increasing phase corresponds to what we usually call overfitting.
Main theorems
Definition
For a set system $(X, \mathcal{H})$, the shatter function $\pi_{\mathcal{H}}(n)$ is the maximum number of subsets of any set $A$ of size $n$ that can be expressed as $A \cap h$ for some $h \in \mathcal{H}$, i.e.
$$\pi_{\mathcal{H}}(n) = \max_{|A|=n} \big|\{A \cap h \mid h \in \mathcal{H}\}\big|$$
Lemma (Sauer)
For a set system $(X, \mathcal{H})$ whose VC-dimension equals $d$,
$$\pi_{\mathcal{H}}(n) \begin{cases} = 2^n & , n \le d \\ \le \dbinom{n}{\le d} & , n > d \end{cases}$$
where
$$\dbinom{n}{\le d} = \dbinom{n}{0} + \dbinom{n}{1} + \cdots + \dbinom{n}{d} \le n^d + 1$$
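Sauer's bound can be checked numerically. As an illustration (my own, assuming the interval class from earlier): real intervals have VC-dimension $d = 2$, and their traces on $n$ distinct points are exactly the contiguous runs plus the empty set, which meets the bound $\dbinom{n}{\le 2}$ with equality:

```python
from math import comb

def binom_le(n, d):
    # The quantity (n choose <= d) = C(n,0) + C(n,1) + ... + C(n,d).
    return sum(comb(n, i) for i in range(d + 1))

def interval_traces(n):
    # Distinct subsets of n ordered points cut out by a real interval:
    # one per contiguous run of points, plus the empty trace.
    return n * (n + 1) // 2 + 1
```

A quick loop confirms `interval_traces(n) == binom_le(n, 2)` for all $n$, and that $\dbinom{n}{\le d} \le n^d + 1$ as the lemma states.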
The Key Theorem
For sufficiently large $n$, namely $n \ge \dfrac{4}{\epsilon}$ and $n \ge \dfrac{1}{\epsilon}\Big(\log_2 \pi_{\mathcal{H}}(2n) + \log_2 \dfrac{2}{\delta}\Big)$:
given a training set $T$ with $|T| = n$,
$$\mathrm{Prob}\Big[\exists h,\ \mathrm{TrueErr}(h) \ge \epsilon,\ \mathrm{TrainErr}(h) = 0\Big] < \delta$$
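The two conditions on $n$ give a concrete sample-size recipe. A small sketch (my own illustration; it bounds the shatter function via Sauer's lemma as $\pi_{\mathcal{H}}(2n) \le (2n)^d + 1$) finds the smallest $n$ satisfying both:

```python
from math import ceil, log2

def sample_size(d, eps, delta):
    # Smallest n with n >= 4/eps and
    # n >= (1/eps) * (log2(pi_H(2n)) + log2(2/delta)),
    # where pi_H(2n) is bounded by Sauer's lemma as (2n)**d + 1.
    n = ceil(4 / eps)
    while n < (1 / eps) * (log2((2 * n) ** d + 1) + log2(2 / delta)):
        n += 1
    return n
```

Since the right-hand side grows only logarithmically in $n$, the loop always terminates; the resulting $n$ scales roughly like $\dfrac{d}{\epsilon} \log \dfrac{1}{\epsilon} + \dfrac{1}{\epsilon} \log \dfrac{1}{\delta}$.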