Definition of VC-dimension
In machine learning, we would like a quantitative measure of a learner's maximum expressive power; this measure is the VC-dimension.
Intuitive definition
Picture multi-dimensional data as points in space and the classifier (the learner) as a surface.
Given a number of points m, players A and B play an adversarial game:
- A: chooses the positions of the m points
- B: picks any subset of the m points (0 to m of them) as one class (in other words, assigns a 0/1 label to each of the m points)
- A: produces a parameter setting under which the learner classifies all m points correctly
B tries to make A's task as hard as possible.
If A can always succeed, the learner can handle every labeling of some arrangement of m points, so VC-dimension $\ge m$; otherwise VC-dimension $< m$.
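To make the game concrete, here is a minimal brute-force sketch (my own illustration, not from the notes) using the class of closed intervals on the real line, whose VC-dimension is 2. Player A fixes the points; the code plays B's role by enumerating every 0/1 labeling and checking whether some interval realizes it:

```python
from itertools import product

def interval_classify(x, a, b):
    # Label 1 iff x lies in the interval [a, b].
    return 1 if a <= x <= b else 0

def can_shatter(points):
    # B tries every 0/1 labeling; A must answer with an interval that
    # realizes it. For this class it suffices to try intervals whose
    # endpoints are among the points, plus one empty interval (a > b).
    candidates = [(a, b) for a in points for b in points] + [(1.0, 0.0)]
    for labels in product([0, 1], repeat=len(points)):
        if not any(all(interval_classify(x, a, b) == lab
                       for x, lab in zip(points, labels))
                   for a, b in candidates):
            return False  # B found a labeling A cannot realize
    return True
```

Any two distinct points are shattered, but three collinear points are not (the labeling 1, 0, 1 has no realizing interval), so the game certifies VC-dimension 2 for intervals.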
Logical definition
For a given m, if
$$\exists x_1,\cdots,x_m,\ \forall l_1,\cdots,l_m,\ \exists \theta,\ f(x_i;\theta)=l_i,\quad i=1,\cdots,m$$
then $\mathrm{VC}(f) \ge m$;
otherwise $\mathrm{VC}(f) < m$.
A more general mathematical definition
- A set system $(X, \mathcal{H})$ consists of a set $X$ and a class $\mathcal{H}$ of subsets of $X$, i.e. $\mathcal{H} \subseteq P(X)$. ($X$ is an instance space, $\mathcal{H}$ is a class of classifiers.)
- A set system $(X, \mathcal{H})$ shatters a set $A \subseteq X$ iff $\forall A' \subseteq A,\ \exists h \in \mathcal{H},\ A' = A \cap h$.
- The VC-dimension of $\mathcal{H}$ is $\mathrm{VC}(\mathcal{H}) = \max_{A \text{ is shattered by } \mathcal{H}} |A|$.
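For a finite set system, this definition can be evaluated literally. The following sketch (an illustration under the assumption that $X$ and $\mathcal{H}$ are small finite collections) checks shattering by counting distinct traces $A \cap h$ and takes the maximum over all subsets:

```python
from itertools import combinations

def shatters(H, A):
    # (X, H) shatters A iff every subset A' of A equals A ∩ h for some h,
    # i.e. the traces {A ∩ h : h in H} cover all 2^|A| subsets of A.
    traces = {frozenset(A & h) for h in H}
    return len(traces) == 2 ** len(A)

def vc_dimension(X, H):
    # VC(H) = size of the largest subset of X shattered by H.
    d = 0
    for k in range(1, len(X) + 1):
        if any(shatters(H, set(A)) for A in combinations(X, k)):
            d = k
    return d
```

For example, on $X = \{1, 2, 3\}$ the class of singletons (plus the empty set) has VC-dimension 1, while the full power set has VC-dimension 3.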
Applications of VC-dimension
With the amount of data fixed, as a model gains parameters its expressive power grows and the training error keeps shrinking, but the VC term in the generalization bound grows, so the test error first decreases and then increases. The increasing phase corresponds to what we usually call overfitting.
Main theorems
Definition
For a set system $(X, \mathcal{H})$, the shatter function $\pi_{\mathcal{H}}(n)$ is the maximum number of subsets of any set $A$ of size $n$ that can be expressed as $A \cap h$ for some $h \in \mathcal{H}$, i.e.
$$\pi_{\mathcal{H}}(n) = \max_{|A|=n} \big|\{A \cap h \mid h \in \mathcal{H}\}\big|$$
Lemma (Sauer)
For a set system $(X, \mathcal{H})$ whose VC-dimension equals $d$,
$$\pi_{\mathcal{H}}(n) \begin{cases} = 2^n & , n \le d \\ \le \dbinom{n}{\le d} & , n > d \end{cases}$$
where
$$\dbinom{n}{\le d} = \dbinom{n}{0} + \dbinom{n}{1} + \cdots + \dbinom{n}{d} \le n^d + 1$$
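Sauer's bound can be checked numerically. As an illustration (my own, assuming the interval class from earlier): real intervals have VC-dimension $d = 2$, and their traces on $n$ distinct points are exactly the contiguous runs plus the empty set, which meets the bound $\dbinom{n}{\le 2}$ with equality:

```python
from math import comb

def binom_le(n, d):
    # The quantity (n choose <= d) = C(n,0) + C(n,1) + ... + C(n,d).
    return sum(comb(n, i) for i in range(d + 1))

def interval_traces(n):
    # Distinct subsets of n ordered points cut out by a real interval:
    # one per contiguous run of points, plus the empty trace.
    return n * (n + 1) // 2 + 1
```

A quick loop confirms `interval_traces(n) == binom_le(n, 2)` for all $n$, and that $\dbinom{n}{\le d} \le n^d + 1$ as the lemma states.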
The Key Theorem
For sufficiently large $n$, namely $n \ge \dfrac{4}{\epsilon}$ and $n \ge \dfrac{1}{\epsilon}\Big(\log_2 \pi_{\mathcal{H}}(2n) + \log_2 \dfrac{2}{\delta}\Big)$:
given a training set $T$ with $|T| = n$,
$$\mathrm{Prob}\Big[\exists h,\ \mathrm{TrueErr}(h) \ge \epsilon,\ \mathrm{TrainErr}(h) = 0\Big] < \delta$$
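The two conditions on $n$ give a concrete sample-size recipe. A small sketch (my own illustration; it bounds the shatter function via Sauer's lemma as $\pi_{\mathcal{H}}(2n) \le (2n)^d + 1$) finds the smallest $n$ satisfying both:

```python
from math import ceil, log2

def sample_size(d, eps, delta):
    # Smallest n with n >= 4/eps and
    # n >= (1/eps) * (log2(pi_H(2n)) + log2(2/delta)),
    # where pi_H(2n) is bounded by Sauer's lemma as (2n)**d + 1.
    n = ceil(4 / eps)
    while n < (1 / eps) * (log2((2 * n) ** d + 1) + log2(2 / delta)):
        n += 1
    return n
```

Since the right-hand side grows only logarithmically in $n$, the loop always terminates; the resulting $n$ scales roughly like $\dfrac{d}{\epsilon} \log \dfrac{1}{\epsilon} + \dfrac{1}{\epsilon} \log \dfrac{1}{\delta}$.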