I. Purpose of the experiment:
A decision tree classifier uses a tree structure to assign samples to classes according to their attribute values; typical algorithms include ID3, C4.5, C5.0 and CART. This experiment implements ID3 decision tree induction in order to practice using information gain.
II. Software:
RStudio
III. Approach:
1. Compute the entropy Info(D) of the class (decision) attribute.
2. Compute the conditional entropy Info_A(D) of each attribute: Age, income, student and credit_rating.
3. Compute the information gain of each attribute: Gain(A) = Info(D) - Info_A(D).
4. Choose the attribute with the largest information gain as the split node and partition the data set on it (a minimal sketch of these steps follows this list).
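As a quick, minimal sketch of steps 1-4 (not part of the lab script; the helper names entropy, cond_entropy and gain are ours), the same quantities can be computed compactly with table() and split():

entropy <- function(x) {                       # Info(D): entropy of a class vector
  p <- table(x) / length(x)
  p <- p[p > 0]                                # drop empty levels so 0 * log2(0) never appears
  -sum(p * log2(p))
}
cond_entropy <- function(attr, cls) {          # Info_A(D): entropy weighted by |D_j| / |D|
  sum(sapply(split(cls, attr),
             function(s) length(s) / length(cls) * entropy(s)))
}
gain <- function(attr, cls) entropy(cls) - cond_entropy(attr, cls)   # Gain(A)

Applied to the example data set defined in section IV, entropy(data$buys_computer) should be about 0.940 and gain(data$Age, data$buys_computer) about 0.246.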
IV. Source code:
# Example data set (14 training samples; buys_computer is the class attribute)
data <- data.frame(
  Age=c("youth","youth","middle_aged","senior","senior","senior","middle_aged","youth","youth","senior","youth","middle_aged","middle_aged","senior"),
  income=c("high","high","high","medium","low","low","low","medium","low","medium","medium","medium","high","medium"),
  student=c("no","no","no","no","yes","yes","yes","no","yes","yes","yes","no","yes","no"),
  credit_rating=c("fair","excellent","fair","fair","fair","excellent","excellent","fair","fair","fair","excellent","excellent","fair","excellent"),
  buys_computer=c("no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no"),
  stringsAsFactors = FALSE   # keep the columns as character vectors (matters on R < 4.0)
)
info <- function(data) {
  rowCount <- nrow(data)    # number of rows = number of samples
  colCount <- ncol(data)
  class_result <- levels(factor(data[, colCount]))   # values of the class attribute: "no", "yes"
  class_Count <- c()                                 # counts per class value
  class_Count[class_result] <- rep(0, length(class_result))
  for (i in 1:rowCount) {   # count how often each class value occurs
    if (data[i, colCount] %in% class_result) {
      temp <- as.character(data[i, colCount])
      class_Count[temp] <- class_Count[temp] + 1
    }
  }
  # 1. Overall entropy Info(D)
  p <- c()
  info <- 0
  for (i in 1:length(class_result)) {
    p[i] <- class_Count[i] / rowCount
    info <- info - p[i] * log2(p[i])
  }
  # 2. Conditional entropy Info_A(D) of the attribute in column k
  infoA_D <- function(data, k) {
    split_A <- levels(factor(data[, k]))   # possible values of the attribute
    # contingency table: one row per attribute value, one column per class value
    split_Acount <- data.frame(matrix(0, nrow = length(split_A), ncol = length(class_result),
                                      dimnames = list(split_A, class_result)))
    for (i in 1:rowCount) {                # count every (attribute value, class value) pair
      if (data[i, k] %in% split_A & data[i, colCount] %in% class_result) {
        temp_A <- as.character(data[i, colCount])
        temp2  <- as.character(data[i, k])
        split_Acount[temp2, temp_A] <- split_Acount[temp2, temp_A] + 1
      }
    }
    split_Acount$Count_D <- split_Acount$no + split_Acount$yes   # row totals |D_j|
    p_A <- c()
    info_A <- 0
    D_j <- 0
    no_rate <- 0
    yes_rate <- 0
    for (i in 1:length(split_A)) {
      p_A[i] <- split_Acount$Count_D[i] / rowCount            # |D_j| / |D|
      no_rate[i]  <- split_Acount$no[i]  / split_Acount$Count_D[i]
      yes_rate[i] <- split_Acount$yes[i] / split_Acount$Count_D[i]
      D_j[i] <- -(no_rate[i] * log2(no_rate[i]) + yes_rate[i] * log2(yes_rate[i]))
      if (is.nan(D_j[i])) {   # a pure partition gives 0 * log2(0) = NaN; its entropy is 0
        D_j[i] <- 0
      }
      info_A <- info_A + p_A[i] * D_j[i]
    }
    return(info_A)
  }
  infoA_D(data, 1)   # Age
  infoA_D(data, 2)   # income
  infoA_D(data, 3)   # student
  infoA_D(data, 4)   # credit_rating
  # 3. Information gain of each attribute: Gain(A) = Info(D) - Info_A(D)
  age_Gain           <- info - infoA_D(data, 1)
  income_Gain        <- info - infoA_D(data, 2)
  student_Gain       <- info - infoA_D(data, 3)
  credit_rating_Gain <- info - infoA_D(data, 4)
  Gain_frame <- data.frame(age_Gain, income_Gain, student_Gain, credit_rating_Gain)
  # 4. Split the data set on the attribute with the largest information gain
  createTree <- function(gain_data, data) {
    # column index of the largest gain; Gain_frame columns are in the same order as data
    max_col <- max.col(gain_data)
    split_max <- levels(factor(data[, max_col]))   # values of the winning attribute (Age here)
    select <- list()
    for (i in 1:length(split_max)) {
      select[[i]] <- data[data[, max_col] == split_max[i], ]   # subset for each attribute value
    }
    return(select)
  }
  decision_tree <- createTree(Gain_frame, data)
  return(list(Gain_frame, decision_tree))
}
infoResult <- info(data)
infoResult
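The result is a two-element list: infoResult[[1]] is the data frame of information gains and infoResult[[2]] is the list of subsets produced by splitting on the winning attribute. A few ways to inspect it (a sketch, assuming the script above is run unchanged):

infoResult[[1]]                   # information gain of each attribute
infoResult[[2]][[1]]              # first subset produced by the split on Age
sapply(infoResult[[2]], nrow)     # number of samples in each subset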
V. Results:
The Age attribute has the largest information gain, so it is chosen as the splitting attribute, and the original data set is split into three subsets, one for each value of Age.
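As a hedged sanity check (the values below are worked out by hand from the example data set, not copied from the script output):

Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940
Info_Age(D) ≈ (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) ≈ 0.694   (entropies of the youth, middle_aged and senior partitions)
Gain(Age) = 0.940 - 0.694 ≈ 0.246, versus roughly 0.029 (income), 0.151 (student) and 0.048 (credit_rating),

so Age indeed wins, and the three subsets contain 5 (youth), 4 (middle_aged) and 5 (senior) samples.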