One post a week, to keep improving myself.
This blog post summarizes how the decision-tree algorithm is applied to my math data set. It builds on the CART algorithm covered in the previous post:
Address: https://my.oschina.net/wangzonghui/blog/1618690
Step 1: Data Analysis
The prepared data is a comma-separated txt text file, and the last column of each line is the class label.
First examine the data to decide which fields are useful and which are not, and clean the data in advance.
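As a rough sketch of what "cleaning in advance" can mean for a comma-separated file (the file paths and column count here are invented for illustration, not taken from the author's data), one simple pass drops any line with a missing field or the wrong number of columns:

```python
import csv

def clean_rows(in_path, out_path, expected_cols):
    """Copy only rows that have the expected number of
    non-empty fields; everything else is discarded."""
    kept, dropped = 0, 0
    with open(in_path, 'r', newline='') as fin, \
         open(out_path, 'w', newline='') as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        for row in reader:
            if len(row) == expected_cols and all(f.strip() for f in row):
                writer.writerow(row)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Returning the kept/dropped counts makes it easy to spot a data file where most lines are being thrown away.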
Step 2: Reading in the Data
When I wrote this I did not know Python well, and I could not find a good library for reading the data, so, following my Java programming habits, I wrote a reader myself.
def createSet(trainDataFile):
    dataSet = []
    labels = []
    try:
        fin = open(trainDataFile, 'r')
        num = 0
        for line in fin:
            line = line.strip('\n')
            if num == 0:
                # the first line holds the column label names
                labels = line.split(',')
            else:
                cols = line.split(',')
                # print len(cols)
                # columns 0-3 are unused: int(cols[0]),int(cols[1]),int(cols[2]),long(cols[3]),
                row = [int(cols[4]),float(cols[5]),float(cols[6]),int(cols[7]),int(cols[8]),float(cols[9]),
                    int(cols[10]),float(cols[11]),int(cols[12]),float(cols[13]),int(cols[14]),int(cols[15]),int(cols[16]),int(cols[17]),float(cols[18]),
                    int(cols[19]),float(cols[20]),int(cols[21]),int(cols[22]),int(cols[23]),float(cols[24]),int(cols[25]),int(cols[26]),int(cols[27]),float(cols[28]),
                    int(cols[29]),int(cols[30]),int(cols[31]),int(cols[32]),float(cols[33]),int(cols[34]),float(cols[35]),float(cols[36]),float(cols[37]),float(cols[38]),
                    int(cols[39]),int(cols[40])]
                dataSet.append(row)
            num += 1
        fin.close()
    except Exception as e:
        print 'Usage xxx.py trainDataFilePath'
        print e
    # drop the labels of the four unused leading columns; deleting them one
    # index at a time (del labels[0]; del labels[1]; ...) would shift the
    # remaining indices after every delete and remove the wrong labels
    del labels[0:4]
    return dataSet, labels
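For what it's worth, the standard library's csv module can take over the line splitting, and a small type table avoids the long hand-written conversion list. This is only a sketch of an alternative to createSet above, under my own assumptions: the helper name load_data is mine, and unlike the author's file I assume the header line names every column, including the class column.

```python
import csv

def load_data(path, type_map, skip_cols=4):
    """Read a comma-separated file whose first line is a header.
    type_map maps a column index (counted after skipping the unused
    leading columns) to a conversion function; unlisted columns
    default to float."""
    data_set, labels = [], []
    with open(path, 'r', newline='') as fin:
        reader = csv.reader(fin)
        labels = next(reader)[skip_cols:]   # header line
        for cols in reader:
            cols = cols[skip_cols:]
            row = [type_map.get(i, float)(v) for i, v in enumerate(cols)]
            data_set.append(row)
    return data_set, labels
```

Because the unused leading columns are skipped at read time, the labels list never has to be trimmed afterwards.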
Each field of the row is converted to the appropriate data type, which makes later processing easier. The first line of my text file holds the column label names.
To clarify: the header line has fewer fields than the data lines, because the last column (the class label) has no header name.
Step 3: Create a decision tree
First generate the data set, then call the CART algorithm to build the decision tree.
# generate the training data
dataSet, labels = createSet(url)
# copy the labels: createTree consumes the list it is given,
# so the original labels object must stay intact for testing later
labels_tmp = labels[:]
# build the decision tree
desicionTree = createTree(dataSet, labels_tmp)
Step 4: Store the decision tree
Save the generated decision tree so that it does not have to be rebuilt from the training data next time. When the service is deployed, the stored decision tree is loaded directly for use.
# save the decision tree
def storeTree(inputTree, filename):
    import pickle
    fw = open(filename, 'wb')
    pickle.dump(inputTree, fw)
    fw.close()
Step 5: Call the stored decision tree
# load the decision tree
def grabTree(filename):
    import pickle
    fr = open(filename, 'rb')
    tree = pickle.load(fr)
    fr.close()
    return tree
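As a quick sanity check of the save/load pair above, a round trip through pickle should give back a tree equal to the one stored. A minimal sketch (the nested-dict toy tree and the snake_case helper names are mine, not the author's; the real math tree is much larger):

```python
import pickle, tempfile, os

def store_tree(tree, filename):
    # same idea as storeTree above, with a with-block so the file is closed
    with open(filename, 'wb') as fw:
        pickle.dump(tree, fw)

def grab_tree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)

# toy nested-dict tree, invented for demonstration
toy_tree = {'featA': {0: 'no', 1: {'featB': {0: 'no', 1: 'yes'}}}}
path = os.path.join(tempfile.mkdtemp(), 'tree.txt')
store_tree(toy_tree, path)
assert grab_tree(path) == toy_tree  # the round trip preserves the tree
```

This is also a cheap regression test to keep around: if the tree format ever changes, the assertion catches a broken store/load pair immediately.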
Step 6: Test Data Validation
Validate the decision tree with test data.
def createTestSet():
    testSet = [[0,0.0,0.0,13,3,15,3,88.89,15,100.0,0,0,0,0,0.0,0,0.0,0,6,0,0.0,0,0,0,0.0,2215,5818,0,0,0.0,0,4.4,4.4,4.4,0.01,14,0]]
    return testSet
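The calssifyAll call used in the full listing comes from the CART code in the previous post, which is not reproduced here. For readers without it, classifying one sample against the usual nested-dict tree representation ({featureName: {featureValue: subtree-or-class}}) can be sketched as follows; classify_one, the toy tree, and its feature names are my own illustration, not the author's API:

```python
def classify_one(tree, feat_labels, sample):
    """Walk a nested-dict decision tree for one sample.
    Assumes each internal node looks like {featureName: {value: subtree-or-class}}."""
    node = tree
    while isinstance(node, dict):
        feat_name = next(iter(node))              # the feature tested at this node
        feat_index = feat_labels.index(feat_name) # position of that feature in the sample
        node = node[feat_name][sample[feat_index]]
    return node                                   # a leaf: the predicted class

# toy example with two binary features
toy_tree = {'flippers': {0: 'no', 1: {'no surfacing': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify_one(toy_tree, labels, [1, 1]))  # -> yes
```

A classify-all over a test set is then just this function applied to each row.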
The full calling code is as follows:
def main():
    url = "F:\\input\\eciMath.txt"
    treeUrl = "F:\\input\\tree\\tree-math.txt"
    # dataSet,labels=test(url)
    dataSet, labels = createSet(url)
    # print "start building the decision tree"
    labels_tmp = labels[:]  # copy the labels
    desicionTree = createTree(dataSet, labels_tmp)
    # save the decision tree
    storeTree(desicionTree, treeUrl)
    print desicionTree
    print "finished building the decision tree"
    import showTree as show
    show.createPlot(desicionTree)
    print 'desicionTree:', desicionTree
    desicionTree = grabTree(treeUrl)
    testSet = createTestSet()
    print 'classifyResult:', calssifyAll(desicionTree, labels, testSet)

if __name__ == '__main__':
    main()
In my experience this runs slowly: with about 10,000 lines of test data, my machine with 8 GB of memory nearly ground to a halt. The implementation is good enough for study, but it is not practical for production use.