Python decision tree with the CART algorithm: analysis of real data

Do it once a week to improve yourself.

This blog post summarizes the algorithm work so far. First, the CART algorithm from the previous post is run end to end.

Address: https://my.oschina.net/wangzonghui/blog/1618690

Step 1: Data Analysis

The prepared data is a comma-separated txt file, and the last column of each line is the class of that record.

Look at the data first: decide which fields are valid and which are useless, and clean the data in advance.
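
As a rough illustration of that cleaning step, the sketch below drops lines that have the wrong number of fields or an empty field. The function name `clean_lines`, the sample lines, and the two rules are assumptions made for the example; the real rules depend on the data at hand.

```python
def clean_lines(lines, expected_fields):
    """Keep only lines that have the expected number of non-empty fields."""
    kept = []
    for line in lines:
        cols = [c.strip() for c in line.strip().split(',')]
        if len(cols) != expected_fields:
            continue  # wrong number of columns
        if any(c == '' for c in cols):
            continue  # at least one missing value
        kept.append(','.join(cols))
    return kept

# Hypothetical sample: the second line has a missing value, the third is short.
raw = ['1,2.5,0', '3,,1', '4,5.0', ' 6, 7.5 ,1 ']
cleaned = clean_lines(raw, 3)
print(cleaned)
```

Running the cleaning pass once and writing the result to a new file keeps the reading code in the next step simple.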

Step 2: Data read in

When I wrote the reading code I did not know Python well and had not found a good library for processing the data, so I wrote my own reader, following my Java programming habits.

def createSet(trainDataFile):
	# The first line of the file holds the column names; every other line is a data row.
	dataSet=[]
	labels=[]
	try:
		# 'with' closes the file even if a conversion below fails
		with open(trainDataFile,'r') as fin:
			for num,line in enumerate(fin):
				cols=line.strip('\n').split(',')
				if num==0:
					labels=cols
				else:
					# print(len(cols))
					# Columns 0-3 are skipped: int(cols[0]),int(cols[1]),int(cols[2]),long(cols[3])
					row =[int(cols[4]),float(cols[5]),float(cols[6]),int(cols[7]),int(cols[8]),float(cols[9]),
						int(cols[10]),float(cols[11]),int(cols[12]),float(cols[13]),int(cols[14]),int(cols[15]),int(cols[16]),int(cols[17]),float(cols[18]),
						int(cols[19]),float(cols[20]),int(cols[21]),int(cols[22]),int(cols[23]),float(cols[24]),int(cols[25]),int(cols[26]),int(cols[27]),float(cols[28]),
						int(cols[29]),int(cols[30]),int(cols[31]),int(cols[32]),float(cols[33]),int(cols[34]),float(cols[35]),float(cols[36]),float(cols[37]),float(cols[38]),
						int(cols[39]),int(cols[40])]
					dataSet.append(row)
	except Exception as e:
		print('Usage xxx.py trainDataFilePath')
		print(e)

	# Remove the labels of the four unused leading columns.
	# (Four separate 'del labels[i]' calls would shift the list and delete the wrong entries.)
	del labels[0:4]
	return dataSet,labels

Each field stored in the row object is converted to the data type that suits its column, which makes later processing easier. The first line of my text file holds the label names of the columns.

Note that there is one fewer label name than there are data fields, because the last column of each row is the classification.
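
The per-column conversion can also be written with a list of converters and the standard `csv` module, which avoids one long hand-written row expression. This is only a sketch: `parse_rows`, `CONVERTERS`, `SKIP`, and the tiny in-memory sample are hypothetical names and data chosen for the example, not the real 41-column file.

```python
import csv
import io

# Hypothetical converter list: one callable per kept column, in column order.
CONVERTERS = [int, float, float, int]
SKIP = 2  # hypothetical number of leading columns to ignore

def parse_rows(fileobj, converters, skip):
    """Return (dataSet, labels) from a comma-separated file object."""
    reader = csv.reader(fileobj)
    header = next(reader)       # first line: column names
    labels = header[skip:]      # names of the kept feature columns
    dataSet = []
    for cols in reader:
        if not cols:
            continue            # ignore blank lines
        kept = cols[skip:]
        dataSet.append([conv(value) for conv, value in zip(converters, kept)])
    return dataSet, labels

# The header has no name for the last (class) column, as in the post.
sample = io.StringIO(u"id,name,a,b,c\n1,x,3,0.5,1.5,0\n2,y,4,2.0,2.5,1\n")
dataSet, labels = parse_rows(sample, CONVERTERS, SKIP)
print(dataSet)
print(labels)
```

With this layout, changing a column type means editing one entry in the converter list.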

Step 3: Create a decision tree
First build the training data, then call the CART algorithm to generate the decision tree.

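# Build the training data
dataSet,labels=createSet(url)
# Copy the labels; otherwise the original label list is no longer usable when the tree is tested
labels_tmp=labels[:]
# Build the decision tree
desicionTree = createTree(dataSet,labels_tmp)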
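
The `createTree` function comes from the previous post and is not repeated here. For orientation, CART classification trees usually choose splits by Gini impurity; the sketch below shows that measure on a tiny made-up data set. The helper names `gini` and `split_gini` and the sample rows are assumptions for illustration and are not taken from the original `createTree`.

```python
from collections import Counter

def gini(rows):
    """Gini impurity of rows whose last element is the class."""
    total = len(rows)
    if total == 0:
        return 0.0
    counts = Counter(row[-1] for row in rows)
    return 1.0 - sum((n / float(total)) ** 2 for n in counts.values())

def split_gini(rows, index, threshold):
    """Weighted Gini impurity after splitting on rows[index] <= threshold."""
    left = [r for r in rows if r[index] <= threshold]
    right = [r for r in rows if r[index] > threshold]
    total = float(len(rows))
    return len(left) / total * gini(left) + len(right) / total * gini(right)

# Hypothetical rows: one feature followed by the class.
rows = [[1.0, 0], [2.0, 0], [3.0, 1], [4.0, 1]]
print(gini(rows))                 # impurity before any split
print(split_gini(rows, 0, 2.0))   # this threshold separates the classes
print(split_gini(rows, 0, 1.0))   # a worse threshold
```

A tree builder would evaluate many candidate thresholds per feature and keep the one with the lowest weighted impurity.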

Step 4: Store the decision tree

Save the generated decision tree so that it does not have to be rebuilt from the training data next time. When the service is deployed, it can load the stored tree directly.

# Save the decision tree
def storeTree(inputTree,filename):
	import pickle
	# 'with' closes the file even if pickling fails
	with open(filename,'wb') as fw:
		pickle.dump(inputTree,fw)

Step 5: Call the stored decision tree

# Load the decision tree
def grabTree(filename):
	import pickle
	# 'with' closes the file after loading
	with open(filename,'rb') as fr:
		return pickle.load(fr)
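
A quick round trip is an easy way to check that storing and loading work together. The sketch below uses a temporary directory and a small made-up tree; the dictionary layout and the names `store_tree` and `grab_tree` are assumptions for the example, since the real tree structure comes from `createTree` in the previous post.

```python
import os
import pickle
import tempfile

def store_tree(tree, filename):
    """Write the tree to disk with pickle."""
    with open(filename, 'wb') as fw:
        pickle.dump(tree, fw)

def grab_tree(filename):
    """Read a pickled tree back from disk."""
    with open(filename, 'rb') as fr:
        return pickle.load(fr)

# Hypothetical tree layout, used only for this check.
tree = {'feature': 'a', 'threshold': 2.5, 'left': 0, 'right': 1}
path = os.path.join(tempfile.mkdtemp(), 'tree.pkl')
store_tree(tree, path)
restored = grab_tree(path)
print(restored == tree)
```

Because unpickling can execute arbitrary code, a deployed service should only load tree files that it wrote itself.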

Step 6: Test Data Validation

Validate decision trees with test data

def createTestSet():
	testSet=[[0,0.0,0.0,13,3,15,3,88.89,15,100.0,0,0,0,0,0.0,0,0.0,0,6,0,0.0,0,0,0,0.0,2215,5818,0,0,0.0,0,4.4,4.4,4.4,0.01,14,0]]
	return testSet
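
The `calssifyAll` function used in the calling code is defined in the previous post. To show the general idea, the sketch below classifies samples by walking a tree stored as nested dictionaries. The node layout (`feature`, `threshold`, `left`, `right`) and the names `classify` and `classify_all` are hypothetical and may differ from what `createTree` actually produces.

```python
def classify(tree, labels, sample):
    """Walk the tree until a leaf (a non-dict value) is reached."""
    node = tree
    while isinstance(node, dict):
        index = labels.index(node['feature'])   # column of the tested feature
        if sample[index] <= node['threshold']:
            node = node['left']
        else:
            node = node['right']
    return node

def classify_all(tree, labels, samples):
    """Classify every sample in a test set."""
    return [classify(tree, labels, s) for s in samples]

# Hypothetical two-feature tree and test samples.
labels = ['a', 'b']
tree = {'feature': 'a', 'threshold': 2.5,
        'left': 0,
        'right': {'feature': 'b', 'threshold': 10.0, 'left': 1, 'right': 0}}
results = classify_all(tree, labels, [[1.0, 50.0], [3.0, 5.0], [3.0, 20.0]])
print(results)
```

The label list passed in must be the original, unmodified one, which is why the calling code builds the tree from a copy of it.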

The complete calling code is as follows:

def main():
	url="F:\\input\\eciMath.txt"
	treeUrl="F:\\input\\tree\\tree-math.txt"
	# dataSet,labels=test(url)

	dataSet,labels=createSet(url)

	# print("Start creating the decision tree")
	labels_tmp=labels[:] # copy the labels
	desicionTree = createTree(dataSet,labels_tmp)
	# Save the decision tree
	storeTree(desicionTree,treeUrl)
	print(desicionTree)
	print("Finished creating the decision tree")
	import showTree as show
	show.createPlot(desicionTree)
	print('desicionTree: ' + str(desicionTree))

	# Reload the stored tree and classify the test data
	desicionTree=grabTree(treeUrl)
	testSet=createTestSet()
	print('classifyResult: ' + str(calssifyAll(desicionTree,labels,testSet)))


if __name__ == '__main__':
	main()

The computation runs slowly: with 10,000 lines of data, my computer with 8 GB of memory almost froze. At this level of efficiency the code is fine for study, but it is not practical.
