
Author: Zen and the Art of Computer Programming

1 Introduction

  Machine learning is a field that studies how to make computer systems "learn" from experience and improve their behavior, so that they can solve problems autonomously. Machine learning systems acquire new knowledge automatically through learning from experience, inductive analysis, or model-based reasoning. Machine learning is now applied in many fields, including search engines, image recognition, speech recognition, recommendation systems, network security, bioinformatics, and finance.

  This article introduces a classic machine learning algorithm: the decision tree. A decision tree is a classifier with a branching structure; it partitions a multi-dimensional feature space, assigns the instances in a data set to different leaf nodes, and thereby forms a series of judgment rules. The classic ID3 decision tree algorithm was proposed by Ross Quinlan in 1986, and decision trees have since been widely used in classification, regression, and related pattern-recognition tasks. Decision trees have many advantages: they are easy to interpret, require few hyperparameters, are relatively insensitive to outliers, can cope with irrelevant features and high-dimensional data, and are well suited to classification problems.
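To make the idea of "a series of judgment rules" concrete, a trained decision tree behaves like nested if/else tests on feature values, with each leaf returning a class label. A minimal hand-written sketch (the feature names and thresholds are invented for illustration, loosely echoing the iris data set):

```python
# A tiny hand-written "decision tree": each internal node tests one
# feature, each branch follows a test outcome, each leaf is a class label.
def classify(sample):
    # sample is a dict of feature values; names/thresholds are hypothetical
    if sample["petal_length"] < 2.5:
        return "setosa"          # leaf node
    elif sample["petal_width"] < 1.8:
        return "versicolor"      # leaf node
    else:
        return "virginica"       # leaf node

print(classify({"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(classify({"petal_length": 5.0, "petal_width": 2.1}))  # virginica
```

A learning algorithm's job is precisely to discover such tests and thresholds from data instead of hand-writing them.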

  In the following chapters, I will introduce the basic concepts, terminology, and principles of the decision tree algorithm one by one, together with working code examples and their execution. Finally, I will discuss the future directions and current open problems of decision trees. I hope readers will read patiently and offer valuable comments.

2 Explanation of basic concepts and terms

  Given a training data set, the decision tree algorithm generates a tree-structured model that represents a conditional probability distribution. Building the tree is a process of recursive top-down division: at each step, the data set is split according to some feature. If a feature's different values lead to clearly different class distributions, the data are split on that feature and each resulting subset is divided further; otherwise, the optimal split point among the candidates is chosen as the splitting point.
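The recursive division described above can be sketched in ID3 style: at each node, measure how much each candidate feature reduces label entropy (information gain), split on the best one, and recurse on each subset until a node is pure or no features remain. This is a minimal sketch assuming categorical features; all names and the data layout are illustrative, not the author's implementation:

```python
from collections import Counter
import math

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_feature(rows, labels, features):
    # pick the feature whose split yields the largest information gain
    base = entropy(labels)
    def gain(f):
        remainder = 0.0
        for v in set(r[f] for r in rows):
            sub = [y for r, y in zip(rows, labels) if r[f] == v]
            remainder += len(sub) / len(labels) * entropy(sub)
        return base - remainder
    return max(features, key=gain)

def build(rows, labels, features):
    if len(set(labels)) == 1:        # pure node -> leaf with that label
        return labels[0]
    if not features:                 # no features left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    f = best_feature(rows, labels, features)
    subtree = {}
    for v in set(r[f] for r in rows):      # one branch per feature value
        idx = [i for i, r in enumerate(rows) if r[f] == v]
        subtree[v] = build([rows[i] for i in idx],
                           [labels[i] for i in idx],
                           [g for g in features if g != f])
    return {f: subtree}

rows = [{"windy": "no"}, {"windy": "no"}, {"windy": "yes"}]
labels = ["play", "play", "stay"]
print(build(rows, labels, ["windy"]))
```

The nested-dict result maps each tested feature to its branches, mirroring the tree's judgment rules.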

  Suppose a training data set T = {(x1, y1), (x2, y2), ..., (xn, yn)} is given, where xi ∈ X is the input variable (called the feature) and yi ∈ Y is the output variable (called the label). The goal of a decision tree is to learn a model representing the conditional probability distribution p(y|x) of the label given the features.
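A concrete reading of this conditional distribution: for the training examples routed to a given leaf, the empirical class frequencies at that leaf estimate p(y|x) for any x that reaches it. A small sketch (the labels below are invented for illustration):

```python
from collections import Counter

# labels of the training examples that landed in one leaf node (invented data)
leaf_labels = ["yes", "yes", "yes", "no"]

counts = Counter(leaf_labels)
n = len(leaf_labels)
p = {y: c / n for y, c in counts.items()}  # empirical p(y | x reaches this leaf)
print(p)  # {'yes': 0.75, 'no': 0.25}
```

Predicting the most probable label at the leaf (here "yes") is exactly the majority-vote rule that decision tree classifiers use.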


Origin blog.csdn.net/universsky2015/article/details/132288985