Machine Learning in C# - Should I Take the Job - Using Decision Trees

Decision tree

       For a decision tree to be complete and effective, it must cover all possibilities. The sequences of events must also be provided, and they must be mutually exclusive, meaning that if one event occurs, the others cannot.

       Decision trees are a form of supervised machine learning, because we have to provide both the inputs and the expected outputs. A tree consists of decision nodes and leaves. A leaf is a decision, final or not, and a node is the point where a decision splits.
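To make that distinction concrete, here is a minimal sketch (my own illustration, not code from any particular library) of how a node and a leaf might be represented in C#:

    using System.Collections.Generic;

    // A minimal, hypothetical representation of a decision tree node.
    // An internal node splits on an attribute; a node with no children
    // acts as a leaf and simply holds a decision.
    class TreeNode
    {
        public string Attribute;   // the variable this node splits on (null for a leaf)
        public string Decision;    // the decision held by a leaf (null for an internal node)
        public Dictionary<string, TreeNode> Children = new Dictionary<string, TreeNode>();

        public bool IsLeaf => Children.Count == 0;
    }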

       While there are many algorithms available to us, we will use the Iterative Dichotomiser 3 (ID3) algorithm.

At each recursive step, the attribute that best classifies the set of inputs we are processing is selected according to a criterion such as information gain or gain ratio.
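To make the selection criterion concrete, here is a minimal sketch (my own illustration, not the chapter's code) of computing entropy and information gain for a single categorical attribute:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class InformationGain
    {
        // Shannon entropy of a set of class labels: -sum(p * log2(p))
        static double Entropy(IEnumerable<string> labels)
        {
            var list = labels.ToList();
            return list.GroupBy(l => l)
                       .Select(g => (double)g.Count() / list.Count)
                       .Sum(p => -p * Math.Log(p, 2));
        }

        // Information gain of splitting (attribute, label) rows on the attribute:
        // entropy before the split minus the weighted entropy of each branch.
        static double Gain(IList<(string Attribute, string Label)> rows)
        {
            double before = Entropy(rows.Select(r => r.Label));
            double after = rows.GroupBy(r => r.Attribute)
                               .Sum(g => (double)g.Count() / rows.Count *
                                         Entropy(g.Select(r => r.Label)));
            return before - after;
        }

        static void Main()
        {
            // Toy data: the attribute perfectly separates the two classes,
            // so the gain equals the full entropy of the labels (1 bit).
            var rows = new[] { ("a", "yes"), ("a", "yes"), ("b", "no"), ("b", "no") };
            Console.WriteLine(Gain(rows)); // prints 1
        }
    }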

It must be pointed out here that no matter which algorithm we use, none of them is guaranteed to produce the smallest possible tree, and this directly affects the performance of the resulting model.

Remember that with decision trees, learning is based only on heuristics, not on true optimization criteria. Let's explore this further with an example.

The following example, taken from http://jmlr.csail.mit.edu/papers/volume8/esmeir07a/esmeir07a.pdf, demonstrates learning the XOR concept, which all of us developers are (or should be) familiar with. Here the attributes a3 and a4 are completely irrelevant to the problem we are trying to solve; they have no effect on the answer. Even so, the ID3 algorithm will choose one of them when building the tree, and in fact it will use a4 as the root node! Remember, this is heuristic learning by the algorithm, not an optimized result:

I hope this picture makes it easier to understand what was just said. Our goal is not to delve into the mechanics and theory of decision trees, but into how to use them. Despite their many problems, decision trees are still the basis of many algorithms, especially those that require a human-readable description of the results. They were also the basis of our earlier face detection algorithm.

     Decision node

A node of a decision tree. Each node may or may not have associated child nodes.

     Decision variables

       This object defines the properties of each decision variable that the tree and its nodes can process. Values can be ranges, continuous, or discrete.

     Collection of decision branch nodes

       This collection contains groups of one or more decision nodes, together with additional information about the decision variables used for comparison.

       Below is an example of a decision tree for determining financial risk. We simply navigate from node to node, deciding which way to go at each split, until we reach the final answer. In this case, someone is applying for a loan and we need to make a decision about their creditworthiness; a decision tree is a great way to solve this kind of problem:

Should I take this job?

       You have just been offered a new job, and you need to decide whether to take it. There are some important things to consider, so we will use them as input variables, or features, for our decision tree.

What matters most to you? Salary, benefits, company culture, and, of course, whether you can work from home.

Instead of loading data from disk, we'll create an in-memory database and add the features that way. We will create the DataTable and define its columns.

After this, we will load several rows of data, each with a different set of features, with the last column holding Yes or No as our final decision.
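The book shows this step as an image, so here is a sketch of what the in-memory table might look like; the column names and rows below are my assumption, not the original values:

    using System.Data;

    // Hypothetical feature table; the names and rows are illustrative.
    DataTable data = new DataTable("Should I Take The Job");
    data.Columns.Add("Salary");        // High, Average, Low
    data.Columns.Add("Benefits");      // Good, Poor
    data.Columns.Add("Culture");       // Good, Bad
    data.Columns.Add("WorkFromHome");  // Yes, No
    data.Columns.Add("TakeJob");       // the final decision: Yes or No

    data.Rows.Add("High",    "Good", "Good", "Yes", "Yes");
    data.Rows.Add("Average", "Good", "Bad",  "No",  "No");
    data.Rows.Add("Low",     "Poor", "Bad",  "No",  "No");
    data.Rows.Add("High",    "Poor", "Good", "Yes", "Yes");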

Once all the data has been created and put into the table, we need to convert the features into a representation the computer can understand.

Since numbers are simplest, we will convert our features (categories) into a codebook through a process called encoding. The codebook effectively converts each value into an integer.

Note that we will pass our data table as the input.
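The encoding code appeared as an image; assuming the Accord.NET framework used later in this chapter and the hypothetical table from the previous sketch, the step might look like this with the Codification filter:

    using System.Data;
    using Accord.Math;
    using Accord.Statistics.Filters;

    // Build a codebook that maps each categorical value to an integer.
    Codification codebook = new Codification(data);

    // Translate the table into its integer representation.
    DataTable symbols = codebook.Apply(data);

    // Extract the encoded inputs and the final decision column.
    int[][] inputs = symbols.ToJagged<int>("Salary", "Benefits", "Culture", "WorkFromHome");
    int[] outputs = symbols.ToArray<int>("TakeJob");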

 

Next, we need to create the decision variables to use for the decision tree.

This tree will help us decide whether to accept the new job offer. There will be several categories of inputs, which we'll specify in the decision variable array, along with the two possible decisions, yes or no.

The DecisionVariable array will hold the name of each category and the total number of possible values for that category. For example, the salary category has three possible values: high, average, or low, so we specify the category name and the number 3. We then repeat this step for all but the last category (i.e., our decision).
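A sketch of that array, again assuming the hypothetical categories from earlier:

    using Accord.MachineLearning.DecisionTrees;

    // One entry per input category with its number of possible values;
    // the final Yes/No decision is deliberately not listed here.
    DecisionVariable[] attributes =
    {
        new DecisionVariable("Salary", 3),        // High, Average, Low
        new DecisionVariable("Benefits", 2),      // Good, Poor
        new DecisionVariable("Culture", 2),       // Good, Bad
        new DecisionVariable("WorkFromHome", 2),  // Yes, No
    };

    int classCount = 2; // the two possible decisions: Yes or No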

Now that we have created the decision tree, we must teach it how to solve the problem at hand. To do this, we create a learning algorithm for the tree. Since we have only categorical values in this example, the ID3 algorithm is the easiest choice.
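With Accord.NET, creating and running the ID3 learner might look like this sketch, building on the variables, inputs, and outputs above:

    using Accord.MachineLearning.DecisionTrees;
    using Accord.MachineLearning.DecisionTrees.Learning;

    // Create the ID3 learning algorithm over our decision variables
    // and learn a tree from the encoded inputs and outputs.
    var id3 = new ID3Learning(attributes);
    DecisionTree tree = id3.Learn(inputs, outputs);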

Once the learning algorithm has run, the tree is trained and ready for use. We simply feed it a sample data point so it can give us an answer. In this case: the salary is good, the company culture is good, the benefits are good, and I can work from home. If the decision tree was trained correctly, the answer will be yes.
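A sketch of that query, reusing the hypothetical codebook, tree, and column names from the earlier sketches:

    // Encode the sample offer, ask the tree to decide, and translate
    // the integer answer back into a string.
    int[] query = codebook.Transform(new[,]
    {
        { "Salary",       "High" },
        { "Benefits",     "Good" },
        { "Culture",      "Good" },
        { "WorkFromHome", "Yes"  }
    });

    int decision = tree.Decide(query);
    string answer = codebook.Revert("TakeJob", decision); // "Yes"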

numl

numl is a well-known open source machine learning toolkit. Like most machine learning frameworks, many of its examples use the Iris dataset, including the one we'll use for our decision tree.

Here is an example of our numl output:

Let's look at the code behind this example:

        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
            var description = Descriptor.Create<Iris>();
            Console.WriteLine(description);
            var generator = new DecisionTreeGenerator();
            var data = Iris.Load();
            var model = generator.Generate(description, data);
            Console.WriteLine("生成的模型:");
            Console.WriteLine(model);
            Console.ReadKey();
        }

The approach is not complicated, right? That's the beauty of using numl in your application; it's very easy to use and integrate.

The above code creates the descriptor and DecisionTreeGenerator, loads the Iris dataset, and generates the model. Here's just an example of the data being loaded:

        public static Iris[] Load()
        {
            return new Iris[]
            {
                new Iris { SepalLength = 5.1m, SepalWidth = 3.5m, PetalLength = 1.4m, PetalWidth = 0.2m, Class = "Iris-setosa" },
                new Iris { SepalLength = 4.9m, SepalWidth = 3m, PetalLength = 1.4m, PetalWidth = 0.2m, Class = "Iris-setosa" },
                new Iris { SepalLength = 4.7m, SepalWidth = 3.2m, PetalLength = 1.3m, PetalWidth = 0.2m, Class = "Iris-setosa" },
                new Iris { SepalLength = 4.6m, SepalWidth = 3.1m, PetalLength = 1.5m, PetalWidth = 0.2m, Class = "Iris-setosa" },
                new Iris { SepalLength = 5m, SepalWidth = 3.6m, PetalLength = 1.4m, PetalWidth = 0.2m, Class = "Iris-setosa" },
                new Iris { SepalLength = 5.4m, SepalWidth = 3.9m, PetalLength = 1.7m, PetalWidth = 0.4m, Class = "Iris-setosa" }
            };
        }

Accord.NET Decision Tree

The Accord.NET framework also has its own decision tree example. It takes a different, more graphical approach to decision trees; you can decide which style you prefer and are most comfortable with.

       Once the data is loaded, you can create the decision tree and prepare it for learning. You'll see a plot of the data similar to the following, using the X and Y categories:

The next tab lets you view the tree nodes, leaves, and decisions. There is also a graphical top-down view of the tree on the right. The most useful information is in the tree view on the left, where you can see the nodes, their values, and the decisions made:

Finally, the last tab will allow you to perform model tests:

 

The code

The following is the learning and testing code:

            // Create a matrix from the entire source data table
            double[][] table = (dgvLearningSource.DataSource as DataTable).ToJagged(out string[] columnNames);

            // Get only the input vector values (first two columns)
            double[][] inputs = table.GetColumns(0, 1);

            // Get the expected output labels (last column)
            int[] outputs = table.GetColumn(2).ToInt32();

            // Specify the input variables
            DecisionVariable[] variables =
            {
                new DecisionVariable("x", DecisionVariableKind.Continuous),
                new DecisionVariable("y", DecisionVariableKind.Continuous),
            };

            // Create the C4.5 learning algorithm
            var c45 = new C45Learning(variables);

            // Learn the decision tree using C4.5
            tree = c45.Learn(inputs, outputs);

            // Show the learned tree in the view
            decisionTreeView1.TreeSource = tree;

            // Get the range of each variable (X and Y)
            DoubleRange[] ranges = table.GetRange(0);

            // Generate a Cartesian coordinate grid
            double[][] map = Matrix.Mesh(ranges[0], 200, ranges[1], 200);

            // Classify each point in the Cartesian grid
            double[,] surface = map.ToMatrix().InsertColumn(tree.Decide(map));
            CreateScatterplot(zedGraphControl2, surface);

            // Testing: compute the actual tree outputs
            int[] actual = tree.Decide(inputs);

            // Use a confusion matrix to compute some statistics
            ConfusionMatrix confusionMatrix = new ConfusionMatrix(actual, outputs, 1, 0);
            dgvPerformance.DataSource = new[] { confusionMatrix };

            CreateResultScatterplot(zedGraphControl1, inputs, outputs.ToDouble(), actual.ToDouble());

These values are then fed into a confusion matrix. For those of you who are not familiar with confusion matrices, let me briefly explain.

Confusion matrix

A confusion matrix is a table used to describe the performance of a classification model. It operates on a test dataset whose true values are known. That is how we arrive at terms like the following.

True positive

In this case, we predicted yes, and the actual value was also yes.

True negative

In this case, we predicted no, and the actual value was also no.

False positive

In this case, we predicted yes, but the actual value was no. You may sometimes see this referred to as a Type I error.

False negative

In this case, we predicted no, but the actual value was yes. You may sometimes see this referred to as a Type II error.

Now, having said all that, we need to talk about two other important terms: precision and recall.

Let's describe them like this. Suppose it has rained every day for the past week: that's 7 days out of 7. Very simple. A week later, you are asked how often it rained last week.

Recall

Recall is the ratio of the number of days you correctly recalled as rainy to the total number of days it actually rained. If you say it rained all 7 days, your recall is 100%. If you say it rained on 4 days, your recall is 57%. In that case your recall is imperfect, which is why we also need precision.

Precision

Precision is the ratio of the number of days you correctly recalled as rainy to the total number of days you claimed it rained.

For us, if our machine learning algorithm is good at recall, that doesn't necessarily mean it is good at precision. Makes sense? This leads into other measures, such as the F1 score, which we'll leave for a later date.
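To tie the rain example to code, here is a small sketch (mine, not the chapter's) that computes recall, precision, and the F1 score from raw counts:

    using System;

    class PrecisionRecallDemo
    {
        static void Main()
        {
            // It actually rained on 7 days; suppose we claimed 4 rainy days,
            // and all 4 of those claims were correct.
            int truePositives = 4;   // days we said "rain" and it rained
            int falsePositives = 0;  // days we said "rain" and it didn't
            int falseNegatives = 3;  // rainy days we missed

            double recall = (double)truePositives / (truePositives + falseNegatives);
            double precision = (double)truePositives / (truePositives + falsePositives);
            double f1 = 2 * precision * recall / (precision + recall);

            Console.WriteLine($"Recall: {recall:P0}, Precision: {precision:P0}, F1: {f1:F2}");
            // e.g. Recall: 57%, Precision: 100%, F1: 0.73
        }
    }

Note how a recall of only 57% can coexist with a precision of 100%: the two measures capture different kinds of mistakes.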

Visualize error types

Here are some visualizations that might help:

Identify true positives and false negatives:

After calculating the statistics using the confusion matrix, create a scatterplot, identifying everything:

Summary

In this chapter, we spent a lot of time looking at decision trees: what they are, how we can use them, and how they can benefit our applications. In the next chapter, we will enter the world of deep belief networks (DBNs): what they are and how we can use them.

We'll even talk a bit about what a computer dreams about, when it dreams!
