MATLAB Machine Learning 1 (Machine Learning Onramp)

Typical workflow

Using a project on handwritten letters as an example, this post walks through the typical MATLAB machine learning workflow.

Import Data


Handwritten letters are stored as separate text files. Each file is comma-separated and contains four columns: timestamp, horizontal position of the pen, vertical position of the pen, and pen pressure. The timestamp is the time (in milliseconds) elapsed since the start of data collection. The other variables are in normalized units (0 to 1). For the pen position, 0 represents the bottom and left edges of the writing surface, and 1 represents the top and right edges.

letter = readtable('J.txt')	% read the four comma-separated columns of J.txt into a table
plot(letter.X,letter.Y)	% plot the pen trace
axis equal	% use the same scale on both axes


Data processing

The pen positions are recorded in normalized units (0 to 1), but the tablet used to record the data is not square: a vertical distance of 1 corresponds to 10 inches, while the same horizontal distance corresponds to 15 inches. To correct for this, the horizontal coordinate should be rescaled to the range [0 1.5] instead of [0 1].

letter = readtable("M.txt")
letter.X = 1.5*letter.X;	% rescale the horizontal coordinate of the letter M
plot(letter.X,letter.Y)
axis equal


The timestamp records the time elapsed since data collection began, but what we care about is the pen-tip position as a function of time (in seconds) from the start of writing.

letter.Time = letter.Time - letter.Time(1)	% shift so time starts at zero
letter.Time = letter.Time/1000	% convert milliseconds to seconds

plot(letter.Time,letter.X)
plot(letter.Time,letter.Y)


Calculating Features

Which aspects of these letters can be used to distinguish a J from an M or a V? Rather than using the raw signal, the goal is to distill the entire signal into a few simple, informative values. These values are called features.
For the letters J and M, a simple feature might be the aspect ratio (the height of the letter relative to its width). A J is likely tall and narrow, while an M is more nearly square.
Compared with J and M, a V is quick to write, so the duration of the signal may also be a distinguishing feature.

letter = readtable("M.txt");
letter.X = letter.X*1.5;	% rescale the horizontal coordinate
letter.Time = (letter.Time - letter.Time(1))/1000	% shift to start at zero and convert to seconds
plot(letter.X,letter.Y)
axis equal

dur = letter.Time(end) % duration of writing the letter, in seconds
aratio = range(letter.Y)/range(letter.X) % aspect ratio: height relative to width

Viewing Features

We need to associate each set of features with its letter, that is, record which class each feature vector belongs to.

load featuredata.mat
features % featuredata.mat contains a table, features, with three columns and 470 rows of data


% gscatter makes a grouped scatter plot: a scatter plot whose points are colored by a grouping variable
gscatter(features.AspectRatio,features.Duration,features.Character)


Modeling

A classification model divides the predictor space into regions, and each region is assigned an output class. In this simple example with two predictors, the regions can be seen on the plane.
There is no single "correct" way to divide the plane into classes J, M, and V; different classification algorithms produce different divisions.
A simple way to classify an observation is to assign it the same class as the nearest known example. This is called a k-nearest neighbors (kNN) model. You can fit a kNN model by passing a table of data to the fitcknn function.

load featuredata.mat
features	% training data used to build the model
testdata	% test data used to evaluate the model

% The second input names the response (class) variable. The output is a variable containing the fitted model.
knnmodel = fitcknn(features,"Character") 
% Use the trained model to predict classes for the test data
predictions = predict(knnmodel,testdata)

After building a model from the data, you can use it to classify new observations. This requires only computing the features of the new observation and determining which region of the predictor space they fall in.
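As a quick sketch, a new observation's features can be put into a one-row table with the same variable names as the training data and passed to predict. The feature values below are made up for illustration, and knnmodel is assumed to be the model fitted above.

```matlab
% Hypothetical features for a newly written letter (illustrative values only)
newobs = table(0.4, 1.2, 'VariableNames', {'AspectRatio','Duration'});
% Classify it with the previously fitted kNN model
prediction = predict(knnmodel, newobs)
```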

testdata contains observations whose correct class is known (stored in the Character variable), so the model can be tested by comparing its predicted classes with the true classes.
By default, fitcknn fits a kNN model with k = 1; that is, the model classifies a given observation using only the single nearest known example. This makes the model sensitive to any outliers in the training data: new observations near an outlier are likely to be misclassified. Increasing k (i.e., using the most common class among several nearest neighbors) makes the model less sensitive to the specific observations in the training data, which usually improves overall performance. However, performance on any particular test set still depends on the specific observations in that set.
The "NumNeighbors" name-value argument of fitcknn specifies k.

Evaluating the Model

How good is the kNN model? The table testdata contains the known classes of the test observations. You can compare these known classes with the kNN model's predictions to see how the model performs on new data.

Calculate the misclassification rate, and use the confusionchart function to visualize the model's performance.

load featuredata.mat
testdata
knnmodel = fitcknn(features,"Character","NumNeighbors",5);
predictions = predict(knnmodel,testdata)

iscorrect = predictions == testdata.Character	% logical vector: true where the prediction matches
accuracy = sum(iscorrect)/numel(predictions)	% fraction of correct predictions

iswrong = predictions ~= testdata.Character
misclassrate = sum(iswrong)/numel(predictions)	% fraction of incorrect predictions

confusionchart(testdata.Character,predictions);
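Beyond the chart, the raw confusion-matrix counts can be used to compute a per-letter misclassification rate. This is a sketch using the confusionmat function, whose second output gives the class order of the rows and columns:

```matlab
% Confusion-matrix counts: rows are true classes, columns are predicted classes
[cm, classes] = confusionmat(testdata.Character, predictions);
% Per-class misclassification rate: off-diagonal counts divided by row totals
perclassError = 1 - diag(cm)./sum(cm,2);
table(classes, perclassError)
```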

Origin blog.csdn.net/Explore_OuO/article/details/108885088