Machine Learning Experiment - Data Perception and Visualization

Experiment title: Data Perception and Visualization

1. Purpose of the experiment:

1. Develop an intuitive understanding of machine learning data sets
2. Master the use of numpy to create matrices and perform basic matrix operations

2. Experimental steps:

① Randomly generate a linear regression data set
② Randomly generate a linearly separable two-class classification data set
③ Randomly generate a linearly separable multi-class classification data set
 - the sample label is a one-hot vector
 - the sample label is a scalar

3. Experimental results:

① Randomly generate a linear regression data set

Experiment code:
[figure: screenshot of the dataset-generation code]
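The code itself appears in the report only as a screenshot. Below is a minimal sketch of what it likely does, based on the parameter names samples, dim, and var used throughout; all function and variable names are assumptions:

```python
import numpy as np

def make_regression_data(samples=10, dim=5, var=0.02):
    x = np.random.randn(samples, dim)            # features drawn from N(0, 1)
    w = np.random.randn(dim)                     # random "true" weight vector
    r = np.random.randn(samples) * np.sqrt(var)  # Gaussian noise with variance var
    y = x @ w + r                                # labels: weighted sum plus noise
    return x, w, r, y

x, w, r, y = make_regression_data()
print(x.shape, w.shape, r.shape, y.shape)        # (10, 5) (5,) (10,) (10,)
```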

Experimental results:
Parameter set ①: samples=10, dim=5, var=0.02
[figures: generated matrices and visualization]

Parameter set ②: samples=10, dim=5, var=0.05
[figures: generated matrices and visualization]

Parameter set ③: samples=5, dim=5, var=0.02
[figure: generated matrices and visualization]

Parameter set ④: samples=10, dim=3, var=0.02
[figures: generated matrices and visualization]

Question 1: Select a sample and observe the relationship between samples, weights and their labels

This code generates a dataset with 10 samples and 5 features, where each feature value is drawn from a Gaussian distribution with mean 0 and variance 1. It then randomly generates a weight vector w and computes each label y as the weighted sum of the sample x (i.e. y = xw) plus Gaussian noise. Finally, it plots the relationship between each feature and the label for easy observation.

The experimental code is as follows:
[figure: screenshot of the code]
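A plausible sketch of the plotting step, reusing the hypothetical make_regression_data helper from above (the actual plotting code is not shown):

```python
import matplotlib.pyplot as plt

x, w, r, y = make_regression_data()     # the helper sketched above; samples=10, dim=5

fig, axes = plt.subplots(1, x.shape[1], figsize=(15, 3), sharey=True)
for j, ax in enumerate(axes):
    ax.scatter(x[:, j], y)              # one panel per feature, plotted against the label
    ax.set_xlabel(f"feature {j}")
axes[0].set_ylabel("label y")
plt.tight_layout()
plt.show()
```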

Here is the image after generating the dataset:
[figure: scatter plots of each feature against the label]

It can be seen that each feature has a roughly linear relationship with the label. This means we can use a linear regression model to predict the labels.

Question 2: Modify the data set parameters samples, dim, var, and observe the changes in the dimensions of the generated matrix

The dimensions of the x, w, r, and y matrices are all independent of the noise variance var. The shapes of x and w depend on the number of features dim: x is (samples, dim) and w is (dim,), so both grow linearly with dim. The shapes of x, r, and y depend on the number of samples: x is (samples, dim) while r and y are (samples,), so all three grow linearly with samples.
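This can be confirmed by printing the shapes for each parameter set, reusing the hypothetical make_regression_data helper sketched above:

```python
for samples, dim, var in [(10, 5, 0.02), (10, 5, 0.05), (5, 5, 0.02), (10, 3, 0.02)]:
    x, w, r, y = make_regression_data(samples, dim, var)
    print(f"samples={samples} dim={dim} var={var}: "
          f"x{x.shape} w{w.shape} r{r.shape} y{y.shape}")
# var never changes a shape; dim affects x and w; samples affects x, r, and y.
```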

② Randomly generate a linearly separable two-class classification data set

Experiment code:
[figure: screenshot of the experiment code]

Experimental results:
[figures: generated data and visualization]

Question 1: Select a sample and observe the relationship between the sample, the weights, its probability, and its label:

① Generate a linearly separable two-class classification data set. Suppose we need a data set containing 100 samples, each with two features and a label of 0 or 1. The experimental code is as follows:
[figures: screenshots of the code]
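Since the code is shown only as screenshots, here is a minimal sketch of one way to generate such a data set; all names and the seeding are assumptions:

```python
import numpy as np

np.random.seed(0)                   # fixed seed for reproducibility (an assumption)
samples, dim = 100, 2
x = np.random.randn(samples, dim)   # features drawn from N(0, 1)
w = np.random.randn(dim)            # random separating hyperplane through the origin
y = (x @ w > 0).astype(int)         # label 1 on one side of the hyperplane, 0 on the other
print(x.shape, y.shape)             # (100, 2) (100,)
```

Because the labels come from thresholding a linear function of the features, the two classes are linearly separable by construction.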

② Now we can select one of the samples to analyze its relationship with the weights and label. We select the 50th sample; its values are:
[figure: printed values of the 50th sample]

③ The output results are as follows:
[figure: output]

④ This means the sample has two features with values -0.248 and -0.41. If we use a logistic regression model for classification, we can compute the probability that this sample belongs to label 1 from the learned weights. Assuming we have trained the model and obtained the weights w = [0.5, 0.7], the probability that the sample belongs to label 1 can be calculated as follows:
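The calculation itself appears only as a screenshot; presumably it applies the sigmoid function to the weighted sum, roughly like this (a sketch under the stated assumptions, ignoring any bias term):

```python
import numpy as np

x_i = np.array([-0.248, -0.41])  # the selected sample (values quoted in the report)
w = np.array([0.5, 0.7])         # the assumed trained weights from the report

z = x_i @ w                      # linear score: 0.5*(-0.248) + 0.7*(-0.41) = -0.411
p = 1.0 / (1.0 + np.exp(-z))     # sigmoid gives P(label = 1 | x)
print(p)                         # ≈ 0.399, so the model would predict label 0
```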

Sample label experiment results:

[figure: sample label output]

Visualization of the 100-sample data set:
[figure: scatter plot of the two features, colored by class]

③ Randomly generate a linearly separable multi-class classification data set – the label is a one-hot vector
Experimental code:
[figure: screenshot of the experiment code]

Experimental results:
[figures: generated data and visualization]

Question 1: Select a sample and observe the relationship between samples, weights and their probabilities, and labels

① Randomly generate a linearly separable multi-class classification dataset
We can use the scikit-learn library in Python to generate a linearly separable multi-class classification dataset; its make_classification function can generate a dataset with multiple classes. Here we generate a dataset with 1000 samples, 10 features, and 5 classes, each class containing only one cluster center. A sketch of this step follows.
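A hedged sketch of this step (the exact arguments used in the lab are not shown; n_informative and random_state are assumptions):

```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,         # must satisfy n_classes <= 2**n_informative
    n_classes=5,
    n_clusters_per_class=1,  # one cluster center per class
    random_state=42,
)
print(X.shape, y.shape)      # (1000, 10) (1000,)
```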

② Convert each category label into a one-hot vector. This can be done using the OneHotEncoder class from scikit-learn: first reshape y into a column vector, then use OneHotEncoder to transform it, as sketched below.
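A sketch, continuing from the y produced above (the sparse_output argument is an assumption about the scikit-learn version):

```python
from sklearn.preprocessing import OneHotEncoder

# sparse_output requires scikit-learn >= 1.2; older versions use sparse=False.
encoder = OneHotEncoder(sparse_output=False)
y_onehot = encoder.fit_transform(y.reshape(-1, 1))  # y reshaped into a column vector
print(y_onehot.shape)                               # (1000, 5): one column per class
print(y[0], y_onehot[0])                            # e.g. 3 [0. 0. 0. 1. 0.]
```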

③ Observe the relationship between samples, weights, probabilities, and labels.
We can use the LogisticRegression class in scikit-learn to train a logistic regression model and observe these relationships: randomly select a sample, use the trained model to compute the class probabilities for that sample, and output the relationship between the sample, the weights, the probabilities, and the label. A sketch follows below.
Through the above steps, we can generate a linearly separable multi-class classification data set with one-hot labels, and observe the relationship between samples, weights, probabilities, and labels. This helps us better understand how the model works and spot possible problems.
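A sketch of this inspection step, continuing from the X and y above; the sample index and solver settings are assumptions:

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000).fit(X, y)   # X, y from the sketch above
i = 0                                                 # an arbitrary sample index
proba = model.predict_proba(X[i:i + 1])[0]            # one probability per class
print("weight matrix shape:", model.coef_.shape)      # (5, 10): one weight row per class
print("class probabilities:", proba.round(3))
print("predicted class:", proba.argmax(), "true label:", y[i])
```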

Experiment code:
[figure: screenshot of the experiment code]

Experimental results:
[figure: output]

Question 2: Modify the data set parameter classes and observe the change in the dimension of the generated matrix

When the parameter classes is changed, the shapes of the generated feature matrix and label matrix change accordingly. In particular, for a given number of samples n_samples and number of features n_features, increasing classes decreases the number of samples per category (roughly n_samples // classes), so the differences between categories may become more pronounced.

Specifically, when the parameter classes take different values, the dimensions of the generated matrix are as follows:

When classes=2, the shape of the generated feature matrix is (n_samples, n_features), and the shape of the label matrix is (n_samples, 2).
When classes=3, the shape of the generated feature matrix is still (n_samples, n_features), but the shape of the label matrix is (n_samples, 3).
When classes=4, the shape of the generated feature matrix is still (n_samples, n_features), but the shape of the label matrix is (n_samples, 4).
By analogy, as classes increases, the number of columns of the one-hot label matrix increases accordingly, while the feature matrix shape stays the same. A quick check is sketched below.
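A quick way to confirm this, reusing the hypothetical generation and encoding sketches from above:

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import OneHotEncoder

for classes in (2, 3, 4):
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                               n_classes=classes, n_clusters_per_class=1,
                               random_state=42)
    y_onehot = OneHotEncoder(sparse_output=False).fit_transform(y.reshape(-1, 1))
    print(f"classes={classes}: X{X.shape} y_onehot{y_onehot.shape}")
# classes only changes the width of the one-hot label matrix.
```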

④ Randomly generate a linearly separable multi-class classification dataset – the label is a scalar

Experiment code:
[figure: screenshot of the experiment code]

Experimental results:
[figures: generated data and visualization]

Question 1: Modify code lines 6 and 7 so that each sample's label is a scalar. This scalar is the index of the category to which the sample belongs

Modified code:
[figure: screenshot of the modified code]

The modified code generates an integer array y, where y[i] is the category index of the i-th sample (starting from 0).
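The modified lines appear only as a screenshot; a hedged sketch of the idea (generate class scores with a random weight matrix and take the arg max, so each label is a class index; all names and sizes are assumptions):

```python
import numpy as np

samples, dim, classes = 100, 2, 5
x = np.random.randn(samples, dim)   # features
w = np.random.randn(dim, classes)   # one weight column per class
scores = x @ w                      # (samples, classes) matrix of class scores
y = np.argmax(scores, axis=1)       # scalar label = index of the highest-scoring class
print(y.shape, y[:10])              # (100,) e.g. [3 0 4 1 ...]
```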

Run results:
[figure: output]

Question 2: Select a sample and check whether the generated label is correct

Experimental code: the np.argmax function is used to find the category to which each sample belongs, and the result is used as the scalar label
[figure: screenshot of the check code]
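The check code is likewise only a screenshot; a sketch of the idea, assuming the scores and y arrays from the previous sketch:

```python
i = 41                                 # the 42nd sample
print("class scores:", scores[i].round(3))
print("scalar label:", y[i])
assert y[i] == np.argmax(scores[i])    # the label must index the largest score
```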

The 42nd sample is selected for inspection; the results are:
[figure: check output for sample 42]

Due to the randomness of the random number generator, you may get different results each time you run the code, but if the code is correct, the scalar label will always equal the index of the class the sample belongs to (the position of the largest score, or of the 1 in the one-hot vector).

4. Expansion questions

[figure: expansion question statement]

For a three-dimensional data set, a three-dimensional scatter plot can be used for visualization. Since there are 5 categories, different colors or marker shapes can be used to distinguish them, as sketched below.
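The lab's code appears below only as a screenshot; a plausible sketch of such a plot (the generation parameters are assumptions consistent with the text):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=5, n_clusters_per_class=1,
                           random_state=0)
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
sc = ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap="tab10")  # color encodes the class
ax.set_xlabel("feature 0"); ax.set_ylabel("feature 1"); ax.set_zlabel("feature 2")
fig.colorbar(sc, label="class")
plt.show()
```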

Experiment code:
[figure: screenshot of the experiment code]

Experimental results:
[figure: 3D scatter plot of the data set, colored by class]

5. Experimental experience

As an introductory machine learning experiment, data perception and visualization are very important. In this lab, exploratory analysis and visualization of the data give a better understanding of the dataset and the problem.
First of all, a comprehensive understanding of the dataset is needed before visualizing it: the features it contains, the variable types (discrete or continuous), missing values, and so on. Various visualization tools such as histograms, scatter plots, and box plots can then be used to explore the relationships between different features and to check for outliers or anomalies; if any exist, it must be decided how to handle or exclude them.
In addition, data visualization can also help with choosing appropriate features for model training. By observing the relationship between each feature and the target variable, one can judge which features are likely to influence the prediction results and select the best feature set accordingly.
In conclusion, data perception and visualization is an essential step in machine learning practice. Exploratory analysis and visualization make it possible to better understand the dataset and the problem, select appropriate features, and find outliers. These are the foundations for building a good machine learning model, and the skills involved are very helpful for future machine learning and data analysis tasks.
