SPSSPRO data analysis - CSI data preprocessing, dimensionality reduction

Table of contents

1. Introduction

2. Data preparation

3. Preprocessing 

4. Perform dimensionality reduction tasks

5. Normality detection 

6. Code functionality


1. Introduction

SPSSPRO is a new online data analysis platform that can be used for scientific research data analysis, mathematical modeling, and more. For newcomers who cannot program or are just getting started with research, it is a very good fit. That said, I only used it for some modeling a long time ago, so writing this is a bit like wielding a broadsword in front of Guan Gong (showing off in front of the experts).

2. Data preparation

1. First, prepare some data. The data needs a header row and similar information. I will use a piece of CSI amplitude data as an example (300 rows * 30 columns). The header can be added by hand, or generated with MATLAB or another program. Adding the header labels in MATLAB:

% Convert the amplitude matrix to a table (columns are named raw_amp1, raw_amp2, ...)
T = array2table(raw_amp);
% Write the table, header row included, to an Excel file
writetable(T,'SpassTest.xlsx');
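
If you prefer Python, a minimal equivalent sketch using pandas (here raw_amp is assumed to be a 300*30 NumPy array of CSI amplitudes already in memory; writing .xlsx with pandas needs the openpyxl package):

import numpy as np
import pandas as pd

# Placeholder data; replace with your own CSI amplitude matrix
raw_amp = np.random.rand(300, 30)

# Name the columns like MATLAB's array2table would (raw_amp1, raw_amp2, ...)
df = pd.DataFrame(raw_amp, columns=[f'raw_amp{i+1}' for i in range(raw_amp.shape[1])])
df.to_excel('SpassTest.xlsx', index=False)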

2. Import the generated spreadsheet into SPSSPRO and view the data:

3. Preprocessing 

1. Data processing --> Outlier handling

2. Select the three-standard-deviation (3-sigma) rule for preprocessing. Detected outliers can be removed directly, or replaced with the median, mean, mode, etc., depending on your task. Simply drag the dimension variables into the selected-variable box (a Python sketch of the same rule is given at the end of this section).

3. Generate the processed data; the header names of the output table are determined by the fourth option in the figure above (we keep the default):
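
For reference, a minimal Python sketch of the same three-standard-deviation rule (assuming the data has been exported to SpassTest.xlsx as above); here outliers are replaced with the column median, but you can just as well drop them or use the mean or mode:

import pandas as pd

data = pd.read_excel('SpassTest.xlsx')

# Replace values outside mean +/- 3 * standard deviation, column by column
cleaned = data.copy()
for col in data.columns:
    m, s = data[col].mean(), data[col].std()
    mask = (data[col] - m).abs() > 3 * s
    cleaned.loc[mask, col] = data[col].median()

print('Outliers replaced:', int((cleaned != data).sum().sum()))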

4. Perform dimensionality reduction tasks

1. Choose a dimensionality reduction algorithm appropriate for your task. Here we choose PCA for linear dimensionality reduction; if your data is nonlinear, you can use the KPCA algorithm instead. The total explained variance ratio indicates how much of the information is retained after dimensionality reduction; it is usually set between 90% and 99%, and you can pick a value that suits your task (a scikit-learn sketch is given at the end of this section).

2. Generate the dimensionality-reduced data; here we go from 30 dimensions down to 15.

3. Correlation analysis of the reduced dimensions

Data Analysis --> Select Analysis Project --> Select Correlation Analysis

Generate a correlation heat map of the reduced data. From it we can also see that the features after dimensionality reduction are uncorrelated with each other (the pairwise correlation coefficients are essentially 0), as expected for orthogonal principal components.
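
The same reduction can be sketched with scikit-learn (assuming the cleaned 300*30 data sits in the DataFrame cleaned from the preprocessing sketch above); PCA is asked to keep 90% of the total variance, and the correlation matrix of the resulting components is printed to confirm they are essentially uncorrelated. For the nonlinear case, scikit-learn also offers KernelPCA.

import numpy as np
from sklearn.decomposition import PCA

# Keep enough principal components to explain at least 90% of the total variance
pca = PCA(n_components=0.90)
reduced = pca.fit_transform(cleaned.values)

print('Original shape:', cleaned.shape)    # (300, 30)
print('Reduced shape:', reduced.shape)     # number of columns depends on the data
print('Total explained variance:', pca.explained_variance_ratio_.sum())

# Off-diagonal correlations between components should be close to 0
corr = np.corrcoef(reduced, rowvar=False)
print(np.round(corr, 3))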

5. Normality detection 

1. Select Algorithm --> Descriptive Analysis --> Normality Detection, taking the first dimension of the data as an example (a Python sketch reproducing these checks is given at the end of this section):

The figure above shows the normality-test histogram of the Dim3 data. If the histogram is roughly bell-shaped (high in the middle, low at both ends), then although the data is not perfectly normal, it can basically be accepted as normally distributed. Judging from the test results, the 30 dimensions can basically be treated as normally distributed.

2. Normality test P-P plot

The figure above shows how well the cumulative probability (P) calculated from Dim1 fits the normal cumulative probability (P). The better the fit, the more closely the data follows a normal distribution. Judging from the test results, the 30 dimensions can basically be treated as normally distributed.
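
These normality checks can also be reproduced in Python with matplotlib and SciPy (assuming reduced is the dimensionality-reduced array from the PCA sketch above); the sketch draws a histogram and a normal probability plot for the first dimension and runs a Shapiro-Wilk test:

import matplotlib.pyplot as plt
from scipy import stats

dim1 = reduced[:, 0]  # first reduced dimension (Dim1)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: a roughly bell-shaped curve suggests approximate normality
axes[0].hist(dim1, bins=30)
axes[0].set_title('Histogram of Dim1')

# Probability plot: points close to the reference line suggest a good fit to the normal distribution
stats.probplot(dim1, dist='norm', plot=axes[1])
axes[1].set_title('Normal probability plot of Dim1')

plt.tight_layout()
plt.show()

# Shapiro-Wilk test: p > 0.05 means normality is not rejected
stat, p = stats.shapiro(dim1)
print(f'Shapiro-Wilk statistic = {stat:.3f}, p = {p:.3f}')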

6. Code functionality

1. SPSSPRO can run Python code directly

2. In addition to the commonly used Python libraries it provides, you can also install open-source libraries yourself

3. You can view the libraries that ship with SPSSPRO, as well as the ones you have installed yourself

4. Use code to visualize the imported data; in our example the data is 300*30

(1) Import the corresponding library and data

import pandas as pd
import matplotlib.pyplot as plt

# Read the data exported earlier
data = pd.read_excel('SpassTest.xlsx')

(2) Print the data

(3) Visualize the original data and the data processed with the three-standard-deviation rule

(4) We found that the system's built-in preprocessing did not work very well for our data, so we write our own preprocessing routine in the notebook:
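
A minimal sketch covering steps (2) to (4), again assuming SpassTest.xlsx from earlier. For step (4) it uses a sliding-window (Hampel-style) median filter per column as one possible hand-written alternative to the global three-sigma rule; this is an illustrative choice, not necessarily the exact routine from the original post.

import pandas as pd
import matplotlib.pyplot as plt

# (2) Print the imported data
data = pd.read_excel('SpassTest.xlsx')
print(data.shape)       # (300, 30)
print(data.head())

# (3) Original data vs. data processed with the global three-sigma rule
three_sigma = data.copy()
for col in data.columns:
    m, s = data[col].mean(), data[col].std()
    mask = (data[col] - m).abs() > 3 * s
    three_sigma.loc[mask, col] = data[col].median()

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
axes[0].plot(data.values)
axes[0].set_title('Original CSI amplitudes')
axes[1].plot(three_sigma.values)
axes[1].set_title('After global 3-sigma replacement')
plt.tight_layout()
plt.show()

# (4) Illustrative alternative: sliding-window (Hampel-style) median filter per column
window = 15
cleaned = data.copy()
for col in data.columns:
    med = data[col].rolling(window, center=True, min_periods=1).median()
    mad = (data[col] - med).abs().rolling(window, center=True, min_periods=1).median()
    mask = (data[col] - med).abs() > 3 * 1.4826 * mad
    cleaned.loc[mask, col] = med[mask]

cleaned.plot(legend=False, title='After sliding-window median filtering')
plt.show()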

Doesn't the result look much better? Get moving and start your own data analysis journey.


Original post: blog.csdn.net/qq_53860947/article/details/131325834