SPSS: Data Checking and Maintenance

The picture below is from teacher Nie Hui's courseware 3-2:
Insert picture description here

1. Data verification and cleaning

Data verification is the error-detection step: it ensures that the data were entered correctly. The data that result from this step are the computerized raw data, whose format, content, and arrangement are fully consistent with the original text data and conform to the coding principles of the coding system.

Purpose: To maintain the correctness of the data input process.

(1) Identify and delete duplicate cases

The overall idea: use "Identify Duplicate Cases" to create an indicator column that flags duplicate cases (rows), then use "Select Cases" to delete the flagged duplicates.

Deleting duplicate cases:
Generate an indicator column named Repeat.
Identify the duplicate cases by the value of the Repeat column and delete them.
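The flag-then-filter idea can be sketched outside SPSS in a few lines of Python. This is only an illustration under stated assumptions: the records and field names are made up, and the coding of the flag (1 = repeat, 0 = first occurrence) is chosen for this sketch; the indicator that SPSS itself generates may be coded differently.

```python
def flag_duplicates(rows, key_fields):
    """Return one flag per row: 0 for the first occurrence of a key, 1 for a repeat."""
    seen = set()
    flags = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        flags.append(1 if key in seen else 0)
        seen.add(key)
    return flags


# Hypothetical survey records; the field names are invented for illustration.
cases = [
    {"id": 1, "gender": "F", "hours": 3},
    {"id": 2, "gender": "M", "hours": 5},
    {"id": 1, "gender": "F", "hours": 3},  # re-entered copy of case 1
    {"id": 3, "gender": "F", "hours": 2},
]

# Step 1: build the indicator column.
repeat = flag_duplicates(cases, ["id", "gender", "hours"])

# Step 2: keep only the cases whose flag is 0.
deduplicated = [row for row, flag in zip(cases, repeat) if flag == 0]

print(repeat)
print([row["id"] for row in deduplicated])
```

Keeping the flag as a separate column, rather than deleting on the fly, mirrors the two-step workflow above and lets you inspect the flagged cases before removing them.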

(2) Logical verification

The entered data may be wrong because respondents were careless, concealed information, or filled in the form incorrectly; for example, an average mobile device usage time of 30 hours, which exceeds the 24 hours in a day. The entered data therefore needs a logical check. The method is to tabulate the data and look for logical problems. Related command: Analyze/Tables/Custom Tables

First, analyze the data to determine which columns are logically related. In the two columns selected here, if a respondent has no Internet experience (0), then the online shopping spending should also be 0.
Use a custom table to find the abnormal values.

Select the two related columns as the row and column variables.
In the report, every value in the No column should be 0, but a 1 appears, which indicates logically inconsistent data.
Next, delete these records.
When you build the custom table again, the records that violated this logical relationship are gone.
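The same check can be expressed as a small cross-tabulation plus a consistency rule. The following Python sketch assumes hypothetical field names (`has_internet`, `online_spend`) and invented records; it is not the SPSS procedure itself, only the logic behind it.

```python
from collections import Counter

# Hypothetical records: has_internet is 0 (no) or 1 (yes);
# online_spend is the online shopping spending.
records = [
    {"case": 1, "has_internet": 1, "online_spend": 200},
    {"case": 2, "has_internet": 0, "online_spend": 0},
    {"case": 3, "has_internet": 0, "online_spend": 150},  # contradicts the rule
    {"case": 4, "has_internet": 1, "online_spend": 0},
]


def crosstab(rows):
    """Count the cases in each (has_internet, spends_online) cell."""
    return Counter((r["has_internet"], int(r["online_spend"] > 0)) for r in rows)


def is_consistent(r):
    """No Internet experience implies zero online spending."""
    return r["has_internet"] == 1 or r["online_spend"] == 0


table = crosstab(records)
bad_cases = [r["case"] for r in records if not is_consistent(r)]
cleaned = [r for r in records if is_consistent(r)]

print(table[(0, 1)])              # cases with no Internet experience but spending > 0
print(bad_cases)
print(crosstab(cleaned)[(0, 1)])  # the offending cell is empty after cleaning
```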

2. Data preparation

Once errors have been screened out, refine the data further to prepare it for analysis.

(1) Treatment of missing values

Use "Replace Missing Values" to handle missing values.

The following replacement methods are available:

A. Series mean: uses the mean of the entire series as the substitute.
B. Mean of nearby points: uses the mean of the nearby points as the substitute.
C. Median of nearby points: uses the median of the nearby points as the substitute.
D. Linear interpolation: fills in the missing value with a linear combination of the points before and after it, which is a weighted average.
E. Linear trend at point: uses the fitted value from a regression line fitted to the series as the substitute.

Note: when all five methods are used to fill in "class anxiety", "mean of nearby points", "median of nearby points", and "linear interpolation" cannot fill in the first and second values, because these methods require that the neighboring points are not empty.
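To make the difference between the five methods concrete, here is a minimal Python sketch of each one. It is an approximation written for illustration, not SPSS's implementation: the series is invented, missing values are represented as `None`, the nearby-point methods are assumed to need `span` valid values on each side, and the trend line is assumed to be fitted against the case position 1..n.

```python
from statistics import mean, median


def series_mean(xs):
    """A. Replace every missing value with the mean of all valid values."""
    m = mean(x for x in xs if x is not None)
    return [m if x is None else x for x in xs]


def _neighbours(xs, i, span):
    """Nearest `span` valid values on each side of position i, or None if a side falls short."""
    before = [x for x in xs[:i] if x is not None][-span:]
    after = [x for x in xs[i + 1:] if x is not None][:span]
    if len(before) < span or len(after) < span:
        return None
    return before + after


def nearby(xs, span, stat):
    """B/C. Replace missing values with stat (mean or median) of the nearby points."""
    out = []
    for i, x in enumerate(xs):
        if x is None:
            pts = _neighbours(xs, i, span)
            out.append(stat(pts) if pts is not None else None)
        else:
            out.append(x)
    return out


def linear_interpolation(xs):
    """D. Interpolate between the last valid value before and the first valid value after."""
    out = list(xs)
    valid = [i for i, x in enumerate(xs) if x is not None]
    for i, x in enumerate(xs):
        if x is None:
            left = [j for j in valid if j < i]
            right = [j for j in valid if j > i]
            if left and right:  # stays missing without a valid point on both sides
                a, b = left[-1], right[0]
                out[i] = xs[a] + (i - a) / (b - a) * (xs[b] - xs[a])
    return out


def linear_trend(xs):
    """E. Replace missing values with the fitted value of a least-squares line."""
    pts = [(i + 1, x) for i, x in enumerate(xs) if x is not None]
    mt = mean(t for t, _ in pts)
    my = mean(y for _, y in pts)
    slope = sum((t - mt) * (y - my) for t, y in pts) / sum((t - mt) ** 2 for t, _ in pts)
    intercept = my - slope * mt
    return [intercept + slope * (i + 1) if x is None else x for i, x in enumerate(xs)]


# Hypothetical series: the first two cases and the fifth case are missing.
anxiety = [None, None, 3, 4, None, 6, 7]

print(series_mean(anxiety))
print(nearby(anxiety, 2, mean))
print(nearby(anxiety, 2, median))
print(linear_interpolation(anxiety))
print(linear_trend(anxiety))
```

In this sketch only the series mean and the linear trend fill the first two cases; the three neighbor-based methods leave them as `None` because no valid point precedes them, which matches the behavior noted above.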

(2) Deviation values (outliers)

Check whether the attribute under study contains any deviation values.

Analysis Idea 1-Frequency Statistics

Check whether the values of this field fall within the normal range (0~24). The Frequencies procedure under Analyze/Descriptive Statistics can be used for this, and its output gives the basic statistics discussed next.

Analysis: With a mean of 4.19 and a standard deviation of 4.243, the value 30 is clearly an outlier. It lies roughly six standard deviations above the mean and also exceeds the upper limit of 24. More precisely, it is an extreme value and deserves special attention.
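The two checks used here, a range check and the distance from the mean in standard deviations, are simple enough to sketch in Python. The `hours` list is hypothetical; only the mean 4.19, the standard deviation 4.243, and the value 30 come from the text.

```python
def out_of_range(values, low=0, high=24):
    """Return the values that fall outside the plausible range [low, high]."""
    return [v for v in values if v < low or v > high]


def z_score(value, mean, sd):
    """Distance of a value from the mean, in standard deviations."""
    return (value - mean) / sd


# Hypothetical daily online hours, with one impossible entry.
hours = [2, 4, 3, 30, 5, 1]

print(out_of_range(hours))
print(round(z_score(30, 4.19, 4.243), 2))  # statistics taken from the text
```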

Analysis Idea 2-Box Plot

Analysis idea 1 relies on common sense and is practical only in some situations. A more general method is the box plot: a value lying more than 1.5 box lengths from the edge of the box is considered a deviation value (outlier); a value lying more than 3 box lengths from the edge is an extreme value. The more extreme values and deviation values there are, the more serious the deviation.
The picture below is from teacher Nie Hui's courseware 3-2:

Insert picture description here
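The box-length rule can be written down directly. The sketch below is a minimal Python version under stated assumptions: the data are invented, and the quartiles come from `statistics.quantiles` with the "inclusive" method, which may differ slightly from the hinges SPSS uses to draw the box.

```python
from statistics import quantiles


def classify_by_box_length(values):
    """Split values into deviation values (outliers) and extreme values by box length."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    box = q3 - q1  # box length (interquartile range)
    outliers, extremes = [], []
    for v in values:
        distance = max(q1 - v, v - q3)  # distance beyond the box edge; negative inside the box
        if distance > 3 * box:
            extremes.append(v)
        elif distance > 1.5 * box:
            outliers.append(v)
    return {"q1": q1, "q3": q3, "outliers": outliers, "extremes": extremes}


# Hypothetical online hours.
hours = [1, 2, 2, 3, 3, 4, 4, 5, 6, 12, 30]
result = classify_by_box_length(hours)

print(result["q1"], result["q3"])
print(result["outliers"])
print(result["extremes"])
```

For this sample the box runs from 2.5 to 5.5, so 12 is flagged as a deviation value and 30 as an extreme value.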

SPSS drawing steps:

Insert picture description here

Extension-Bivariate Outlier Test
Sometimes we need to split the sample into groups and then compare the distribution of the variable of interest across the groups.
Task: compare the time spent online by boys and girls, using a nominal variable (gender) and a scale variable (time spent online).

Analysis:

  • The distributions of online time for boys and girls differ only slightly: the mean is 3.70 hours for boys and 4.61 hours for girls.
  • The girls' data are more dispersed: the standard deviation is 2.494 for boys and 5.459 for girls, and the box plot also shows that some girls spend more time online. However, the deviation values reveal an extreme value of 54 in the female group, which is unreasonable and may be an input error. Because this value distorts the descriptive statistics of the female sample, it is recommended to check and remove it before analyzing.
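A grouped comparison like this one can be sketched in Python to show how a single impossible value inflates a group's statistics. The records are invented for illustration and do not reproduce the figures quoted above; only the extreme value 54 and the 24-hour limit come from the text.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (gender, daily online hours) records with one impossible entry.
records = [("M", 3), ("M", 4), ("M", 5), ("F", 2), ("F", 3), ("F", 4), ("F", 54)]


def describe_by_group(rows):
    """Compute n, mean, standard deviation, and maximum for each group."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {
        g: {"n": len(v), "mean": mean(v), "sd": stdev(v), "max": max(v)}
        for g, v in groups.items()
    }


before = describe_by_group(records)
# Drop values that cannot occur in a 24-hour day, then describe again.
after = describe_by_group([(g, v) for g, v in records if v <= 24])

print(before["F"]["mean"], before["F"]["max"])
print(after["F"]["mean"], after["F"]["max"])
```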

Analysis Idea 3-Outliers (Extreme Values Table)

This approach produces an extreme values table that lists the cases with the most extreme values, so that you can inspect those cases and their values and judge whether their distance from the rest of the data is reasonable.
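An extreme values table is essentially a list of the highest and lowest values together with their case numbers. A minimal Python sketch, assuming invented data and 1-based case numbers:

```python
def extreme_values(values, k=5):
    """Return the k highest and k lowest values as (case_number, value) pairs."""
    indexed = sorted(enumerate(values, start=1), key=lambda pair: pair[1])
    return {"highest": indexed[-k:][::-1], "lowest": indexed[:k]}


# Hypothetical daily online hours; the position in the list is the case number.
hours = [2, 4, 3, 30, 5, 1, 6, 2, 3, 4, 7, 0]
table = extreme_values(hours, k=3)

print(table["highest"])
print(table["lowest"])
```

Here the gap between the top value (case 4, 30 hours) and the next highest (7 hours) is what signals that the case needs checking.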

Summary

When identifying deviation values, it is generally necessary to combine several charts and tables before reaching a final conclusion. For the online time analysis, you can use box plots, frequency distribution graphs, extreme value tables, and so on.


Origin blog.csdn.net/MaoziYa/article/details/114858186