The picture below is from teacher Nie Hui's courseware 3-2:
1. Data verification and cleaning
Verification is the error-detection step that ensures the data were entered correctly. The data that pass this step are the computerized raw data: their format, content, and arrangement match the original text data exactly and follow the coding principles of the coding system.
Purpose: to guarantee the correctness of the data entry process.
(1) Identify and delete duplicate cases
The overall idea: use "Identify Duplicate Cases" to generate an indicator column that flags duplicate cases, then use "Select Cases" to delete the flagged cases.
Deleting duplicate cases:
- Generate an indicator column (Repeat).
- Based on the value of the Repeat column, select the flagged cases and delete them.
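The same two-step idea can be sketched outside SPSS. A minimal pandas analogue, using a made-up data frame (the column names `gender`, `net_experience`, and `online_hours` are hypothetical):

```python
import pandas as pd

# Hypothetical survey data; the last row duplicates the first case.
df = pd.DataFrame({
    "gender":         [1, 2, 2, 1],
    "net_experience": [1, 0, 1, 1],
    "online_hours":   [3, 0, 5, 3],
})

# Step 1: generate an indicator column that flags duplicate cases
# (the analogue of the Repeat column SPSS creates).
df["Repeat"] = df.duplicated(keep="first").astype(int)

# Step 2: keep only the cases not flagged as duplicates.
cleaned = df[df["Repeat"] == 0].drop(columns="Repeat")
print(len(cleaned))  # 3
```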
(2) Logical verification
The entered data may be wrong because a respondent was careless, concealed information, or filled in the wrong answer; for example, an average daily mobile-device usage of 30 hours. The entered data therefore need a logical check. The method is to use tabulation to see whether there are logical problems. Related command: Analyze > Tables > Custom Tables.
First, analyze the data to determine which columns have a logical relationship. For the two columns selected in the figure below: if a respondent has no Internet experience (0), the online shopping spending amount should also be 0.
Find the abnormal values through a custom table: select the two related columns as the row and column variables.
The report shows that the values in the "No" row should all be 0, but a 1 appears, indicating logically erroneous data.
Next, delete these records:
Rebuilding the custom table afterwards confirms that the cases violating this logical relationship have been removed:
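As a rough pandas analogue of this check (the column names and data here are invented for illustration):

```python
import pandas as pd

# Hypothetical data: net_experience == 0 should force shopping_spend == 0.
df = pd.DataFrame({
    "net_experience": [0, 0, 1, 1, 0],
    "shopping_spend": [0, 300, 500, 0, 0],
})

# Cross-tabulate the two related fields (analogue of a custom table).
table = pd.crosstab(df["net_experience"], df["shopping_spend"] > 0)
print(table)

# Delete the logically inconsistent records: no experience but spend > 0.
ok = ~((df["net_experience"] == 0) & (df["shopping_spend"] > 0))
df_clean = df[ok]
print(len(df_clean))  # 4
```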
2. Data preparation
Building on the error screening above, further refine and prepare the data.
(1) Treatment of missing values
Use "Replace Missing Values" to handle missing values. The following replacement methods are available:
A. Series mean: replace with the mean of the entire series.
B. Mean of nearby points: replace with the mean of the neighboring points.
C. Median of nearby points: replace with the median of the neighboring points.
D. Linear interpolation: fill with a linear combination, i.e., a weighted average, of the valid points immediately before and after the missing value.
E. Linear trend at point: fit a regression line to the series and replace with the fitted value.
A caution in use: applying the five methods shown in the figure below to fill in "class anxiety" reveals that "mean of nearby points", "median of nearby points", and "linear interpolation" cannot fill the first and second values, because these methods require the neighboring points to be non-missing.
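The behavior of methods A, B, and D can be imitated in pandas. This is a sketch on a made-up series, not SPSS itself; note how methods B and D leave the two leading gaps unfilled, mirroring the limitation described above:

```python
import numpy as np
import pandas as pd

# A made-up series with missing values at positions 0, 1, and 4.
s = pd.Series([np.nan, np.nan, 4.0, 6.0, np.nan, 8.0])

# A. Series mean: every gap gets the mean of the whole series.
a = s.fillna(s.mean())

# B. Mean of the adjacent points (one before, one after); a gap stays
# missing when either neighbor is missing, so leading gaps are not filled.
nearby = (s.shift(1) + s.shift(-1)) / 2
b = s.fillna(nearby)

# D. Linear interpolation: weighted average of the surrounding valid
# points; it likewise cannot fill gaps with no preceding valid point.
d = s.interpolate(method="linear")

print(d.tolist())  # [nan, nan, 4.0, 6.0, 7.0, 8.0]
```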
(2) Outliers
Analyze whether the attributes shown in the figure below contain outliers:
Analysis Idea 1: Frequency Statistics
Check whether the values of this field fall within the normal range (0-24 hours). Use Analyze > Descriptive Statistics > Frequencies:
The following results can be obtained:
Analysis: with a mean of 4.19 and a standard deviation of 4.243, the value 30 is clearly an outlier; to be precise, it is an extreme value and deserves special attention.
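A range check of this kind is easy to reproduce. A sketch in pandas with invented hours data:

```python
import pandas as pd

# Hypothetical daily online hours; the valid range is 0-24.
hours = pd.Series([2, 3, 4, 5, 3, 30])

# Frequency table (analogue of Descriptive Statistics > Frequencies).
freq = hours.value_counts().sort_index()
print(freq)

# Flag values outside the plausible range.
out_of_range = hours[(hours < 0) | (hours > 24)]
print(out_of_range.tolist())  # [30]
```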
Analysis Idea 2: Box Plot
Analysis idea 1 relies on common sense and is practical only in some situations. A more general method is the box plot: a case more than 1.5 box lengths from the edge of the box is an outlier; a case more than 3 box lengths away is an extreme value. The more outliers and extreme values there are, the more serious the deviation.
The picture below is from teacher Nie Hui's courseware 3-2:
SPSS drawing steps:
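While SPSS draws the box plot from the menus, the 1.5 and 3 box-length fences can also be computed directly. A sketch in pandas with made-up data (the "box length" is the interquartile range, IQR):

```python
import pandas as pd

# Hypothetical daily online hours.
hours = pd.Series([1, 2, 2, 3, 3, 4, 4, 5, 6, 30])

q1, q3 = hours.quantile([0.25, 0.75])
iqr = q3 - q1  # the box length

# Outliers: more than 1.5 * IQR beyond the box edges.
outliers = hours[(hours < q1 - 1.5 * iqr) | (hours > q3 + 1.5 * iqr)]
# Extreme values: more than 3 * IQR beyond the box edges.
extremes = hours[(hours < q1 - 3 * iqr) | (hours > q3 + 3 * iqr)]
print(outliers.tolist())  # [30]
```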
Extension: Bivariate Outlier Testing
Sometimes we need to split the sample into groups and compare the distribution of the variable of interest across the groups.
Task: compare the time spent online by boys and girls, i.e., a nominal variable (gender) against a scale variable (time spent online).
Analysis:
- There is little difference in the distribution of online time between boys and girls; the mean is 3.70 hours for boys and 4.61 hours for girls.
- The girls' data are slightly more dispersed: the standard deviation is 2.494 for boys and 5.459 for girls, and the box plot shows that some girls spend more time online. The outlier check, however, reveals an extreme value of 54 hours in the female group, which is implausible and may be a data-entry error. Because this value distorts the descriptive statistics of the female sample, it should be checked and removed before analysis.
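A grouped comparison of this kind can be sketched with a groupby (the numbers here are invented and differ from the courseware's sample):

```python
import pandas as pd

# Hypothetical sample: gender (1 = male, 2 = female) vs. daily online hours.
# The 54 in the female group plays the role of the implausible extreme value.
df = pd.DataFrame({
    "gender": [1, 1, 1, 2, 2, 2],
    "hours":  [3, 4, 4, 2, 5, 54],
})

# Per-group descriptive statistics, the tabular counterpart of a
# grouped box plot.
stats = df.groupby("gender")["hours"].agg(["mean", "std"])
print(stats)
```

Dropping the extreme case before computing `stats` would show how strongly a single value inflates the female group's mean and standard deviation.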
Analysis Idea 3: Extreme Value Table
This idea is to produce an extreme-values table, inspect the extreme cases and their values, and judge whether the distances of the extreme values are reasonable.
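A rough analogue of an extreme-values table, listing the largest and smallest cases (the data are invented):

```python
import pandas as pd

# Hypothetical daily online hours.
hours = pd.Series([2, 3, 4, 5, 1, 30, 6, 0])

# The five largest and five smallest cases, as in an extreme-values table.
top5 = hours.nlargest(5)
bottom5 = hours.nsmallest(5)
print(top5.tolist())     # [30, 6, 5, 4, 3]
print(bottom5.tolist())  # [0, 1, 2, 3, 4]
```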
Summary
When identifying outliers, it is generally necessary to analyze several charts together to reach a final conclusion. For the online-time analysis, box plots, frequency distributions, and extreme-value tables can all be used.