Data analysis study notes: data preprocessing
Data preprocessing serves two purposes: it improves the quality of the data, and it adapts the data to the specific mining techniques or tools that will be used.
The main steps of data preprocessing are data cleaning, data integration, data transformation, and data reduction.
The knowledge points are summarized as follows:
The main process of data preprocessing
Data cleaning: Remove irrelevant and duplicate records from the original data set, smooth noisy data, filter out data unrelated to the mining topic, and handle missing values and outliers.
Data integration: Combine data from multiple sources and store it in a consistent data store (such as a data warehouse).
Data transformation: Normalize the data and convert it into a form suited to the requirements of the mining task and algorithm.
Data reduction: Complex analysis and mining on large data sets takes a long time. Data reduction produces a new data set that is smaller but still preserves the integrity of the original data, so analysis and mining on the reduced data set is more efficient.
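
The four steps above can be sketched in plain Python as a rough illustration. Everything in this example is an assumption made for demonstration: the sample records, the function names (clean, integrate, transform, reduce_rows), the fixed outlier range, mean imputation, min-max scaling, and reduction by attribute selection plus random sampling. Real projects would choose each technique to fit the data and the mining task.

```python
import random
from statistics import mean

# Hypothetical sample data: two small "sources" keyed by a record id.
orders = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": None},      # missing value
    {"id": 2, "amount": None},      # duplicate record
    {"id": 3, "amount": 95.0},
    {"id": 4, "amount": 10000.0},   # outlier
    {"id": 5, "amount": 110.0},
]
regions = {1: "north", 2: "south", 3: "north", 4: "east", 5: "south"}


def clean(rows, low=0.0, high=1000.0):
    """Data cleaning: drop duplicates and outliers, fill missing values."""
    seen, unique = set(), []
    for row in rows:
        key = (row["id"], row["amount"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))  # copy so the input is not modified
    # Assumed rule: values outside [low, high] count as outliers.
    kept = [r for r in unique
            if r["amount"] is None or low <= r["amount"] <= high]
    # Assumed rule: fill missing values with the mean of the known values.
    fill = mean(r["amount"] for r in kept if r["amount"] is not None)
    for r in kept:
        if r["amount"] is None:
            r["amount"] = fill
    return kept


def integrate(rows, region_by_id):
    """Data integration: merge a second source into each record by id."""
    return [{**r, "region": region_by_id[r["id"]]} for r in rows]


def transform(rows):
    """Data transformation: min-max normalize amount into [0, 1]."""
    values = [r["amount"] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [{**r, "amount_scaled": (r["amount"] - lo) / span} for r in rows]


def reduce_rows(rows, keep=("region", "amount_scaled"), n=3, seed=0):
    """Data reduction: keep selected attributes and a random sample of rows."""
    sample = random.Random(seed).sample(rows, min(n, len(rows)))
    return [{k: r[k] for k in keep} for r in sample]


prepared = reduce_rows(transform(integrate(clean(orders), regions)))
```

The steps are chained in the order listed in these notes, although in practice the order may vary and some steps may be repeated.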