Data cleaning for big data processing

Learning Objectives:

1. Learn to find dirty data
2. Learn to clean data

Learning Content:

1. Definition of "dirty data" and the criteria for identifying it
2. Cleaning data in an Oracle database

Study Time:

About 6 hours, assuming you already have a basic grounding in Oracle

Learning Outputs:

1. One technical note
2. Relevant code for data cleaning

ETL Data Cleaning

The principle of data cleaning is to analyze the causes and forms of "dirty data", apply existing techniques and methods to clean it, and transform the original substandard data into data that meets the data quality requirements or the needs of the application, thereby improving the quality of the dataset.

As the name suggests, data cleaning means "washing out" the "dirt". It is the final step in which identifiable errors in data files are found and corrected, including checking data consistency and handling invalid and missing values. The data in a data warehouse is a subject-oriented collection extracted from multiple business systems, and it includes historical data, so it is unavoidable that some records are wrong and others conflict with each other. Such wrong or conflicting data is clearly unwanted and is called "dirty data". Data cleaning means "washing away" this dirty data according to defined rules.
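As a rough illustration of the two checks mentioned above (missing values and records that conflict across source systems), here is a minimal Python sketch. The two source lists, the field names (`customer_id`, `name`, `region`), and the helper functions are all hypothetical and chosen only for the example; in practice the rows would come from queries against the business systems.

```python
# Hypothetical rows extracted from two business systems (illustrative data only).
crm_rows = [
    {"customer_id": 1, "name": "Acme", "region": "North"},
    {"customer_id": 2, "name": "Globex", "region": ""},
]
billing_rows = [
    {"customer_id": 1, "name": "Acme", "region": "South"},
    {"customer_id": 3, "name": None, "region": "East"},
]

# Assumption: these are the fields the warehouse requires to be populated.
REQUIRED_FIELDS = ("name", "region")


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def find_missing(rows, required=REQUIRED_FIELDS):
    """Return (customer_id, field) pairs where a required field is empty."""
    return [
        (row["customer_id"], field)
        for row in rows
        for field in required
        if is_missing(row.get(field))
    ]


def find_conflicts(rows_a, rows_b, key="customer_id"):
    """Return (key, field, value_a, value_b) where the two sources disagree."""
    by_key = {row[key]: row for row in rows_b}
    conflicts = []
    for row in rows_a:
        other = by_key.get(row[key])
        if other is None:
            continue  # record exists in only one source, nothing to compare
        for field, value in row.items():
            if field == key or field not in other:
                continue
            if is_missing(value) or is_missing(other[field]):
                continue  # missing values are reported by find_missing
            if value != other[field]:
                conflicts.append((row[key], field, value, other[field]))
    return conflicts


missing = find_missing(crm_rows) + find_missing(billing_rows)
conflicts = find_conflicts(crm_rows, billing_rows)
print(missing)
print(conflicts)
```

Keeping the two checks separate means a blank value is reported once as missing rather than also being flagged as a conflict with the other system.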

1. Incomplete data: This type of data is missing required information, such as a supplier name, a branch name, or a customer's region, or it shows a mismatch between the master table and the detail table in the business system. Such records are filtered out, the missing content is written to separate Excel files and submitted to the customer, and the customer is asked to complete it within a specified time. Once completed, the data is written to the data warehouse.

2. Remove unnecessary fields: Some fields may not be used in the data analysis process, so they should be deleted.
3. Format content
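
The first two items in the list above can be sketched in Python as follows. This is only an illustration under stated assumptions: the supplier rows, the `REQUIRED` and `UNUSED` field lists, and all function names are hypothetical, and the report is written as CSV (which Excel can open) using the standard library rather than as a native Excel file.

```python
import csv
import io

# Hypothetical supplier rows; field names are illustrative only.
rows = [
    {"supplier_id": 10, "supplier_name": "Acme", "branch": "HQ", "fax": "n/a"},
    {"supplier_id": 11, "supplier_name": "", "branch": "East", "fax": "n/a"},
    {"supplier_id": 12, "supplier_name": "Initech", "branch": None, "fax": "n/a"},
]
REQUIRED = ("supplier_name", "branch")  # assumed mandatory fields
UNUSED = ("fax",)                       # assumed to be unused in analysis


def missing_fields(row, required=REQUIRED):
    """List the required fields that are None or blank in this row."""
    return [f for f in required if row.get(f) is None or str(row[f]).strip() == ""]


def split_rows(rows):
    """Separate complete rows from incomplete ones, noting what is missing."""
    complete, incomplete = [], []
    for row in rows:
        gaps = missing_fields(row)
        if gaps:
            incomplete.append({**row, "missing": ";".join(gaps)})
        else:
            complete.append(row)
    return complete, incomplete


def drop_fields(rows, unused=UNUSED):
    """Return copies of the rows without the fields that analysis does not use."""
    return [{k: v for k, v in row.items() if k not in unused} for row in rows]


def write_report(rows, handle):
    """Write incomplete rows as CSV so they can be sent back for completion."""
    if not rows:
        return
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


complete, incomplete = split_rows(rows)
clean = drop_fields(complete)   # rows ready to load into the warehouse
report = io.StringIO()          # stand-in for a file sent to the customer
write_report(incomplete, report)
print(clean)
print(report.getvalue())
```

Only the complete rows are loaded; the incomplete ones stay out of the warehouse until the customer returns the filled-in report.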

Origin blog.csdn.net/qq_22201881/article/details/125455014