The Big Data Architect Must-Know Series: Data Quality and Data Cleaning

Author: Zen and the Art of Computer Programming

1 Introduction

Overview

This article will introduce and elaborate on data quality and data cleaning from the following aspects:

  1. Purpose of data collection
  2. The significance of data quality assurance
  3. Data cleaning process
  4. Introduction to data cleaning tools and their characteristics
  5. Case analysis: user profile data cleaning

Purpose of data collection

Data is usually collected to produce valuable information that helps us discover things we did not expect or supports better decisions. During collection, it is therefore necessary to verify the data's integrity, correctness, reliability, and other properties to ensure that its quality meets requirements. How, then, can data quality be ensured effectively? This requires knowledge of data quality assurance, including data collection objectives, collection methods, quality control measures, and quality monitoring mechanisms.

The significance of data quality assurance

Data quality assurance is an important part of data security management. It focuses on maintaining and improving data quality and preventing data from being damaged, leaked, or tampered with. It can be divided into two levels: static and dynamic. Static quality assurance means the data is fully protected once stored and is never modified, so its quality is immutable. Dynamic quality assurance means keeping data quality efficient and reliable across the data life cycle, ensuring accuracy, completeness, consistency, and timeliness. The significance of data quality assurance is mainly reflected in three aspects:

  • Protecting data security: quality assurance guards data, such as personal information, customer information, and product data, against illegal access, abuse, leakage, tampering, and other security risks.
  • Increasing data value: quality assurance helps companies collect genuine, valid information and generate new value through effective data analysis.
  • Promoting business growth: quality assurance supports rapid business development and sustainable growth.

Data cleaning process

Data cleaning refers to processing, converting, and filtering raw data so that it meets the needs and quality standards of its intended use. It can be approached from several angles, such as cleaning structured, semi-structured, or unstructured data, and can be carried out at different stages. The general process is as follows:

  1. Data collection: gathering data from business systems, third-party interfaces, log files, and so on.
  2. Data transmission: because of network conditions, hardware performance limits, and similar factors, transmission may be unreliable, so transmitted data often falls short of requirements.
  3. Data storage: after data is transmitted and stored, problems such as lost, erroneous, missing, or duplicated information may appear, lowering data quality.
  4. Data quality verification: beyond routine validation, data cleaning should introduce additional indicators to evaluate quality, such as the unique-identifier match rate, the duplicate-record rate, and value ranges.
  5. Data cleaning: performing the necessary conversion, filtering, and deletion on the raw data to remove or correct errors and anomalies, so that data quality meets requirements.
  6. Data integration: after cleaning, the data may need to be re-integrated before it can be used for downstream processing and analysis.
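As a minimal illustration, stages 4 through 6 can be sketched with pandas; the records, field names, and thresholds below are invented for the example:

```python
import pandas as pd

# Hypothetical raw records after collection, transmission, and storage
# (stages 1-3): one null id, one exact duplicate, one out-of-range age.
raw = pd.DataFrame({
    "user_id": [1, 2, 2, 3, None],
    "age": [25, -1, -1, 200, 30],
})

# Stage 4: quality indicators measured before cleaning.
duplicate_rate = raw.duplicated().mean()       # duplicate-record rate
null_id_rate = raw["user_id"].isna().mean()    # identifier completeness

# Stage 5: cleaning - drop null ids, deduplicate, enforce the value range.
clean = (
    raw.dropna(subset=["user_id"])
       .drop_duplicates()
       .query("0 <= age <= 120")
)

# Stage 6: 'clean' is now the integrated, analysis-ready dataset.
print(duplicate_rate, null_id_rate, len(clean))
```

The indicators computed before cleaning (stage 4) give a baseline that the cleaned result can be compared against.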

Introduction to data cleaning tools and their characteristics

Different application scenarios call for different data cleaning tools, as shown in the following table:

| Tool name | Applicable scenario | Characteristics |
| --- | --- | --- |
| SQL Server Integration Services (SSIS) | Structured data cleaning | Highly customizable and powerful |
| Apache Hive | Structured data cleaning | Fast; suited to very large data volumes |
| Hadoop MapReduce | Distributed data cleaning | Easy to use and highly scalable |
| Talend Open Studio | Semi-structured/unstructured data cleaning | Visual interface, comprehensive features |
| Data Wrangler | Web data cleaning | Simple to use, narrow in scope |

Case analysis: user profile data cleaning

Introduction to user profile data

"User profiling" is the process of mapping large amounts of user data, such as historical records, consumption habits, and preference tendencies, into labels or characteristics for each user via a computational model. User profiles can serve precision marketing, advertising, traffic optimization, tiered recommendation, real-time risk control, default warning, and more, improving organizational effectiveness and the efficiency of business conversion.

At a typical Internet company, user profile data is generally cleaned by dedicated "user profile engineers".

User profile data generally falls into two types: structured and unstructured. Structured data has a clearly defined form, such as a user's name, age, address, and occupation; unstructured data is more complex, such as a user's browsing and search habits, comments, and liked content.

For structured data, tools such as SQL Server Integration Services (SSIS) or Apache Hive are generally used. The main tasks are checking and correcting each field's data type and validity, removing invalid data, deleting duplicates, cleaning anomalies, converting timestamps into date format, and so on.
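As an illustration of those structured-cleaning tasks (type checks, deduplication, timestamp conversion), here is a pandas sketch; the columns and values are invented, and pandas stands in here only to show the logic, not the SSIS or Hive tooling itself:

```python
import pandas as pd

# Invented structured profile records: one duplicate row, one non-numeric
# age, and Unix timestamps that should become dates.
df = pd.DataFrame({
    "name": ["Alice", "Alice", "Bob"],
    "age": ["31", "31", "not_a_number"],
    "ts": [1600000000, 1600000000, 1600086400],
})

# Check and correct field types: invalid ages become NaN, then are dropped.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df = df.dropna(subset=["age"])

# Delete duplicate records.
df = df.drop_duplicates()

# Convert Unix timestamps into date format.
df["date"] = pd.to_datetime(df["ts"], unit="s").dt.date

print(df)
```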

For unstructured data, tools such as Talend Open Studio or Data Wrangler are generally used. The main tasks are summarizing, merging, classifying, labeling, deduplicating, and associating data from different sources to form standardized, easy-to-use user profile data. Association, deduplication, and labeling are especially important.
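A toy sketch of those merge-deduplicate-label steps; the sources, keywords, and tags are all hypothetical:

```python
# Merge behavior records from two hypothetical sources, deduplicate them,
# and attach coarse interest labels via invented keyword rules.
searches = [{"user": "u1", "text": "best ramen nearby"}]
comments = [
    {"user": "u1", "text": "loved this hiking trail"},
    {"user": "u1", "text": "loved this hiking trail"},  # duplicate record
]

merged = searches + comments
# Deduplicate by turning each record into a hashable, order-independent key.
deduped = [dict(t) for t in {tuple(sorted(r.items())) for r in merged}]

KEYWORDS = {"foodie": ["ramen", "recipe"], "travel": ["trail", "flight"]}

def label(text):
    """Return every tag whose keyword list matches the text."""
    return [tag for tag, words in KEYWORDS.items()
            if any(w in text for w in words)]

# Associate deduplicated records back to each user as a set of labels.
profile = {}
for rec in deduped:
    profile.setdefault(rec["user"], set()).update(label(rec["text"]))

print(profile)
```

Real systems would replace the keyword rules with trained models, but the merge, deduplicate, and label stages keep the same shape.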

Example: User profile data cleaning

Task details

A large e-commerce website generates massive amounts of user data every day. Its user profile data is messy and irregular and needs cleaning. The specific requirements are as follows:

  • Delete all records with a null occupation, since it is impossible to tell whether these users genuinely lack occupation information.
  • Replace the numeric user names in email addresses with the standard @-symbol form, to ease later analysis.
  • In the gender field, M means male, F means female, and U means unknown; keep the original value in all other cases.
  • Uniformly mark records with an age greater than 120 as over 120.
  • Clear records of users younger than 18 who registered before July 2019.
  • Label the user's location by country and region, and drop province- and city-level detail.
  • Replace the middle four digits of the mobile phone number with asterisks (*) to protect privacy.
  • Build user profiles from interests, hobbies, consumption habits, and similar data, and label users at fine granularity, with tags such as "foodie", "travel lover", and "influencer".
  • Run correlation analysis on purchase, search, and favorites data to learn which products, brands, and styles users prefer.
  • Generate reports counting the users in each field and distribute them to the relevant staff in advance.
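Several of these rules can be sketched in plain Python before turning to a specific tool; the field names and phone format below are hypothetical, not the site's actual schema:

```python
import re

GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F",
              "u": "U", "unknown": "U"}

def clean_record(rec):
    """Apply a few of the rules above to one user record (a dict).
    Returns None when the record should be deleted."""
    # Rule 1: drop records whose occupation is null or empty.
    if not rec.get("occupation"):
        return None
    # Rule 3: normalize gender to M / F / U, keeping other values as-is.
    g = str(rec.get("gender", "")).strip().lower()
    rec["gender"] = GENDER_MAP.get(g, rec.get("gender"))
    # Rule 4: uniformly mark ages above 120 as 120.
    if rec.get("age", 0) > 120:
        rec["age"] = 120
    # Rule 7: mask the middle four digits of an 11-digit phone number.
    phone = rec.get("phone", "")
    if re.fullmatch(r"\d{11}", phone):
        rec["phone"] = phone[:3] + "****" + phone[7:]
    return rec

rec = clean_record({"occupation": "engineer", "gender": "male",
                    "age": 150, "phone": "13812345678"})
print(rec)  # gender 'M', age 120, phone '138****5678'
```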
Solution
SSIS data cleaning

First, we can use SSIS to import the user profile data and then configure multiple cleaning rules to process it.

  1. Delete all records with a null occupation: since we cannot tell whether these users genuinely lack occupation information, the records can simply be deleted. Use the filter function to decide, based on whether the occupation field is empty, which records to drop.

If you are familiar with the SSIS scripting language, you can also write the corresponding script directly and call it from an SSIS built-in expression function.

  2. Replace the numeric user names in email addresses with the @-symbol form: the email field contains user names in numeric form, which are not standardized, so they need to be normalized. Using the scripting feature, arbitrary string operations can be performed between fields.
  3. In the gender field, M means male, F means female, and U means unknown: the field may arrive as M, m, male, and other variants; a script function can unify the formats.
  4. Uniformly mark records with an age greater than 120 as over 120: an IF condition can test whether the age value exceeds 120 and, if so, set it to 120.
  5. Clear records of users younger than 18 who registered before July 2019: an expression can compute the difference between the registration time and the current date and filter on it together with age.
  6. Label the user's location by country and region, dropping province- and city-level detail: this information can sometimes be extracted from the IP address, but IP addresses are sometimes obscured, so that method cannot be used directly here; a geolocation lookup against a city-level location database is recommended instead.
  7. Replace the middle four digits of the mobile phone number with asterisks (*) to protect privacy: ideally only the last four digits are displayed, and if the real digits must not be exposed, asterisks can stand in for the rest.
  8. Build user profiles from interests, hobbies, consumption habits, and similar data, and label users at fine granularity: such labels can be generated by modeling the user's history, searches, likes, and browsing behavior and deriving different labels from the model; no further example here.

  9. Run correlation analysis on purchase, search, and favorites data to learn which products, brands, and styles users prefer: this can be modeled with machine-learning methods, but since the data volume is large and much of it is private, no example is given here for now.

  10. Generate a report counting users in each field: with the steps above, the user profile data has been cleaned; we can now generate a report that counts the users in each field.
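The age-and-registration filter in step 5 can be sketched as follows; the cutoff comes from the task requirements, while the function and parameter names are hypothetical:

```python
from datetime import date

# Cutoff from the requirement "registered earlier than July 2019".
CUTOFF = date(2019, 7, 1)

def keep(age, registered_on):
    """Return False for records that step 5 clears out:
    users younger than 18 who registered before the cutoff."""
    return not (age < 18 and registered_on < CUTOFF)

print(keep(16, date(2019, 1, 5)))   # False: minor registered before cutoff
print(keep(16, date(2020, 3, 1)))   # True: registered after the cutoff
print(keep(25, date(2018, 5, 5)))   # True: adult user
```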

From the results, we can see that in the original user profile data, the gender and occupation information has been cleaned, phone numbers have been replaced with asterisks, and email user names now use the @ symbol. The newly generated user profile data, in turn, supports detailed analysis of users' consumption habits, interests, and hobbies.
Apache Hive data cleaning

For user profile data, Apache Hive is generally used as the data warehouse. Hive provides rich SQL query capabilities and can easily handle data cleaning tasks.

  1. Delete all records with a null occupation: in Hive, a row-level DELETE only works on ACID (transactional) tables; on ordinary tables, the usual approach is to overwrite the table with the filtered rows.

    INSERT OVERWRITE TABLE user_profile
    SELECT * FROM user_profile
    WHERE job IS NOT NULL AND job != '';
  2. Replace the numeric user names in email addresses with the @-symbol form: this can be done with Hive's TRANSFORM clause combined with a Python script.

  3. In the gender field, M means male, F means female, and U means unknown: use CASE WHEN...THEN...END or a UDF to complete the replacement.

  4. Uniformly mark records with an age greater than 120 as over 120: likewise use CASE WHEN...THEN...END or a UDF.

  5. Clear records of users younger than 18 who registered before July 2019: use a WHERE clause to filter them out.

  6. Label the user's location by country and region, dropping province- and city-level detail: use LZO compression, create an external table with CREATE EXTERNAL TABLE, and complete the query with SQL SELECT statements.

  7. Replace the middle four digits of the mobile phone number with asterisks (*) to protect privacy: likewise use CASE WHEN...THEN...END or a UDF.

  8. Establish user portraits based on data such as interests, hobbies, consumption habits, etc., and conduct fine-grained labeling of users: This type of labeling can be completed using content-based recommendation algorithms, but currently manual labeling methods are generally used.

  9. Run correlation analysis on purchase, search, and favorites data to learn which products, brands, and styles users prefer: this can be done with collaborative-filtering methods, but in practice manual annotation is still commonly used.

  10. Generate reports counting users in each field: such reports can be produced with HQL, but since the data volume is large and much of it is private, no example is given here.
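For the TRANSFORM approach mentioned in step 2, the Python side is a small streaming script that reads tab-separated rows and writes cleaned rows back; the column order, script name, and the email lowercasing shown here are illustrative assumptions:

```python
import re

def mask_phone(phone):
    """Replace the middle four digits of an 11-digit number with asterisks."""
    if re.fullmatch(r"\d{11}", phone):
        return phone[:3] + "****" + phone[7:]
    return phone

def process(line):
    """Normalize one tab-separated (email, phone) row."""
    email, phone = line.rstrip("\n").split("\t")
    return "\t".join([email.lower(), mask_phone(phone)])

# In a real TRANSFORM job the script would stream rows:
#     for line in sys.stdin:
#         print(process(line))
print(process("User123@Example.COM\t13812345678"))
```

In Hive, such a script would typically be registered with ADD FILE and invoked as `SELECT TRANSFORM(email, phone) USING 'python clean_profile.py' AS (email, phone) FROM user_profile;` (script and table names hypothetical).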

Origin blog.csdn.net/universsky2015/article/details/133446379