Data Cleaning Experiment

Part 1: String Cleaning

Experimental background:

This part introduces three string-cleaning steps found in the transformation step directory: string operations, string replacement, and string cutting.

 

Experimental steps :

  1. Conversion diagram

 

2. Step configuration

Create a new conversion named string_op.

Use an "Enter custom constant data (Data Grid)" step as the input. On its "Metadata (Meta)" tab, create three fields with the type set to "String".

Enter the sample data shown earlier on the "Data" tab.

Add a "String Operations" step, create a hop to it from the "Data Grid" step, and configure the "String Operations" step as follows:

Add the ID field to the input fields and set "Trim Type" to "both" to remove leading and trailing whitespace; add the CITY field and set "Lower/Upper" to "upper" to convert it entirely to uppercase; add the CODE field and set "Digits" to "only" to filter out non-digit characters.
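The three operations above can be sketched in Python (the sample row is hypothetical; the actual values come from the Data Grid step):

```python
import re

# Hypothetical sample row standing in for the Data Grid input.
row = {"ID": "  a001  ", "CITY": "bj-Beijing", "CODE": "0012ab"}

row["ID"] = row["ID"].strip()                 # Trim Type = both
row["CITY"] = row["CITY"].upper()             # Lower/Upper = upper
row["CODE"] = re.sub(r"\D", "", row["CODE"])  # Digits = only: drop non-digits

print(row)  # {'ID': 'a001', 'CITY': 'BJ-BEIJING', 'CODE': '0012'}
```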

Add a "Replace in String" step and create a hop to it from the "String Operations" step.

Here we need a regular expression that matches the beginning of the CODE field, which can be written as "^([0]*)". This expression matches the run of zeros (of any length, possibly empty) at the start of the string.

In addition, when using a regular expression for the replacement, set "use RegEx" to "Y" to indicate that the search pattern should be treated as a regular expression.
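The effect of this replacement, sketched with Python's `re` module (the input values are made-up examples):

```python
import re

# "Replace in String" with use RegEx = Y: strip leading zeros from CODE.
pattern = r"^([0]*)"  # the regex from the text: any run of 0s at the start

print(re.sub(pattern, "", "0012"))  # -> "12"
print(re.sub(pattern, "", "0000"))  # -> ""
print(re.sub(pattern, "", "12"))    # -> "12" (unchanged, empty match)
```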

 

When using this step for cleaning, two assumptions are made: the abbreviation and connector in front of the city name are exactly 3 characters long, and the city name is no longer than 100 characters.

"Cut from" is set to "3", so cutting starts at character position 3 (dropping the 3-character prefix); "Cut to" is set to "100", so cutting stops before position 100. Note that character positions here are counted from 0.
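With zero-based positions, this cut behaves like a Python slice (the sample city string below is hypothetical):

```python
# "Strings cut" with Cut from = 3 and Cut to = 100, as a zero-based slice.
def cut(s: str, cut_from: int = 3, cut_to: int = 100) -> str:
    return s[cut_from:cut_to]

print(cut("BJ-BEIJING"))  # -> "BEIJING": the 3-character prefix is dropped
```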

3. Conversion result

 

Part 2: Field Cleaning

Experimental background:

 

 

Experimental steps:

  1. Step configuration and transformation diagrams

Transformation 1: Split the city field into multiple rows with the "Split field to rows" step

Create a new conversion named field_op and add a "Data Grid" input step. Add three fields to this step: number, province, and city, all of type "String", and enter the sample data shown in the figure.

 

The new field is set to "City NEW"

The sample data is separated by "，", the full-width (Chinese) comma, so the delimiter can simply be set to "，".

But what if the data contains both Chinese and English commas, or even Chinese and English semicolons, or the enumeration comma "、"? Since this step's delimiter supports regular expressions, the delimiter can be set to the character class "[,，;；、]", which encloses all the possible delimiters in square brackets; then check the option "Delimiter is a Regular Expression".
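The regular-expression delimiter can be checked with Python's `re.split` (the city string below is a made-up example, and the character class mirrors the one described above):

```python
import re

# A character class matching mixed separators: ASCII comma, full-width comma,
# both semicolons, and the enumeration comma.
delimiter = r"[,，;；、]"

print(re.split(delimiter, "北京，上海;广州、深圳"))
# -> ['北京', '上海', '广州', '深圳']
```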

Transformation 2: Split and merge data with the "Split Fields" step and the "Merge Fields" step

Create a new conversion named field_op_1; its input can be a copy of the previous Data Grid step.

Add a "Split Fields" step and configure it as follows:

The field to split is set to city, and the delimiter is set to "，". Create four new fields, city 1 through city 4, all of type String.

In the "Merge Fields" step, select these four fields (city 1, city 2, city 3, city 4), set the output field to city with a length of 100, and set the delimiter to ";".
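The split-then-merge pipeline can be sketched on a single hypothetical row (the city values are invented for illustration):

```python
# One row's city value, as it might appear in the Data Grid step.
city = "北京，上海，广州，深圳"

# Split Fields: delimiter "，", producing four new String fields.
city1, city2, city3, city4 = city.split("，")

# Merge Fields: recombine the four fields with delimiter ";".
merged = ";".join([city1, city2, city3, city4])
print(merged)  # -> "北京;上海;广州;深圳"
```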

 

Transformation 3: Clean the data with the "Select Values" step

Create a new conversion named field_op_2 and add a "Data Grid" input step. Add five fields to this step, set the appropriate data type for each, and enter a set of sample data for the demonstration.

First, add a "Select Values" step named "Select Values-remove" to delete the "Age" field: add "Age" on the "Remove" tab. Previewing this step shows that the output records no longer contain the "Age" field.

Second, add a "Select Values" step named "Select Values-meta". This step modifies metadata: on the "Metadata" tab, change the date format of the "Birth" field and change the data type of the "Salary" field.

Third, add a "Select Values" step named "Select Values-alter". On the "Select & Alter" tab, rename the "Sex" field to "Gender" and move the "Gender" field so that it follows the "Name" field.
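The three Select Values stages can be sketched on one hypothetical row (the field names follow the text; the values are invented):

```python
# A made-up input row with the five fields from the Data Grid step.
row = {"Number": "1", "Name": "Li", "Sex": "F", "Age": "30",
       "Birth": "1990/01/02", "Salary": "8000"}

# Select Values-remove: drop the Age field.
row.pop("Age")

# Select Values-meta: change the Salary field's type (string -> number);
# the Birth date format would be adjusted here as well.
row["Salary"] = float(row["Salary"])

# Select Values-alter: rename Sex to Gender and place it after Name.
row["Gender"] = row.pop("Sex")
order = ["Number", "Name", "Gender", "Birth", "Salary"]
row = {k: row[k] for k in order}
print(row)
```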

 

Experiment summary:

       All good analysis is built on large, clean data. Real data, however, is messy, dirty, and chaotic: "garbage in, garbage out!" Data quality can be very poor and riddled with problems, such as duplicate records, incomplete or missing values, inconsistencies between data from different sources, and, worse still, outright errors. Before such data can be used, it must be cleaned.

        Data cleaning attempts to detect and remove noisy and irrelevant data from a data set, handle missing data, filter out noise in blank fields against the relevant domain knowledge, and resolve data consistency and uniqueness problems, so as to improve overall data quality.

        Kettle does not provide a single all-purpose cleaning step; a cleaning job is accomplished by combining multiple steps. Through this experiment, I gained an understanding of string and field cleaning and became familiar with several of Kettle's conversion steps and their configuration.

 

 

 


Origin blog.csdn.net/weixin_56264090/article/details/128828077