Kettle Development, Day 41: Data Cleaning with String Replacement

Foreword:

Yesterday we used the Switch/Case component to split the date data into three categories: normally formatted values such as 2023/7/12 2:59:58, values containing the Chinese word for morning, and values containing the Chinese word for afternoon. However, data stored directly this way still contains many abnormal values: stray ਍ characters, the Chinese characters for morning and afternoon, and incomplete years written only as 23. These abnormal values would break our later data analysis, so we must clean them and restore them to a normal format.

1. String replacement

1.1 Function introduction

As shown in the figure above, string replacement turns string A into string B, so we can use it to remove unwanted content such as "਍" from the data.

1.2 A small example

For a simple replacement of A with B, we only need to select the input stream field to be processed, leave regular expressions turned off, enter A in the Search column and B in the Replace with column, and set the remaining options such as Set empty string? as shown in the figure below.

original string

The result after replacement

As shown in the figure above, the string ABaaABb was successfully changed to BBBBBBb. Because we chose case-insensitive matching, the lowercase a was replaced as well, so choose the matching options according to your own needs.
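The same behaviour can be reproduced outside Kettle. The following is only a minimal Python sketch of a literal (non-regex) replacement with a case-sensitivity switch; the helper name `replace_plain` and its parameters are hypothetical and are not part of Kettle.

```python
import re


def replace_plain(value, search, replacement, case_sensitive=True):
    """Replace every literal occurrence of `search` in `value`.

    Hypothetical helper: it imitates a replacement with regular
    expressions turned off and an optional case-insensitive match.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    # re.escape makes the search text literal; the lambda keeps the
    # replacement text literal as well (no backslash processing).
    return re.sub(re.escape(search), lambda _m: replacement, value, flags=flags)


# Case-insensitive: both "A" and "a" are replaced.
print(replace_plain("ABaaABb", "A", "B", case_sensitive=False))  # BBBBBBb
# Case-sensitive: only the uppercase "A" is replaced.
print(replace_plain("ABaaABb", "A", "B", case_sensitive=True))   # BBaaBBb
```

Comparing the two outputs makes it easy to see how much extra data the case-insensitive option touches.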

2. Special applications

As mentioned earlier, we also need to handle special characters such as ਍ and the Chinese characters for morning and afternoon, and for that we need regular expressions.

2.1 Regular expressions

As shown in the figure above, we use the regular expression ਍+ to match data containing ਍, so one or more ਍ characters at any position in the string are replaced with an empty string. The Chinese words for morning and afternoon are handled the same way: we match each word with its own regular expression (morning+ and afternoon+, written with the Chinese characters) and replace it. The final settings are shown in the figure below.
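The effect of these patterns can be checked with a small Python sketch. Two things here are assumptions rather than facts from the data: the sample strings are made up, and the Chinese words are assumed to be 上午 (morning) and 下午 (afternoon). The helper name `strip_markers` is hypothetical.

```python
import re

# One or more stray marker characters, anywhere in the string.
STRAY = re.compile("਍+")
# Assumption: the morning/afternoon markers are the words 上午 and 下午.
AM_PM = re.compile("上午|下午")


def strip_markers(value):
    """Remove stray ਍ characters and the morning/afternoon words."""
    without_stray = STRAY.sub("", value)
    return AM_PM.sub("", without_stray)


# Made-up sample values, only for illustration.
print(strip_markers("2023/7/12 2:59:58਍"))
print(strip_markers("2023/7/12 上午2:59:58"))
```

Both calls print 2023/7/12 2:59:58, since only the unwanted characters are removed and the rest of the value is untouched.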

2.2 Special treatment

As mentioned before, the year in our strings is incomplete: 2023 is written as 23, so we need to convert 23 to 2023. Note that 23 may also appear in the hours, minutes, and seconds. Therefore, we use the pattern ^23, which matches 23 only at the very beginning of the string. Only that leading 23 is replaced with 2023, and any 23 in the hours, minutes, or seconds is left alone. The corresponding effect is shown in the figure below.
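To see why the ^ anchor matters, here is a minimal Python sketch that compares the anchored and unanchored patterns. The sample value and the helper name `expand_year` are made up for illustration; Kettle itself runs on Java, but this simple pattern behaves the same way in both regex engines.

```python
import re


def expand_year(value):
    """Turn a leading two-digit year 23 into 2023 (hypothetical helper)."""
    # ^ anchors the match to the start of the string, so a 23 in the
    # time part is never touched.
    return re.sub(r"^23", "2023", value)


sample = "23.07.14 23:23:23"  # made-up value with 23 in every time field

print(expand_year(sample))              # 2023.07.14 23:23:23
print(re.sub(r"23", "2023", sample))    # 2023.07.14 2023:2023:2023
```

The second line shows the damage an unanchored pattern would do to the hours, minutes, and seconds.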

With the year fixed, we still need to turn 23.07.14 into 2023/07/14, so we replace "." with "/". Note that with regular expressions enabled we cannot search for a bare "." because it is a regex special character that matches any character; we need to write \. instead. The "/" can be written as \/ in the same way, although "/" has no special meaning inside the regular expression itself, so the backslash there is optional. The final effect is shown below.
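The difference between an escaped and an unescaped dot is easy to demonstrate. This is a minimal Python sketch with a made-up sample value and a hypothetical helper name, `dots_to_slashes`. In Python the replacement text is written as a plain "/".

```python
import re


def dots_to_slashes(value):
    """Replace every literal dot with a slash (hypothetical helper)."""
    # \. matches a literal dot; a bare . would match any character.
    return re.sub(r"\.", "/", value)


sample = "2023.07.14"  # made-up value

print(dots_to_slashes(sample))       # 2023/07/14
print(re.sub(r".", "/", sample))     # //////////
```

With the unescaped pattern every character is replaced, which is exactly the kind of unexpected result that points to a missing backslash.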

3. Summary

When using string replacement for data cleaning, we can use regular expressions for fuzzy matching, but we must check whether the pattern also matches data that should not be replaced. For example, an unanchored pattern for the year would also replace the 23 in the hours, minutes, and seconds.

Also, when a replacement does not have the effect we expected, we should check whether the pattern contains regex special characters. If it does, we need to escape them with a backslash (\) to get the intended replacement rule. Good luck~