Filtering Chinese characters out of a field in Oracle: no more half-Chinese, half-foreign batch numbers

Table of contents

Foreword

1. Know yourself and the enemy

      1.1 Business scenario

      1.2 Error cases

2. Organizing ideas

      2.1 Comparison of storage length and string length

3. Are there any other ideas?

      3.1 ASCII table lookup method

      3.2 Formal cases

4. Summary

Foreword:

        As digitalization deepens, enterprises pay more and more attention to data governance and to supporting decisions with data. As a comprehensive service provider in the global new energy industry, our company also handles a lot of special business. Every company that digitizes its business leaves some room for "thankless business": lines whose volume is so small that standardized control is not worth the effort. These lines still go fully online, though, and small as a sparrow is, it has all its organs. As a result, comprehensive business analysis turns up some very annoying "abnormal data", such as batch numbers that contain Chinese characters. Today I would like to share how to deal with these mixed Chinese + English "foreign batches", and to summarize Chinese-character handling that applies to any enterprise.

 1. Know yourself and the enemy

        Why do Chinese characters appear in the data?

      1.1 Business scenario

        As shown in the figure above, the batch numbers of some customized business batches are not generated automatically by the system but are typed in by hand. Although the system imposes some constraints, Chinese characters inevitably get added so that the meaning of a batch is easy to see at a glance. I looked through a lot of SQL for filtering Chinese on the Internet. To be honest, most of it is nearly useless and does not meet the need, so I want to organize the topic systematically to avoid wasting everyone's time. You can also come back and check it whenever you need it~

     1.2 Error cases

        Error example

select *
from table_name
where regexp_like(text_field, '[\u4E00-\u9FA5]');

         Mistaken interpretation

        The code above uses the regexp_like function and is meant to filter Chinese characters by specifying a range of Unicode values. \u4E00 – \u9FA5 is the most commonly used range for Chinese characters in the Unicode table, so it would seem that under normal circumstances this meets the need.

        Error effect example

        As shown in the figure below, the result is the same as if no filter had been applied.

         Cause Analysis

        For the specific reason, see the article I wrote earlier. In Oracle, we cannot filter Chinese based on the Unicode range.

Oracle: replacing and removing Chinese characters in strings (They Call Me the Technical Director's blog, CSDN)
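
One possible reading of why the pattern behaves like no filter at all can be sketched in Python rather than Oracle. The sketch assumes that Oracle's POSIX-style regular expressions do not understand \uXXXX escapes and treat a backslash inside a bracket expression as a literal character; this is an assumption and is not confirmed by the article. Under that reading, '[\u4E00-\u9FA5]' becomes a set of ordinary ASCII characters that includes the range from 0 to \, which covers every digit and uppercase letter. The sample batch numbers are invented for illustration.

```python
import re

# Emulates the ASSUMED literal reading of '[\u4E00-\u9FA5]':
# the characters \ u 4 E 0, then the range '0'..'\', then u 9 F A 5.
LITERAL_READING = re.compile(r"[\\u4E00-\\u9FA5]")

# What the pattern was meant to express: the code points U+4E00..U+9FA5.
INTENDED = re.compile("[\u4e00-\u9fa5]")

# Hypothetical batch numbers, not taken from the article's data.
samples = ["B20230601", "定制-A01", "中文"]

for s in samples:
    literal_hit = bool(LITERAL_READING.search(s))
    intended_hit = bool(INTENDED.search(s))
    print(s, "literal reading:", literal_hit, "intended:", intended_hit)
```

If the assumption holds, any batch number with a digit or an uppercase letter passes the filter, which would be consistent with a result that looks unfiltered.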

2. Organizing ideas

      2.1 Comparison of storage length and string length

        When Chinese characters are stored in an Oracle database, one Chinese character occupies two or more bytes, depending on the database character set. If NLS_CHARACTERSET is AL32UTF8 or UTF8, which is the usual case, one Chinese character occupies three to four bytes (three for the common characters). If NLS_CHARACTERSET is ZHS16GBK, one Chinese character occupies two bytes.

                         As shown in the figure above, a Chinese character occupies 2 bytes.
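
The byte counts can be checked outside the database with a small Python sketch. It assumes that Python's "utf-8" and "gbk" codecs are reasonable stand-ins for Oracle's AL32UTF8 and ZHS16GBK character sets; it does not query Oracle itself.

```python
# Byte length of single characters under two encodings.
# ASSUMPTION: Python's "utf-8" and "gbk" codecs stand in for
# Oracle's AL32UTF8 and ZHS16GBK character sets.
common_char = "中"          # a common Chinese character
rare_char = "\U00020000"    # a character outside the Basic Multilingual Plane
ascii_char = "A"

utf8_common = len(common_char.encode("utf-8"))
utf8_rare = len(rare_char.encode("utf-8"))
gbk_common = len(common_char.encode("gbk"))
utf8_ascii = len(ascii_char.encode("utf-8"))

print("UTF-8, common Chinese character:", utf8_common)  # 3
print("UTF-8, rare Chinese character:", utf8_rare)      # 4
print("GBK, common Chinese character:", gbk_common)     # 2
print("UTF-8, ASCII letter:", utf8_ascii)               # 1
```

Only the ASCII letter has a byte length equal to its character length, which is the property the next method relies on.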

        So the first method is to use lengthb('中文') != length('中文'), that is, a byte length that differs from the character length, to find the batch numbers that contain Chinese. A correct example is shown below.

        Code:

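-- Query rows whose batch number contains Chinese
select 生产批次号
  from BI.QZ_zB_GCPJCSJ zb
 where 1 = 1
   and flag = '已清洗' -- '已清洗' means "cleaned"
   and LENGTH(生产批次号) != LENGTHB(生产批次号); -- batches containing Chinese

-- Query rows whose batch number contains no Chinese
select 生产批次号
  from BI.QZ_zB_GCPJCSJ zb
 where 1 = 1
   and flag = '已清洗' -- '已清洗' means "cleaned"
   and LENGTH(生产批次号) = LENGTHB(生产批次号); -- batches without Chinese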

        Effect:

       

         Interpretation:

        As shown in the picture above, batches that contain Chinese characters can be found with LENGTH(生产批次号) != LENGTHB(生产批次号), where 生产批次号 is the production batch number column. So if we only want to see non-Chinese batches, LENGTH(生产批次号) = LENGTHB(生产批次号) does the job. Isn't it very simple?
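
The same comparison can be tried out in Python as a minimal sketch. The helper name contains_multibyte and the sample batch numbers are hypothetical, and UTF-8 is assumed as the storage encoding.

```python
def contains_multibyte(text: str, encoding: str = "utf-8") -> bool:
    """Return True when the byte length differs from the character length."""
    return len(text) != len(text.encode(encoding))


# Hypothetical batch numbers, not taken from the article's data.
batches = ["B20230601", "定制-A01", "A01-返工"]

with_chinese = [b for b in batches if contains_multibyte(b)]
without_chinese = [b for b in batches if not contains_multibyte(b)]

print("contain multi-byte characters:", with_chinese)
print("pure single-byte:", without_chinese)
```

One point to keep in mind: the comparison flags any multi-byte character, not only Chinese. Full-width punctuation or an accented letter such as "é" also makes the two lengths differ, so the method fits data where Chinese is the only non-ASCII content expected.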

3. Are there any other ideas?

        Above we used the way Chinese is stored to filter Chinese data. We can also use the ASCII table to find it.

      3.1 ASCII table lookup method

        Batch numbers generally consist of digits, letters, special symbols and Chinese characters. Digits, letters and the usual special symbols are all in the ASCII table, so whatever is not in the ASCII table is Chinese. For details, refer to an ASCII table.

        Specifically, we replace the Chinese with a complex string that will never appear in the data, such as "他们叫我技术总监" ("They call me technical director"). If the string after replacement contains "他们叫我技术总监", the original contained Chinese. See the case below for details.

     3.2 Formal cases

        Code:

-- Replace Chinese with a string that cannot occur in the data,
-- then check whether the result contains that string
select 生产批次号
  from BI.QZ_ZB_GCPJCSJ
 where instr(regexp_replace(生产批次号,
                            '[' || chr(128) || '-' || chr(255) || ']',
                            '他们叫我技术总监'),
             '他们叫我技术总监',
             1,
             1) > 0;

         Effect:

        Interpretation:

         We replace the Chinese with a complex string that will not appear in the data, and then check whether the resulting string contains that replacement string. The replacement string itself can be Chinese.
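
The replace-then-check idea can be sketched in Python as follows. Note the swap: the SQL above matches the range chr(128) to chr(255), while this sketch matches any code point outside ASCII, which is an assumption about the intent. The helper names has_non_ascii and strip_non_ascii are hypothetical.

```python
import re

# Marker string that should never occur in real batch numbers.
MARKER = "他们叫我技术总监"

# ASSUMPTION: "not in the ASCII table" is taken as any code point above 0x7F.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def has_non_ascii(text: str) -> bool:
    """Replace every non-ASCII character with MARKER, then look for MARKER."""
    replaced = NON_ASCII.sub(MARKER, text)
    return MARKER in replaced


def strip_non_ascii(text: str) -> str:
    """Same pattern, used to remove the non-ASCII characters instead."""
    return NON_ASCII.sub("", text)


print(has_non_ascii("定制-A01"))
print(has_non_ascii("B20230601"))
print(strip_non_ascii("定制-A01"))
```

The same pattern serves both purposes: checking for the marker finds the rows, and substituting an empty string cleans them.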

 4. Summary

        This article covers the commonly used ways of handling Chinese: finding data that contains Chinese, finding data that does not, and replacing Chinese with other characters. Sometimes we also need to pull numbers, dates and the like out of strings that contain Chinese characters. For example, from "2.2 yuan/jin" we want 2.2 as the unit price, and from "2023-06-22 We are together" we want the date 2023-06-22 for analysis. I have written articles explaining those methods before, listed below. I hope they are useful to everyone, so that you no longer need to waste time on wrong tutorials on the Internet~

Oracle: replacing Chinese, carriage returns and line feeds, extracting in reverse order, and finding the approval date you want in a pile of OA opinions (They Call Me the Technical Director's blog, CSDN)

Obtaining the numeric part of a database string for data analysis (They Call Me the Technical Director's blog, CSDN)
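
As a rough illustration of the two extraction examples in the summary, here is a minimal Python sketch. It is not the method from the linked articles; the helper names first_number and first_date are hypothetical, and the date is assumed to be written as YYYY-MM-DD.

```python
import re
from datetime import date, datetime
from typing import Optional


def first_number(text: str) -> Optional[float]:
    """Return the first integer or decimal number found in the text."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def first_date(text: str) -> Optional[date]:
    """Return the first YYYY-MM-DD date found in the text (assumed format)."""
    match = re.search(r"\d{4}-\d{2}-\d{2}", text)
    return datetime.strptime(match.group(), "%Y-%m-%d").date() if match else None


unit_price = first_number("2.2 yuan/jin")
analysis_date = first_date("2023-06-22 We are together")

print(unit_price)
print(analysis_date)
```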

       


Origin blog.csdn.net/qq_29061315/article/details/131422690