Solving the problem of garbled characters after file upload and format conversion.

  When our project parses a file uploaded by a client and then inserts the parsed data into the database, one very painful problem is garbled text, because you do not know which encoding the client used. The user may even upload a file that is already garbled on their own machine. And even when we specify an encoding while reading the file, the transcoded content can still come out garbled, especially once it is inserted into the database. The problem is hard to control because we cannot guarantee that the file is free of garbled characters after it has been converted to the specified encoding.
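As a first line of defence, before running any content through a garbled-text heuristic, you can check how the bytes decode. Below is a minimal sketch (my addition, not part of the original code; the class name and the hard-coded byte array are purely illustrative): decode the uploaded bytes with the charset you intend to use and look for the U+FFFD replacement character, which Java inserts wherever the bytes are not valid in that charset.

package test.qhk.main;

import java.nio.charset.StandardCharsets;

public class DecodeCheckSketch {
    public static void main(String[] args) {
        // "测试" encoded as GBK; these bytes are not valid UTF-8
        byte[] uploadedBytes = {(byte) 0xB2, (byte) 0xE2, (byte) 0xCA, (byte) 0xD4};
        // new String(byte[], Charset) replaces malformed sequences with U+FFFD instead of throwing
        String decoded = new String(uploadedBytes, StandardCharsets.UTF_8);
        boolean looksGarbled = decoded.indexOf('\uFFFD') >= 0;
        System.out.println("decoded = " + decoded + ", looksGarbled = " + looksGarbled);
    }
}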

 In this project I upload and parse a file in .csv format. When a user creates an xlsx file in Excel and simply renames it to .csv [someone may ask why not create the csv directly; ordinary users really do this~], the file opens normally on their machine, but once it is uploaded to the server and converted to UTF-8 it becomes garbled, presumably because the renamed file is still in the binary xlsx format rather than plain text. Only files actually saved as csv upload normally. Garbled data like this must not be inserted into the database, so I searched online for a way to detect garbled text and am sharing it here.


package test.qhk.main;

import test.qhk.utils.MessyCodelUtils;

public class MessyCodeMain {
    public static void main(String[] args) {
        // Test data: the first string is valid Chinese, the remaining three are mojibake samples
        String[] codes = {"测试乱码",
                "ن‹و•له ",
                "役‹瑥•阿긺",
                "测è¯"};
        for (int i = 0; i < codes.length; i++) {
            if (MessyCodelUtils.isMessyCode(codes[i])) {
                System.out.println(String.format("Array index %s contains garbled text", i));
            }
        }
        
    }
}


package test.qhk.utils;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MessyCodelUtils {
    /**
     * Note: this garbled-text detection approach was found online.
     * Returns true if the character falls in a CJK-related Unicode block
     * (including CJK punctuation and half-width/full-width forms).
     */
    private static boolean isChinese(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        return ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
                || ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
                || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
    }
    /**
     * Function: determine whether a string is garbled. [2018-01-19 12:03:40][author: qinhongkun]
     * @param strName the string to check
     * @return true if the string is judged to be garbled
     */
    public static boolean isMessyCode(String strName) {
        // Strip all whitespace, then strip punctuation, so only "content" characters remain
        Pattern p = Pattern.compile("\\s*|\t*|\r*|\n*");
        Matcher m = p.matcher(strName);
        String after = m.replaceAll("");
        String temp = after.replaceAll("\\p{P}", "");
        char[] ch = temp.trim().toCharArray();
        float chLength = 0;
        float count = 0;
        for (int i = 0; i < ch.length; i++) {
            char c = ch[i];
            // Only examine characters that are not letters or digits;
            // those that are also not recognizable Chinese are counted as suspicious
            if (!Character.isLetterOrDigit(c)) {
                if (!isChinese(c)) {
                    count = count + 1;
                }
                chLength++;
            }
        }
        if (chLength == 0) {
            // Only letters, digits, whitespace and punctuation: treat as not garbled
            return false;
        }
        // If more than 40% of the examined characters are unrecognizable, treat the string as garbled
        return count / chLength > 0.4;
    }

}
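To tie this back to the CSV-upload scenario, here is a minimal usage sketch (my addition; the file path is an illustrative placeholder and the database insert is left as a comment): read the uploaded file line by line as UTF-8 and skip any line that MessyCodelUtils.isMessyCode flags as garbled before inserting it.

package test.qhk.main;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import test.qhk.utils.MessyCodelUtils;

public class CsvUploadCheckSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative path to the uploaded file already stored on the server
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream("/tmp/upload.csv"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (MessyCodelUtils.isMessyCode(line)) {
                    System.out.println("Skipping garbled line: " + line);
                    continue; // do not insert garbled rows
                }
                // insertIntoDatabase(line); // placeholder for the real insert
            }
        }
    }
}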


If anything here is unreasonable, or there is a better way to handle it, please leave a comment~

Original post: blog.csdn.net/Qin_HongKun/article/details/79176186