Java: Get File Encoding

Table of contents

1. Overview

2. Coding Basics

2.1. ISO-8859-1

2.2. GB2312/GBK

2.3. Unicode

2.4. UTF

3. Shift operations in Java: >>, <<, >>>

3.1. Left shift <<

3.2. Right shift >>

3.3. Unsigned right shift >>>

4. Java: get file encoding

4.1. Judging by the first bytes of the file (BOM)

4.2. Checking byte patterns beyond the first bytes

4.3. Obtaining the file encoding with the cpdetector library


1. Overview

In the following discussion, the two characters "中文" (Chinese for "Chinese") serve as the running example. Looking them up in the relevant tables, their GB2312 encoding is "d6d0 cec4", their Unicode code points are "4e2d 6587", and their UTF-8 encoding is "e4b8ad e69687". Note that these two characters have no ISO-8859-1 encoding, yet they can still be "represented" as a sequence of ISO-8859-1 bytes.

2. Coding Basics

One of the earliest encodings is ISO-8859-1, a single-byte superset of ASCII. To represent other languages, many standard encodings gradually emerged; the important ones are described below.

2.1. ISO-8859-1

ISO-8859-1 is a single-byte encoding: it can represent at most 256 characters (values 0-255) and is used mainly for Western European languages. For example, the letter 'a' is encoded as 0x61 = 97. This range is obviously too narrow to represent Chinese characters. However, because it is a single-byte encoding that matches the computer's most basic storage unit, the byte, it is still used in many situations, and many protocols use it by default. For example, although "中文" has no ISO-8859-1 encoding, its GB2312 encoding "d6d0 cec4" can be viewed through ISO-8859-1 as the four bytes "d6 d0 ce c4" (storage is byte-oriented in any case). In UTF-8 the same two characters become the six bytes "e4 b8 ad e6 96 87". Clearly, this kind of representation only makes sense when layered on top of another encoding.
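To see these byte sequences concretely, here is a small sketch (the class and helper names are illustrative; it assumes the JDK's built-in GBK charset, which ships with standard JDK distributions):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Render a byte array as lowercase hex digits.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "中文";
        System.out.println(hex(s.getBytes(Charset.forName("GBK"))));    // d6d0cec4
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // e4b8ade69687
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 4e2d6587
    }
}
```

The three hex strings match the GB2312, UTF-8, and Unicode values quoted in the overview.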

2.2. GB2312/GBK

This is the Chinese national standard encoding, designed specifically to represent Chinese characters. Chinese characters take two bytes, while English letters keep their single-byte ISO-8859-1 values (so it is compatible with ISO-8859-1). GBK can represent both traditional and simplified characters, while GB2312 covers only simplified characters; GBK is backward compatible with GB2312.

2.3. Unicode

This is the unified encoding that can represent the characters of all languages. As discussed here (UTF-16/UCS-2), it is a fixed-length two-byte encoding (four bytes for supplementary characters), including for English letters. It is therefore not byte-compatible with ISO-8859-1 or with any other single-byte encoding. For characters in the ISO-8859-1 range, however, Unicode simply prepends a zero byte: the letter 'a' becomes "00 61".
Note that a fixed-length encoding is easy for computers to process (GB2312/GBK, by contrast, mixes one- and two-byte characters), and Unicode can represent all characters, which is why much software, including Java, uses Unicode internally.

2.4. UTF

Because Unicode (UTF-16) is not compatible with ISO-8859-1 and tends to waste space (even English letters take two bytes), it is inconvenient for transmission and storage. Hence UTF-8 was created: it is compatible with ISO-8859-1 and can still represent the characters of all languages. UTF-8 is a variable-length encoding in which each character takes 1 to 4 bytes (the original design allowed up to 6), and its byte patterns carry a simple self-validating structure. In general, English letters take one byte and Chinese characters take three.
Note that UTF-8 saves space only relative to UTF-16. If the text is known to be Chinese, GB2312/GBK is undoubtedly more compact (two bytes per character versus three). On the other hand, even for Chinese web pages, UTF-8 is usually smaller than UTF-16, because pages contain many ASCII characters (markup, English words, and so on).
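The per-character byte counts claimed above can be checked directly (a minimal standalone sketch; the class name is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        System.out.println("a".getBytes(StandardCharsets.UTF_8).length);  // 1: ASCII letter
        System.out.println("中".getBytes(StandardCharsets.UTF_8).length); // 3: CJK character
        // The same CJK character takes 2 bytes in GBK and 2 bytes in UTF-16.
    }
}
```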

3. Shift operations in Java: >>, <<, >>>

The binary representation of decimal 12345 (as a 32-bit int) is 0000 0000 0000 0000 0011 0000 0011 1001, i.e. 0x3039.

3.1. Left shift <<

Shifting left by one bit is equivalent to multiplying by 2, and shifting right by one bit is equivalent to dividing by 2 (for non-negative values). However, for the int value 12345, shifting left by 18 bits makes the highest bit 1, so the result is negative; only after shifting that result back to the right does it become positive again. A shift is therefore not exactly the same as multiplying by a power of 2: overflow can change the sign.

This raises a question: when an int is shifted left by 32 bits or more, does Java simply append that many zeros on the right, making the result 0?

Definitely not. For an int, Java first reduces the shift distance modulo 32, so <<36 is the same as <<4 (for a long, the distance is taken modulo 64).
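Both behaviors, the sign flip on overflow and the modulo-32 shift distance, can be verified with a short sketch (class name illustrative):

```java
public class ShiftDemo {
    public static void main(String[] args) {
        int n = 12345;
        System.out.println(n << 1);                // 24690: one left shift doubles the value
        System.out.println(n << 18);               // negative: bit 31 becomes 1 after the shift
        System.out.println((n << 36) == (n << 4)); // true: shift distance is taken mod 32 for int
    }
}
```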

3.2. Right shift >>

The right shift operator >> fills the vacated high bits with copies of the sign bit (an arithmetic shift). For the positive number 12345 the sign bit is 0, so zeros are shifted in from the left; for a negative number the sign bit is 1, so ones are shifted in instead.

3.3. Unsigned right shift >>>

The unsigned right shift operator >>> works like >>, except that it always fills the vacated high bits with 0, regardless of the sign bit. For positive numbers the two operators give the same result; for negative numbers they differ.
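The difference only shows up for negative values, as a quick sketch confirms (class name illustrative):

```java
public class RightShiftDemo {
    public static void main(String[] args) {
        System.out.println(12345 >> 2);  // 3086: for positive values, >> and >>> agree
        System.out.println(12345 >>> 2); // 3086
        System.out.println(-8 >> 1);     // -4: the sign bit (1) is copied into the high bits
        System.out.println(-8 >>> 1);    // 2147483644: zeros fill the high bits instead
    }
}
```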

4. Java: get file encoding

On Windows, Notepad marks a text file's encoding with a byte order mark (BOM) at the start of the file:

ANSI: no BOM
Unicode (UTF-16LE): the file starts with the two bytes FF FE
Unicode big endian (UTF-16BE): the file starts with the two bytes FE FF
UTF-8 (with BOM): the file starts with the three bytes EF BB BF

4.1. Judging by the first bytes of the file (BOM)

public static String codeString(String fileName) throws IOException {
    // Read the first two bytes and compare them against the known BOM prefixes.
    int p;
    try (BufferedInputStream bin = new BufferedInputStream(new FileInputStream(fileName))) {
        p = (bin.read() << 8) + bin.read();
    }
    switch (p) {
    case 0xefbb:  // first two bytes of the UTF-8 BOM (EF BB BF)
        return "UTF-8";
    case 0xfffe:  // UTF-16 little endian BOM
        return "UTF-16LE";
    case 0xfeff:  // UTF-16 big endian BOM
        return "UTF-16BE";
    default:      // no BOM: fall back to the local default (GBK here)
        return "GBK";
    }
}
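A quick way to exercise this kind of check: write a UTF-8 BOM to a temporary file and read back the first two bytes. This is a self-contained sketch; the class name and the `guess` helper are illustrative, with the same byte-to-name mapping as above:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomCheck {
    // Map the first two bytes of a file to an encoding name.
    static String guess(int p) {
        switch (p) {
        case 0xefbb: return "UTF-8";
        case 0xfffe: return "UTF-16LE";
        case 0xfeff: return "UTF-16BE";
        default:     return "GBK";
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("bom", ".txt");
        // UTF-8 BOM (EF BB BF) followed by ASCII text
        Files.write(tmp, new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'});
        try (InputStream in = Files.newInputStream(tmp)) {
            int p = (in.read() << 8) + in.read();
            System.out.println(guess(p)); // prints UTF-8
        } finally {
            Files.delete(tmp);
        }
    }
}
```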

4.2. Checking byte patterns beyond the first bytes

Relying on the first three bytes alone is quite error-prone, so this version also scans further into the file, looking for byte sequences characteristic of UTF-8 to decide the encoding.

/**
 * Determine the charset of a text file; the first three bytes may carry a BOM.
 * @param path the file path
 * @return the detected charset name
 */
public static String charset(String path) {
    String charset = "GBK";
    byte[] first3Bytes = new byte[3];
    try {
        boolean checked = false;
        BufferedInputStream bis = new BufferedInputStream(new FileInputStream(path));
        bis.mark(100); // reserve enough read-ahead for reset() below; mark(0) would be a bug
        int read = bis.read(first3Bytes, 0, 3);
        if (read == -1) {
            bis.close();
            return charset; // empty file: treat as ANSI (default)
        } else if (first3Bytes[0] == (byte) 0xFF && first3Bytes[1] == (byte) 0xFE) {
            charset = "UTF-16LE"; // "Unicode" in Notepad terms
            checked = true;
        } else if (first3Bytes[0] == (byte) 0xFE && first3Bytes[1] == (byte) 0xFF) {
            charset = "UTF-16BE"; // "Unicode big endian"
            checked = true;
        } else if (first3Bytes[0] == (byte) 0xEF && first3Bytes[1] == (byte) 0xBB
                && first3Bytes[2] == (byte) 0xBF) {
            charset = "UTF-8"; // UTF-8 with BOM
            checked = true;
        }
        bis.reset();
        if (!checked) {
            while ((read = bis.read()) != -1) {
                if (read >= 0xF0)
                    break;
                if (0x80 <= read && read <= 0xBF) // a lone continuation byte: treat as GBK
                    break;
                if (0xC0 <= read && read <= 0xDF) {
                    read = bis.read();
                    if (0x80 <= read && read <= 0xBF) // two-byte UTF-8 sequence (0xC0-0xDF
                        // followed by 0x80-0xBF); could also be valid GB bytes
                        continue;
                    else
                        break;
                } else if (0xE0 <= read && read <= 0xEF) { // three-byte sequence; small chance of error
                    read = bis.read();
                    if (0x80 <= read && read <= 0xBF) {
                        read = bis.read();
                        if (0x80 <= read && read <= 0xBF) {
                            charset = "UTF-8";
                            break;
                        } else
                            break;
                    } else
                        break;
                }
            }
        }
        bis.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    System.out.println("-- file -> [" + path + "] detected charset: [" + charset + "]");
    return charset;
}

4.3. Obtaining the file encoding with the cpdetector library

public static String getFileCharset(String filePath) throws Exception {
    CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
    /* ParsingDetector inspects HTML, XML, and similar files or character streams.
     * The constructor argument controls whether detection details are printed;
     * false means silent.
     */
    detector.add(new ParsingDetector(false));
    /* JChardetFacade wraps JChardet, provided by the Mozilla organization, and can
     * determine the encoding of most files, so this detector alone satisfies most
     * projects. For extra safety, add more detectors, such as the ASCIIDetector
     * and UnicodeDetector below.
     */
    detector.add(JChardetFacade.getInstance());
    detector.add(ASCIIDetector.getInstance());
    detector.add(UnicodeDetector.getInstance());
    Charset charset = null;
    try (InputStream is = new BufferedInputStream(new FileInputStream(filePath))) {
        // Alternatively: charset = detector.detectCodepage(new File(filePath).toURI().toURL());
        charset = detector.detectCodepage(is, 8);
    } catch (Exception e) {
        e.printStackTrace();
        throw e;
    }

    String charsetName = "GBK";
    if (charset != null) {
        if (charset.name().equals("US-ASCII")) {
            charsetName = "ISO_8859_1";
        } else if (charset.name().startsWith("UTF")) {
            charsetName = charset.name(); // e.g. UTF-8, UTF-16BE
        }
    }
    return charsetName;
}


Origin blog.csdn.net/m0_48983233/article/details/122893008