Java basics: character encoding

One. The development of character encoding:

Phase 1:
Computers only understand numbers, so all data inside a computer is represented as numbers. Because English uses a limited set of symbols,
the most significant bit of each byte is required to be 0, and each character is represented by a number from 0 to 127. For example, 'A' corresponds to 65 and 'a' corresponds to 97.
This is the American Standard Code for Information Interchange (ASCII).
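The mapping between characters and numbers can be checked directly in Java. The following is a minimal sketch; the class name AsciiDemo and the helper isAscii are made up for this illustration.

```java
class AsciiDemo {
    // Returns true if every char in the string fits in 7 bits (0..127),
    // i.e. the most significant bit of its single byte would be 0.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println((int) 'A');          // 65
        System.out.println((int) 'a');          // 97
        System.out.println((char) 66);          // B
        System.out.println(isAscii("Hello"));   // true
        System.out.println(isAscii("\u54E5"));  // false: a Chinese character
    }
}
```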

Phase 2:
As computers spread around the world, many countries and regions needed to bring their own characters into the computer, for example Chinese characters.
It then turned out that the range of numbers one byte can represent is too small to hold all Chinese characters, so it was specified that two bytes represent one character.
The rule is: the original ASCII codes stay unchanged and still use one byte; to tell one Chinese character apart from two ASCII characters,
the most significant bit of each byte of a Chinese character is set to 1 (so, read as signed bytes, the bytes of a Chinese character are negative). This specification is the GB2312 encoding.
Later, more characters were added on top of GB2312, and the result is GBK.
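The "negative byte" rule can be observed by encoding a string with GB2312 and printing the raw bytes. This is only a sketch: the class name Gb2312BytesDemo is hypothetical, and it assumes the runtime ships the GB2312 charset (standard desktop and server JDKs do, but a trimmed-down runtime may not).

```java
import java.nio.charset.Charset;

class Gb2312BytesDemo {
    // Assumption: "GB2312" is available on this JVM; Charset.forName throws
    // UnsupportedCharsetException otherwise.
    static byte[] gb2312Bytes(String s) {
        return s.getBytes(Charset.forName("GB2312"));
    }

    public static void main(String[] args) {
        byte[] ascii = gb2312Bytes("a");
        byte[] chinese = gb2312Bytes("\u4E2D"); // the character 中

        // ASCII keeps its single, non-negative byte.
        System.out.println(ascii.length + " byte: " + ascii[0]);

        // A Chinese character takes two bytes, each with the high bit set,
        // so both values print as negative numbers.
        System.out.println(chinese.length + " bytes: " + chinese[0] + ", " + chinese[1]);
    }
}
```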

Phase 3:
This created a new problem: Chinese characters are recognized in China, but when they are passed to another country whose code table does not include them, they are displayed as other symbols or as garbled text.
To remove the impact of these localized encodings, all symbols were encoded together in Unicode, a single worldwide encoding.
From then on, a character has the same fixed code anywhere in the world; for example, '哥' (brother) is represented by hexadecimal 54E5 everywhere.
In Java, a char stores a Unicode character as a two-byte (16-bit) UTF-16 code unit; characters outside the Basic Multilingual Plane need two chars.
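A short sketch of how this looks in Java, where a char is a 16-bit UTF-16 unit. The class name CodePointDemo and the helper hex are made up for this example; the code point U+1F600 is just one convenient character outside the two-byte range.

```java
class CodePointDemo {
    // Formats a Unicode code point as upper-case hexadecimal.
    static String hex(int codePoint) {
        return Integer.toHexString(codePoint).toUpperCase();
    }

    public static void main(String[] args) {
        char brother = '\u54E5'; // the character 哥
        System.out.println(hex(brother)); // 54E5

        // A code point above U+FFFF does not fit in one char:
        // it is stored as a pair of chars (a surrogate pair).
        String big = new String(Character.toChars(0x1F600));
        System.out.println(big.length());                         // 2
        System.out.println(big.codePointCount(0, big.length()));  // 1
    }
}
```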

Common character sets:
ASCII: one byte per character; it contains only 128 characters and cannot represent Chinese characters.
ISO-8859-1 (Latin-1): one byte per character; it covers Western European languages and cannot represent Chinese characters.
ANSI: not one fixed encoding but the default code page of the operating system; on a Simplified Chinese operating system, ANSI refers to GB2312/GBK, with two bytes per Chinese character.
GB2312 / GBK / GB18030: two bytes per Chinese character (GB18030 also has four-byte codes); Chinese is supported.
UTF-8: a variable-length character encoding for Unicode; it is one of the implementations of Unicode.
Its single-byte codes are identical to ASCII, so software that handles ASCII text can keep working with no changes or only small ones.
For this reason it has gradually become the preferred encoding for e-mail, web pages, and other applications that store or transmit text. The Internet Engineering Task Force (IETF) requires all Internet protocols to support UTF-8.

UTF-8 BOM: a Microsoft convention that puts a 3-byte byte order mark (EF BB BF) at the start of the file; do not use it.
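When a file that starts with these three bytes has to be read anyway, the mark can be dropped before decoding. A minimal sketch, with the class name BomDemo and the helper stripUtf8Bom invented for this illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BomDemo {
    // Returns the data without a leading UTF-8 byte order mark (EF BB BF).
    // Data that does not start with the mark is returned unchanged.
    static byte[] stripUtf8Bom(byte[] data) {
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return Arrays.copyOfRange(data, 3, data.length);
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String text = new String(stripUtf8Bom(withBom), StandardCharsets.UTF_8);
        System.out.println(text);          // hi
        System.out.println(text.length()); // 2
    }
}
```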

Storage of letters, digits and Chinese characters:
Letters and digits take 1 byte in every character set listed above.
Chinese characters take 2 bytes in the GBK family and 3 bytes in UTF-8.
A single-byte character set (ASCII / ISO-8859-1) cannot store Chinese.
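These byte counts can be measured with String.getBytes. The sketch below uses a hypothetical class ByteLengthDemo and assumes the GBK charset is available on the runtime, which is the case on standard JDKs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteLengthDemo {
    // Number of bytes the string occupies in the given charset.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is installed
        String letter = "a";
        String chinese = "\u54E5"; // the character 哥

        System.out.println(byteLength(letter, gbk));                      // 1
        System.out.println(byteLength(letter, StandardCharsets.UTF_8));   // 1
        System.out.println(byteLength(chinese, gbk));                     // 2
        System.out.println(byteLength(chinese, StandardCharsets.UTF_8));  // 3

        // ISO-8859-1 has no code for the character, so getBytes silently
        // substitutes '?' (byte value 63) and the character is lost.
        byte[] lost = chinese.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(lost.length + " byte: " + lost[0]);            // 1 byte: 63
    }
}
```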

Two. Character encoding and decoding operations:
Encoding: converts a string into a byte array.
Decoding: converts a byte array into a string.
Encoding and decoding must use the same character set, otherwise the result is garbled.
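A small sketch of both operations, and of what happens when the character sets do not match. The class RoundTripDemo and its method names are made up for this example. It also shows why re-encoding with ISO-8859-1 can repair such text: ISO-8859-1 maps every byte to exactly one char, so the original bytes survive the wrong decoding.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encoding: string -> byte array.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decoding with the same charset: byte array -> original string.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decoding with a different charset produces garbled text.
    static String decodeWrong(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Recover the original bytes from the garbled text, then decode correctly.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4F60\u597D"; // 你好
        byte[] bytes = encode(original);

        System.out.println(decode(bytes).equals(original));      // true
        String garbled = decodeWrong(bytes);
        System.out.println(garbled.equals(original));            // false
        System.out.println(garbled.length());                    // 6, one char per byte
        System.out.println(repair(garbled).equals(original));    // true
    }
}
```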
Three. A utility class:

import java.io.UnsupportedEncodingException;
/**
* Character encoding utility class
*/
public class CharTools {

  /**
   * Converts text that was decoded as ISO-8859-1 back to GB2312
   * @param text
   * @return
   */
  public static final String ISO2GB(String text) {
    String result = "";
    try {
      result = new String(text.getBytes("ISO-8859-1"), "GB2312");
    }
    catch (UnsupportedEncodingException ex) {
      result = ex.toString();
    }
    return result;
  }

  /**
   * Converts GB2312 text to its ISO-8859-1 form (one char per byte)
   * @param text
   * @return
   */
  public static final String GB2ISO(String text) {
    String result = "";
    try {
      result = new String(text.getBytes("GB2312"), "ISO-8859-1");
    }
    catch (UnsupportedEncodingException ex) {
      ex.printStackTrace();
    }
    return result;
  }

  /**
   * UTF-8 URL encoding: percent-encodes characters above U+00FF, other characters are kept as-is
   * @param text
   * @return
   */
  public static final String Utf8URLencode(String text) {
    StringBuilder result = new StringBuilder();

    for (int i = 0; i < text.length(); i++) {

      char c = text.charAt(i);
      if (c >= 0 && c <= 255) {
        result.append(c);
      }else {

        // UTF-8 is always available, so no checked exception has to be swallowed here
        byte[] b = Character.toString(c).getBytes(java.nio.charset.StandardCharsets.UTF_8);

        for (int j = 0; j < b.length; j++) {
          int k = b[j];
          if (k < 0) k += 256;
          result.append("%" + Integer.toHexString(k).toUpperCase());
        }

      }
    }

    return result.toString();
  }

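  /**
   * UTF-8 URL decoding (only 3-byte codes of the form %Exx%xx%xx are decoded)
   * @param text
   * @return
   */
  public static final String Utf8URLdecode(String text) {
    if (text == null || text.isEmpty()) return text;

    StringBuilder result = new StringBuilder();
    int start = 0;
    int p = text.indexOf('%');

    while (p != -1) {
      // A candidate is "%e" or "%E" with at least 9 characters available.
      boolean candidate = p + 9 <= text.length()
          && Character.toLowerCase(text.charAt(p + 1)) == 'e';
      // Lowercase only the candidate, so the rest of the text keeps its case.
      String code = candidate ? text.substring(p, p + 9).toLowerCase() : "";

      if (candidate && Utf8codeCheck(code)) {
        result.append(text, start, p).append(CodeToWord(code));
        start = p + 9;
        p = text.indexOf('%', start);
      } else {
        p = text.indexOf('%', p + 1);
      }
    }

    // Keep whatever follows the last decoded code unchanged.
    return result.append(text, start, text.length()).toString();
  }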

  /**
   * Converts one UTF-8 URL code (%exx%xx%xx) to the character it encodes
   * @param text
   * @return
   */
  private static final String CodeToWord(String text) {
    if (!Utf8codeCheck(text)) return text;

    try {
      // The cast to byte wraps values above 127 into the negative range.
      byte[] code = new byte[3];
      code[0] = (byte) Integer.parseInt(text.substring(1, 3), 16);
      code[1] = (byte) Integer.parseInt(text.substring(4, 6), 16);
      code[2] = (byte) Integer.parseInt(text.substring(7, 9), 16);
      return new String(code, "UTF-8");
    }
    catch (NumberFormatException | UnsupportedEncodingException ex) {
      // Not valid hexadecimal (or UTF-8 unavailable): leave the text undecoded.
      return text;
    }
  }

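  /**
   * Checks whether the text is a valid code of the form %exx%xx%xx
   * @param text
   * @return
   */
  private static final boolean Utf8codeCheck(String text){
    // Record the index just after each '%'. A valid code has '%' only at
    // indexes 0, 3 and 6, which produces "1", "4", "7" and a final "-1".
    String sign = "";
    if (text.startsWith("%e"))
      for (int p = 0; p != -1; ) {
        p = text.indexOf("%", p);
        if (p != -1)
          p++;
        sign += p;
      }
    return sign.equals("147-1");
  }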

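  /**
   * Checks whether the text contains a UTF-8 URL code at its first '%'
   * @param text
   * @return
   */
  public static final boolean isUtf8Url(String text) {
    text = text.toLowerCase();
    int p = text.indexOf("%");
    // >= 9 so that a code sitting exactly at the end of the text is also found
    if (p != -1 && text.length() - p >= 9) {
      text = text.substring(p, p + 9);
    }
    return Utf8codeCheck(text);
  }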
}
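
For comparison, the JDK already ships java.net.URLEncoder and java.net.URLDecoder, which handle every UTF-8 sequence length rather than only 3-byte ones. The sketch below is not part of CharTools; the class UrlCodecDemo and its two wrapper methods are made up for this illustration. Note that URLEncoder follows the HTML form rules, so it also encodes reserved ASCII characters and turns a space into '+'.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class UrlCodecDemo {
    // Percent-encodes the text using UTF-8 bytes.
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // Every Java runtime must support UTF-8, so this is not expected.
            throw new IllegalStateException(ex);
        }
    }

    // Decodes percent-encoded text, interpreting the bytes as UTF-8.
    static String decode(String text) {
        try {
            return URLDecoder.decode(text, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String original = "\u4F60\u597D"; // 你好
        String encoded = encode(original);
        System.out.println(encoded);                           // %E4%BD%A0%E5%A5%BD
        System.out.println(decode(encoded).equals(original));  // true
        System.out.println(encode("a b"));                     // a+b
    }
}
```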

Origin blog.csdn.net/weixin_41588751/article/details/105327490