Determine the encoding format of a text file

Text files on Windows use one of two character sets: ANSI or Unicode.

For Unicode, Windows supports three encodings: little-endian UTF-16 (called Unicode in .NET), big-endian UTF-16 (BigEndianUnicode), and UTF-8.

We can tell which encoding a file uses from its header, the byte order mark (BOM). If the first two bytes are FF FE, the file is little-endian Unicode; if they are FE FF, it is big-endian Unicode; if the file starts with EF BB BF (the UTF-8 BOM is three bytes long), it is UTF-8; if none of these match, the file is treated as ANSI.
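The signatures listed above can be checked against the framework itself. The following is a minimal sketch, assuming a hypothetical helper class BomTable (not part of the original article) that prints the byte order mark each System.Text.Encoding reports through GetPreamble().

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; the class and method names are assumptions.
public static class BomTable
{
    // Formats the BOM (preamble) of an encoding as hex, e.g. "FF-FE".
    // Returns an empty string when the encoding has no BOM.
    public static string BomHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    // Writes the three signatures discussed in the text to the console.
    public static void PrintAll()
    {
        Console.WriteLine("Unicode (little-endian): " + BomHex(Encoding.Unicode));
        Console.WriteLine("BigEndianUnicode:        " + BomHex(Encoding.BigEndianUnicode));
        Console.WriteLine("UTF8:                    " + BomHex(Encoding.UTF8));
    }
}
```

Calling BomTable.PrintAll() should list FF-FE, FE-FF and EF-BB-BF, which matches the byte patterns described above.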

As described above, we can determine the encoding of a file by reading the first few bytes of its header. The code is as follows (C#):

In the code, System.Text.Encoding.Default refers to the current ANSI code page of the operating system. This holds on .NET Framework; on .NET Core and later, Encoding.Default is always UTF-8.


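public System.Text.Encoding GetFileEncodeType(string filename)
{
    byte[] buffer;

    // The using blocks close the file even if reading throws.
    using (System.IO.FileStream fs = new System.IO.FileStream(filename, System.IO.FileMode.Open, System.IO.FileAccess.Read))
    using (System.IO.BinaryReader br = new System.IO.BinaryReader(fs))
    {
        // Read up to three bytes: the UTF-8 BOM is three bytes, the UTF-16 BOMs are two.
        // ReadBytes returns a shorter array if the file is smaller than that.
        buffer = br.ReadBytes(3);
    }

    if (buffer.Length >= 3 && buffer[0] == 0xEF && buffer[1] == 0xBB && buffer[2] == 0xBF)
    {
        return System.Text.Encoding.UTF8;
    }
    else if (buffer.Length >= 2 && buffer[0] == 0xFE && buffer[1] == 0xFF)
    {
        return System.Text.Encoding.BigEndianUnicode;
    }
    else if (buffer.Length >= 2 && buffer[0] == 0xFF && buffer[1] == 0xFE)
    {
        return System.Text.Encoding.Unicode;
    }
    else
    {
        // No recognized BOM (or the file is too short): fall back to the default encoding.
        return System.Text.Encoding.Default;
    }
}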
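
For comparison, the framework can do a similar header check on its own. This is not the method shown above but a swapped-in alternative: a minimal sketch using the StreamReader constructor that takes a detectEncodingFromByteOrderMarks flag. The class name BomSniffer and the method DetectWithStreamReader are hypothetical names chosen for illustration.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper for illustration; the names are assumptions.
public static class BomSniffer
{
    // Returns the encoding StreamReader settles on after looking at the file header.
    // If the file has no recognized byte order mark, the fallback is returned unchanged.
    public static Encoding DetectWithStreamReader(string path, Encoding fallback)
    {
        using (StreamReader reader = new StreamReader(path, fallback, true))
        {
            // CurrentEncoding is only updated after the first read attempt,
            // so Peek() is called to make the reader inspect the header.
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }
}
```

A call such as BomSniffer.DetectWithStreamReader(filename, System.Text.Encoding.Default) plays the same role as GetFileEncodeType(filename). Comparing the CodePage property of the result (1200 for little-endian Unicode, 1201 for big-endian Unicode, 65001 for UTF-8) is a simple way to check which encoding was detected.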

Original address: http://www.cnblogs.com/swtseaman/archive/2011/05/17/2048689.html

Origin blog.csdn.net/gaoxu529/article/details/48649363