C#: handling garbled characters when converting UTF-8 to Shift-JIS

While working on a recent project I ran into a problem with CSV export. The client requires the exported CSV file to be Shift-JIS encoded, but our database stores the data as Unicode, so the exported file ended up with a lot of "?" characters. The reason is as follows.

First, the code table:

Shift_JIS single-byte code table (row = high nibble, column = low nibble)

     0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
00   NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
10   DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
20   SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
30   0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
40   @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
50   P    Q    R    S    T    U    V    W    X    Y    Z    [    ¥    ]    ^    _
60   `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
70   p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
80
90
A0        ｡    ｢    ｣    ､    ･    ｦ    ｧ    ｨ    ｩ    ｪ    ｫ    ｬ    ｭ    ｮ    ｯ
B0   ｰ    ｱ    ｲ    ｳ    ｴ    ｵ    ｶ    ｷ    ｸ    ｹ    ｺ    ｻ    ｼ    ｽ    ｾ    ｿ
C0   ﾀ    ﾁ    ﾂ    ﾃ    ﾄ    ﾅ    ﾆ    ﾇ    ﾈ    ﾉ    ﾊ    ﾋ    ﾌ    ﾍ    ﾎ    ﾏ
D0   ﾐ    ﾑ    ﾒ    ﾓ    ﾔ    ﾕ    ﾖ    ﾗ    ﾘ    ﾙ    ﾚ    ﾛ    ﾜ    ﾝ    ﾞ    ﾟ
E0
F0

Rows 80, 90, E0 and F0 are empty in the single-byte table; byte values in those ranges mostly serve as the first byte of two-byte characters, as described below.

 

Shift_JIS is a character encoding used on Japanese computer systems. It can hold half-width and full-width Latin letters, hiragana, katakana, symbols and kanji.

It is called Shift_JIS because, when the full-width characters were placed in it, they were shifted to avoid the range 0xA1-0xDF that the half-width kana already occupied.

Microsoft and IBM use this code table on their Japanese computer systems, where it is called CP932.

Byte structure

The following characters are represented by a single byte in Shift_JIS:

ASCII characters (0x20-0x7E), except that "\" is replaced by "¥"

ASCII control characters (0x00-0x1F, 0x7F)

Half-width punctuation and katakana from the JIS X 0201 standard (0xA1-0xDF)

On some operating systems, 0xA0 holds the non-breaking space.

The following characters are represented by two bytes in Shift_JIS:

All characters of the JIS X 0208 character set

First byte: 0x81-0x9F, 0xE0-0xEF (47 values in total)

Second byte: 0x40-0x7E, 0x80-0xFC (188 values in total)

User-defined area

First byte: 0xF0-0xFC (13 values in total)

Second byte: 0x40-0x7E, 0x80-0xFC (188 values in total)

0xFD, 0xFE and 0xFF are not used in the Shift_JIS code table.

On Microsoft and IBM Japanese computer systems, the two-byte area whose first byte is 0xFA, 0xFB or 0xFC holds 388 symbols and kanji that are not included in JIS X 0208.
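
The byte ranges above can be turned into a small classifier. The following is a minimal sketch and not part of the original project: the class ShiftJisBytes and its method names are made up for illustration, and it assumes that 0xF0-0xFC are also treated as first bytes, as on CP932 systems.

```csharp
using System;

public static class ShiftJisBytes
{
    // True when b is the first byte of a two-byte character:
    // 0x81-0x9F and 0xE0-0xEF (JIS X 0208), plus 0xF0-0xFC (user-defined / CP932 extensions).
    public static bool IsLeadByte(byte b)
    {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    // True when b is a valid second byte: 0x40-0x7E or 0x80-0xFC.
    public static bool IsTrailByte(byte b)
    {
        return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
    }

    // Counts the characters in a Shift_JIS byte sequence by stepping over
    // one or two bytes at a time. Throws when a first byte has no valid second byte.
    public static int CountCharacters(byte[] data)
    {
        int count = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (IsLeadByte(data[i]))
            {
                if (i + 1 >= data.Length || !IsTrailByte(data[i + 1]))
                    throw new FormatException("Invalid two-byte sequence at offset " + i);
                i++; // skip the second byte of the pair
            }
            count++;
        }
        return count;
    }
}
```

For example, the bytes 0x41, 0x82, 0xA0, 0xB1 hold one ASCII letter, one two-byte character and one half-width katakana, so CountCharacters reports three characters.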

 

Many Unicode characters have no counterpart in Shift-JIS. When such a character is converted to Shift-JIS there is no code to map it to, so it is replaced by byte 63, which is displayed as "?". Because we still have the original data, we can replace these characters in the string with ones that do exist in Shift-JIS before converting, and they will then display properly.
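
One way to find out which characters would turn into "?" is to encode them with a strict fallback and catch the failures. This is only a sketch under assumptions: ShiftJisCheck and FindUnmappable are hypothetical names, and the RegisterProvider call assumes a modern .NET runtime where code page 932 is not available by default.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ShiftJisCheck
{
    // Returns the characters of content that Shift-JIS cannot encode.
    // Works char by char, so a surrogate pair is reported as two entries.
    public static List<char> FindUnmappable(string content)
    {
        // Assumption: running on .NET Core / .NET 5+, where code page
        // encodings must be registered before they can be looked up.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // The exception fallback makes the encoder throw instead of writing "?".
        Encoding strict = Encoding.GetEncoding(
            "shift_jis", new EncoderExceptionFallback(), new DecoderExceptionFallback());

        var unmappable = new List<char>();
        foreach (char c in content)
        {
            try
            {
                strict.GetBytes(new[] { c });
            }
            catch (EncoderFallbackException)
            {
                unmappable.Add(c);
            }
        }
        return unmappable;
    }
}
```

Running this over a sample of the exported data gives a list of candidates for the conversion table described next.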

 

Our design is as follows:

1. Use a conversion table that stores the codes or strings to be replaced together with their replacement characters.

2. Handle the replacement in two ways.

     a: Replace by encoding. Some special characters do not show up in a string even though they are present, such as the null character or 0xa0, and have no corresponding code in Shift-JIS. Others look like an empty string, such as the UTF-8 sequence new byte[] {0xef, 0xbb, 0xbf}.

     b: Replace by string before conversion. Characters that are plainly visible can be stored in the table as strings; for example, one variant of "~" can be swapped for another directly with Replace.

 

This raises a problem: the table can only store strings such as "0xef, 0xbb, 0xbf", so how do we turn such a string into new byte[] {0xef, 0xbb, 0xbf}?

We handle it as follows:

        // Parses a comma-separated list of hex values such as "0xef, 0xbb, 0xbf" into a byte array.
        private byte[] ConvertStringToByte(string originalStr)
        {
            if (string.IsNullOrEmpty(originalStr)) return null;
            string[] originalSplit = originalStr.Split(',');
            byte[] resultByte = new byte[originalSplit.Length];
            for (int i = 0; i < originalSplit.Length; i++)
            {
                // With base 16, Convert.ToByte accepts an optional "0x" prefix.
                // Looping over every part also handles entries longer than three bytes.
                resultByte[i] = Convert.ToByte(originalSplit[i].Trim(), 16);
            }
            return resultByte;
        }
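
As a worked example of this step, the sketch below parses a table entry and decodes it as UTF-8 to obtain the string to search for. It is a compact LINQ variant written for illustration; HexTableEntry, Parse and ToSearchString are hypothetical names, not part of the original project.

```csharp
using System;
using System.Linq;
using System.Text;

public static class HexTableEntry
{
    // Parses "0xef, 0xbb, 0xbf" into { 0xEF, 0xBB, 0xBF }.
    public static byte[] Parse(string entry)
    {
        return entry.Split(',')
                    .Select(part => Convert.ToByte(part.Trim(), 16))
                    .ToArray();
    }

    // Decodes the parsed bytes as UTF-8, giving the string to look for in the content.
    public static string ToSearchString(string entry)
    {
        return Encoding.UTF8.GetString(Parse(entry));
    }
}
```

HexTableEntry.ToSearchString("0xef, 0xbb, 0xbf") returns a one-character string containing U+FEFF, which is invisible when printed but can still be found and replaced.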

 

 

 

Pass the resulting bytes to the matching encoding to get the string. With that in place, the replacement logic is almost done.

The code is as follows:

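        // Applies every rule in the conversion table to content.
        public string ReplaceString(string content)
        {
            List<MessyCodeHandleBE> messyCodeHandleBEList = RetrieveAll();

            foreach (MessyCodeHandleBE entity in messyCodeHandleBEList)
            {
                if (entity.ConvertType == MessyCodeHandleConvertTypeChoices.ENCODEREPLACE)
                {
                    // OriginalCode holds hex bytes such as "0xef, 0xbb, 0xbf"; decode them as UTF-8 first.
                    byte[] originalBytes = ConvertStringToByte(entity.OriginalCode);
                    if (originalBytes == null) continue; // empty entry: nothing to search for
                    content = content.Replace(Encoding.UTF8.GetString(originalBytes), entity.ReplaceCode);
                }
                else
                {
                    // OriginalCode is the literal string to replace.
                    content = content.Replace(entity.OriginalCode, entity.ReplaceCode);
                }
            }
            return content;
        }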
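
The method above depends on project types that are not shown here (MessyCodeHandleBE, RetrieveAll). For readers who want to try the idea in isolation, here is a self-contained sketch with an in-memory table. ReplacementRule, MessyCodeCleaner and the sample rules are all hypothetical stand-ins, not the original data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical stand-in for one row of the conversion table.
public class ReplacementRule
{
    public string OriginalCode;  // hex byte list ("0xef, 0xbb, 0xbf") or a literal string
    public string ReplaceCode;   // text to put in its place
    public bool IsEncoded;       // true when OriginalCode is a hex byte list
}

public static class MessyCodeCleaner
{
    public static string Apply(string content, IEnumerable<ReplacementRule> rules)
    {
        foreach (ReplacementRule rule in rules)
        {
            // Encoded rules are decoded from UTF-8 bytes; literal rules are used as-is.
            string target = rule.IsEncoded
                ? Encoding.UTF8.GetString(
                      rule.OriginalCode.Split(',')
                          .Select(part => Convert.ToByte(part.Trim(), 16))
                          .ToArray())
                : rule.OriginalCode;

            // string.Replace rejects an empty search string, so skip such rules.
            if (!string.IsNullOrEmpty(target))
                content = content.Replace(target, rule.ReplaceCode);
        }
        return content;
    }
}
```

With one encoded rule that removes "0xef, 0xbb, 0xbf" and another that maps "0xc2, 0xa0" to an ordinary space, Apply strips the invisible characters from the content before it is converted to Shift-JIS.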

 

To find out the encoding of a special character, you can work it out yourself with code like the following:

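        // Round-trips content through Shift-JIS so that unmappable characters show up as "?".
        // On .NET Core / .NET 5+, call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
        private string ConvertToShiftJis(string content)
        {
            Encoding original = Encoding.GetEncoding("utf-8");
            Encoding shiftJis = Encoding.GetEncoding("Shift-JIS");
            // Inspect utf8Bytes in the debugger to read the character's UTF-8 bytes.
            byte[] utf8Bytes = original.GetBytes(content);
            // A value of 63 in shiftJisBytes marks a character that could not be converted.
            byte[] shiftJisBytes = Encoding.Convert(original, shiftJis, utf8Bytes);
            return shiftJis.GetString(shiftJisBytes);
        }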

 

Look at the byte values while debugging, as shown:

[Image: byte values in the debugger]

 

239 is hexadecimal 0xef, 187 is 0xbb, and 191 is 0xbf.
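
Rather than converting the decimal values by hand, the bytes can be printed directly in the format the conversion table stores. A minimal sketch, with ByteDump and ToHexList as hypothetical names:

```csharp
using System;
using System.Linq;
using System.Text;

public static class ByteDump
{
    // Formats the UTF-8 bytes of text as a hex list, e.g. "0xef, 0xbb, 0xbf".
    public static string ToHexList(string text)
    {
        return string.Join(", ",
            Encoding.UTF8.GetBytes(text).Select(b => "0x" + b.ToString("x2")));
    }
}
```

ByteDump.ToHexList("\uFEFF") yields "0xef, 0xbb, 0xbf", which can be pasted into the table as an encoded entry.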

 

Summary

Find the strings that turn into byte 63 when encoded as Shift-JIS, then swap them out with Replace. If you make any new discoveries, feel free to leave a comment.

 

 

Author: Spring Yang

Source: http://www.cnblogs.com/springyangwc/

Copyright of this article is shared by the author and cnblogs (Blog Park). Reprinting is welcome, but unless the author agrees otherwise this notice must be kept and a link to the original must be given in a prominent position on the article page; otherwise the author reserves the right to pursue legal liability.

Reposted from: https://www.cnblogs.com/springyangwc/archive/2011/07/05/2098053.html
