2012-07-31
When email is transmitted over the network, it uses MIME (Multipurpose Internet Mail Extensions). Mail transport can carry only US-ASCII characters, so any other characters in a message must be converted with some encoding before they are sent. For messages whose Subject and/or attachment names contain Chinese characters, some mail systems omit the encoding information (character encoding and transfer encoding), which results in garbled text. This article analyzes the two transfer encodings used by the Email application in Android: Base64 and Quoted-Printable.
The Subject and attachment name of a message indicate the transfer encoding and the character encoding in a compact format. The character encoding can be UTF-8, GB2312, and so on; the commonly used transfer encodings are Base64 and Quoted-Printable. This article focuses on the transfer encoding. For the Unicode character encodings, see "Unicode encoding and its implementation: UTF-16, UTF-8, and more".
1. Base64 encoding
Base64 encoding is widely used in network transmission today. It converts arbitrary content into printable characters drawn from 'A'~'Z', 'a'~'z', '0'~'9', '+' and '/' (64 characters in total), plus '=' for padding.
Character table (64 characters; an index needs only 6 bits, so the maximum index is 0x3F):
index | character | index | character | index | character | index | character
0  | A | 17 | R | 34 | i | 51 | z
1  | B | 18 | S | 35 | j | 52 | 0
2  | C | 19 | T | 36 | k | 53 | 1
3  | D | 20 | U | 37 | l | 54 | 2
4  | E | 21 | V | 38 | m | 55 | 3
5  | F | 22 | W | 39 | n | 56 | 4
6  | G | 23 | X | 40 | o | 57 | 5
7  | H | 24 | Y | 41 | p | 58 | 6
8  | I | 25 | Z | 42 | q | 59 | 7
9  | J | 26 | a | 43 | r | 60 | 8
10 | K | 27 | b | 44 | s | 61 | 9
11 | L | 28 | c | 45 | t | 62 | +
12 | M | 29 | d | 46 | u | 63 | /
13 | N | 30 | e | 47 | v |    |
14 | O | 31 | f | 48 | w |    |
15 | P | 32 | g | 49 | x |    |
16 | Q | 33 | h | 50 | y |    |
The specific conversion rules are:
1. Every 3 input bytes are converted into 4 output characters.
Three 8-bit bytes make 24 bits. Each group of 6 bits forms an index into the Base64 character table, and the output character is looked up by that index.
That is, a7..a0 b7..b0 c7..c0 -> a7..a2 a1a0b7..b4 b3..b0c7c6 c5..c0
a7..a2: index of the first output character in the character table;
a1a0b7..b4: index of the second output character;
b3..b0c7c6: index of the third output character;
c5..c0: index of the fourth output character.
2. After conversion, a line break is added after every 76 output characters.
3. A final group of fewer than 3 bytes is handled specially.
3.1 If two bytes remain:
The two bytes are padded with 0x00 to form a 3-byte group, which yields three indices; '=' is used for the fourth character.
That is, a7..a0 b7..b0 0..0 -> a7..a2 a1a0b7..b4 b3..b0 00
a7..a2: index of the first output character;
a1a0b7..b4: index of the second output character;
b3..b0 00: index of the third output character;
The fourth character: '='.
3.2 If one byte remains:
The byte is padded with 0x0000 to form a 3-byte group, which yields two indices; '=' is used for the last two characters.
That is, a7..a0 0..0 0..0 -> a7..a2 a1a0 0000
a7..a2: index of the first output character;
a1a0 0000: index of the second output character;
The third and fourth characters: '=', '='.
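The three rules above can be sketched as a small encoder. This is an illustrative implementation only, not the mime4j code; the class name Base64Sketch is hypothetical, and using CRLF as the line break is an assumption.

```java
import java.nio.charset.StandardCharsets;

// Illustrative Base64 encoder that follows the three rules above (not the mime4j code).
final class Base64Sketch {
    private static final String TABLE =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    static String encode(byte[] in) {
        StringBuilder out = new StringBuilder();
        int lineLength = 0;
        for (int i = 0; i < in.length; i += 3) {
            int remaining = Math.min(3, in.length - i);
            // Rule 1: pack up to 3 bytes into 24 bits; missing bytes are padded with 0x00.
            int b0 = in[i] & 0xFF;
            int b1 = remaining > 1 ? in[i + 1] & 0xFF : 0;
            int b2 = remaining > 2 ? in[i + 2] & 0xFF : 0;
            int group = (b0 << 16) | (b1 << 8) | b2;
            // Rule 2: break the line after every 76 output characters (CRLF is an assumption).
            if (lineLength == 76) {
                out.append("\r\n");
                lineLength = 0;
            }
            out.append(TABLE.charAt((group >> 18) & 0x3F));
            out.append(TABLE.charAt((group >> 12) & 0x3F));
            // Rule 3: '=' replaces output characters that would carry only padding bits.
            out.append(remaining > 1 ? TABLE.charAt((group >> 6) & 0x3F) : '=');
            out.append(remaining > 2 ? TABLE.charAt(group & 0x3F) : '=');
            lineLength += 4;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Man".getBytes(StandardCharsets.US_ASCII))); // TWFu
        System.out.println(encode("Ma".getBytes(StandardCharsets.US_ASCII)));  // TWE=
        System.out.println(encode("M".getBytes(StandardCharsets.US_ASCII)));   // TQ==
    }
}
```

The three calls in main exercise the full group, the two-byte remainder (3.1), and the one-byte remainder (3.2) respectively.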
2. Quoted-Printable encoding
Quoted-Printable encoding is comparatively simple. Scan the content to be encoded and process each byte:
- If it is a space (0x20), replace it with '_';
- If it is in the range [33, 127) and is not one of the restricted special characters =_?"#$%&'(),.:;<>@[\]^`{|}~ then output the original character unchanged;
- Otherwise, replace it with '=' followed by the byte value as two hexadecimal digits (for example, the byte 0xE5 becomes "=E5").
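A minimal sketch of the per-byte rule above, with the set of special characters passed in as a parameter. The class and constant names are hypothetical. Note that with the full restricted set the '.' in a file name is escaped as "=2E"; the example result shown later in this article leaves '.' as-is, which this sketch reproduces only with the narrower set "=_?". Treating that narrower set as the one in effect for that example is an assumption.

```java
import java.nio.charset.StandardCharsets;

// Illustrative Q-style encoder for the rules above (not the mime4j code).
final class QEncodeSketch {
    // The full restricted set quoted in the text.
    static final String RESTRICTED = "=_?\"#$%&'(),.:;<>@[\\]^`{|}~";
    // Narrower set (assumption): only characters that are structural in the encoded format.
    static final String MINIMAL = "=_?";

    static String encode(byte[] bytes, String specials) {
        StringBuilder out = new StringBuilder();
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (v == 0x20) {
                out.append('_');                       // space becomes '_'
            } else if (v >= 33 && v < 127 && specials.indexOf(v) == -1) {
                out.append((char) v);                  // safe printable character, copied as-is
            } else {
                out.append('=').append(String.format("%02X", v)); // '=' plus two hex digits
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "\u5415\u6676\u6676" is the name used in this article, written with Unicode escapes.
        byte[] utf8 = "\u5415\u6676\u6676jj9.jpg".getBytes(StandardCharsets.UTF_8);
        System.out.println(encode(utf8, MINIMAL));
        System.out.println(encode(utf8, RESTRICTED));
    }
}
```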
3. Format of the Email Subject and attachment name
With the Base64 and Quoted-Printable encodings available, a format is needed to indicate which transfer encoding is used and, at the same time, which character encoding the encoded text uses.
The format of the Email Subject and attachment name is: <prefix><charset>?<encodeMode>?<encodedContent><suffix>
where:
- <prefix> is fixed as "=?";
- <charset> is the character encoding;
- <encodeMode> is the transfer encoding: B stands for Base64, Q stands for Quoted-Printable;
- <encodedContent> is the string, first converted to bytes with charset and then encoded with encodeMode;
- <suffix> is fixed as "?=".
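The five fields above can be assembled and taken apart again with a few lines of code. This is a sketch under the assumption that the charset and the encoded content contain no '?' character; the class name and the regular expression are illustrative, not taken from Android or mime4j.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative composer/parser for <prefix><charset>?<encodeMode>?<encodedContent><suffix>.
final class EncodedWordFormat {
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]+)\\?([BbQq])\\?([^?]*)\\?=");

    static String compose(String charset, char encodeMode, String encodedContent) {
        return "=?" + charset + "?" + encodeMode + "?" + encodedContent + "?=";
    }

    // Returns {charset, encodeMode, encodedContent}, or null if the input does not match.
    static String[] parse(String word) {
        Matcher m = ENCODED_WORD.matcher(word);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2).toUpperCase(), m.group(3) };
    }

    public static void main(String[] args) {
        String word = compose("UTF-8", 'Q', "=E5=90=95");
        System.out.println(word);
        System.out.println(String.join(" | ", parse(word)));
    }
}
```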
For example, suppose "吕晶晶jj9.jpg" is to be sent by email as the Subject or as an attachment name. The encoding process is as follows:
3.1. UTF-8 encoding
E59095 E699B6 E699B6 6A6A392E6A7067
吕 晶 晶 j j 9 . j p g
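The byte sequence above can be reproduced with a short hex dump. The class name is hypothetical, and the Chinese characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

// Prints the UTF-8 bytes of a string as uppercase hexadecimal.
final class Utf8HexSketch {
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "\u5415\u6676\u6676jj9.jpg";
        // Prints E59095E699B6E699B66A6A392E6A7067
        System.out.println(toHex(name.getBytes(StandardCharsets.UTF_8)));
    }
}
```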
3.2. Base64 encoding
E59095 E699B6 E699B6 6A6A39 2E6A70 67      groups of 3 bytes
E59095 -> 111001011001000010010101         binary
       -> 111001 011001 000010 010101      6-bit groups (binary)
       -> 57     25     2      21          index (decimal)
       -> '5'    'Z'    'C'    'V'         encoded characters
E699B6 -> 111001101001100110110110         binary
       -> 111001 101001 100110 110110      6-bit groups (binary)
       -> 57     41     38     54          index (decimal)
       -> '5'    'p'    'm'    '2'         encoded characters
E699B6 -> 111001101001100110110110         binary
       -> 111001 101001 100110 110110      6-bit groups (binary)
       -> 57     41     38     54          index (decimal)
       -> '5'    'p'    'm'    '2'         encoded characters
6A6A39 -> 011010100110101000111001         binary
       -> 011010 100110 101000 111001      6-bit groups (binary)
       -> 26     38     40     57          index (decimal)
       -> 'a'    'm'    'o'    '5'         encoded characters
2E6A70 -> 001011100110101001110000         binary
       -> 001011 100110 101001 110000      6-bit groups (binary)
       -> 11     38     41     48          index (decimal)
       -> 'L'    'm'    'p'    'w'         encoded characters
670000 -> 011001110000000000000000         binary
       -> 011001 110000 000000 000000      6-bit groups (binary)
       -> 25     48                        index (decimal)
       -> 'Z'    'w'    '='    '='         encoded characters
Encoding process:
- Split the content to be encoded (the UTF-8 bytes of "吕晶晶jj9.jpg") into groups of 3 bytes (the first row of the listing above);
- Split each group into 6-bit units and read each unit as an index into the character table (the 6-bit and index rows of each group);
- Look up each index in the table to get the encoded characters (the last row of each group);
- Handle the final single byte 0x67 by padding it with zeros and emitting '=' twice (the last group, 670000).
Concatenating the encoded characters of all groups gives the Base64 encoding:
5ZCV5pm25pm2amo5LmpwZw==
3.3. The final Base64 encoding result
Then, following the format, add the prefix, character encoding, transfer encoding, and suffix to get:
=?UTF-8?B?5ZCV5pm25pm2amo5LmpwZw==?=
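The hand-computed result can be cross-checked against the JDK's own encoder. The sketch below uses java.util.Base64 from the standard library rather than mime4j, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Cross-check of the worked example using the JDK Base64 encoder.
final class EncodedWordBase64Check {
    static String encodeWordB(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // prefix + charset + "?B?" + Base64 content + suffix
        return "=?UTF-8?B?" + Base64.getEncoder().encodeToString(utf8) + "?=";
    }

    public static void main(String[] args) {
        // Prints =?UTF-8?B?5ZCV5pm25pm2amo5LmpwZw==?=
        System.out.println(encodeWordB("\u5415\u6676\u6676jj9.jpg"));
    }
}
```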
3.4. Quoted-Printable encoding result
If Quoted-Printable is used as the transfer encoding instead, the result is:
=?UTF-8?Q?=E5=90=95=E6=99=B6=E6=99=B6jj9.jpg?=
The encoding process is straightforward; readers can work through it themselves using the Quoted-Printable rules in the second section.
4. Implementation of Email in Android
In the native Android Email implementation, Base64 and Quoted-Printable encoding and decoding are implemented with the third-party open-source package mime4j. Specifically, every Base64 / Quoted-Printable encoded field can be decoded, but when mail is sent only the Subject is encoded; the attachment name is not. This is what causes garbled Chinese attachment names.
Transfer encoding and decoding go through com.android.email.mail.internet.MimeUtility, which calls org.apache.james.mime4j.decoder.DecoderUtil or org.apache.james.mime4j.codec.EncoderUtil.
4.1 Decoding
com.android.email.mail.internet.MimeUtility has several static methods related to decoding:
public static String unfoldAndDecode(String s);
public static String unfold(String s);
public static String decode(String s);
unfoldAndDecode combines the two operations unfold and decode. unfold removes the CRLF sequences from the encoded content; decode performs the actual decoding.
decode calls org.apache.james.mime4j.decoder.DecoderUtil#decodeEncodedWords().
Depending on the transfer encoding, decodeEncodedWords() chooses decodeB() for Base64 decoding or decodeQ() for Quoted-Printable decoding.
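The unfold-then-decode flow described above might look roughly like the following. This is a simplified sketch, not the Android or mime4j source: the method names mirror the ones mentioned in the text, but their bodies, the regular expression, and the class name are assumptions, and error handling for unknown charsets or malformed input is omitted.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified sketch of unfold + decode; not the actual Android/mime4j implementation.
final class DecodeSketch {
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]+)\\?([BbQq])\\?([^?]*)\\?=");

    // Removes CR and LF characters from a folded header value.
    static String unfold(String s) {
        return s.replaceAll("\r|\n", "");
    }

    // Replaces every =?charset?B|Q?content?= with its decoded text.
    static String decodeEncodedWords(String s) {
        Matcher m = ENCODED_WORD.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Charset charset = Charset.forName(m.group(1));
            byte[] raw = m.group(2).equalsIgnoreCase("B")
                    ? decodeB(m.group(3))
                    : decodeQ(m.group(3));
            m.appendReplacement(out, Matcher.quoteReplacement(new String(raw, charset)));
        }
        m.appendTail(out);
        return out.toString();
    }

    static byte[] decodeB(String text) {
        return Base64.getDecoder().decode(text);
    }

    static byte[] decodeQ(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '_') {
                out.write(0x20);                        // '_' stands for a space
            } else if (c == '=' && i + 2 < text.length()) {
                out.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2;                                 // skip the two hex digits
            } else {
                out.write(c);                           // literal ASCII character
            }
        }
        return out.toByteArray();
    }

    static String unfoldAndDecode(String s) {
        return decodeEncodedWords(unfold(s));
    }
}
```

Both example results from section 3 decode back to the same name with this sketch, whichever transfer encoding was used.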
4.2 Encoding
com.android.email.mail.internet.MimeUtility has several static methods related to encoding:
public static String foldAndEncode(String s);
public static String foldAndEncode2(String s, int usedCharacters);
public static String fold(String s, int usedCharacters);
foldAndEncode performs no operation; foldAndEncode2 does the actual encoding, and is implemented with org.apache.james.mime4j.codec.EncoderUtil#encodeIfNecessary.
4.2.1 Whether encoding is needed
Encoding increases the length of the string and is not mandatory. EncoderUtil#hasToBeEncoded() analyzes the original string to decide whether it must be encoded.
- If the string contains only ordinary printable characters, encoding is not necessary;
- If the string contains control characters or characters greater than 127, it must be encoded.
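The two conditions above can be sketched as a single scan over the string. This follows the description in this article rather than the mime4j source; the class name is hypothetical, and allowing space and tab through is an assumption.

```java
// Sketch of the "must this string be encoded?" check described above.
final class NeedsEncodingSketch {
    static boolean hasToBeEncoded(String text) {
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == ' ' || ch == '\t') {
                continue;              // whitespace is allowed (assumption)
            }
            if (ch < 32 || ch >= 127) {
                return true;           // control character or non-ASCII character
            }
        }
        return false;                  // only ordinary printable characters
    }

    public static void main(String[] args) {
        System.out.println(hasToBeEncoded("report.pdf"));                // false
        System.out.println(hasToBeEncoded("\u5415\u6676\u6676jj9.jpg")); // true
    }
}
```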
4.2.2 Choice of encoding
The choice of encoding covers both the character encoding and the transfer encoding.
The character encoding is selected by EncoderUtil#determineCharset():
- If any character in the string to be encoded has a Unicode code point greater than 0xFF, UTF-8 is used;
- Otherwise, if any character has a Unicode code point greater than 0x7F, ISO-8859-1 is used;
- Otherwise, US-ASCII is used.
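The three-way choice above can be written as one pass over the characters. The sketch follows the bullets in this article and is not copied from mime4j; the class name and the string return type are illustrative.

```java
// Sketch of the charset selection rule described above.
final class CharsetChoiceSketch {
    static String determineCharset(String text) {
        boolean ascii = true;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch > 0xFF) {
                return "UTF-8";        // at least one character does not fit in one byte
            }
            if (ch > 0x7F) {
                ascii = false;         // fits in ISO-8859-1 but is not US-ASCII
            }
        }
        return ascii ? "US-ASCII" : "ISO-8859-1";
    }

    public static void main(String[] args) {
        System.out.println(determineCharset("abc"));          // US-ASCII
        System.out.println(determineCharset("caf\u00e9"));    // ISO-8859-1
        System.out.println(determineCharset("\u5415.jpg"));   // UTF-8
    }
}
```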
The transfer encoding is selected by EncoderUtil#determineEncoding().
determineEncoding looks at the proportion of characters in the string that would need Quoted-Printable escaping. Quoted-Printable is used only when that proportion is below 30%; otherwise Base64 is used.
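A sketch of the 30% rule, assuming it is applied to the bytes of the already charset-encoded string and that a space does not count as needing an escape (it only becomes '_'). Both assumptions, the class name, and the choice of "=_?" as the special characters are illustrative and not confirmed by this article.

```java
import java.nio.charset.StandardCharsets;

// Sketch of choosing between Q and B by the share of bytes that need escaping.
final class TransferEncodingChoiceSketch {
    static char determineEncoding(byte[] bytes) {
        if (bytes.length == 0) {
            return 'Q';
        }
        int needEscape = 0;
        for (byte b : bytes) {
            int v = b & 0xFF;
            boolean plain = v >= 33 && v < 127 && "=_?".indexOf(v) == -1;
            if (v != 0x20 && !plain) {
                needEscape++;
            }
        }
        // Q only when fewer than 30% of the bytes need an escape, otherwise B.
        return needEscape * 100 < 30 * bytes.length ? 'Q' : 'B';
    }

    public static void main(String[] args) {
        byte[] chinese = "\u5415\u6676\u6676jj9.jpg".getBytes(StandardCharsets.UTF_8);
        byte[] mostlyAscii = "caf\u00e9 menu.txt".getBytes(StandardCharsets.UTF_8);
        System.out.println(determineEncoding(chinese));     // B (9 of 16 bytes need escaping)
        System.out.println(determineEncoding(mostlyAscii)); // Q (2 of 14 bytes need escaping)
    }
}
```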
4.2.3 Implementation of encoding
Base64 encoding is performed by encodeB(); Quoted-Printable encoding is performed by encodeQ().
4.3 Solving the problem by adding encoding information
In the Android Email implementation:
- The Subject, attachment names, and other fields of received mail are decoded;
- When a message is sent or saved, the Subject is encoded but the attachment name is not.
Therefore, when a message with a Chinese attachment name sent by the Android Email client is received, the attachment name is garbled. The solution is to encode the attachment name, as discussed earlier in this article, at the point where the message is sent or saved.
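As a rough illustration of that fix, the sketch below encodes an attachment name before it is written into a header. It always picks UTF-8 and Base64 for simplicity, which is a simplification of the selection logic in 4.2.2. The class and method names and the exact header layout are hypothetical, and whether a given receiving client accepts an encoded name in this position is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: encode an attachment name only when it needs encoding.
final class AttachmentNameSketch {
    static String encodeIfNecessary(String name) {
        boolean needsEncoding = false;
        for (int i = 0; i < name.length(); i++) {
            char ch = name.charAt(i);
            if (ch < 32 || ch >= 127) {
                needsEncoding = true;
                break;
            }
        }
        if (!needsEncoding) {
            return name;               // plain ASCII names are left untouched
        }
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        return "=?UTF-8?B?" + Base64.getEncoder().encodeToString(utf8) + "?=";
    }

    // Illustrative header line; the real client code builds its headers elsewhere.
    static String contentDisposition(String name) {
        return "Content-Disposition: attachment; filename=\"" + encodeIfNecessary(name) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(contentDisposition("\u5415\u6676\u6676jj9.jpg"));
        System.out.println(contentDisposition("photo.jpg"));
    }
}
```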
5. Issues still outstanding
The solution in 4.3 fixes newly sent messages, but the attachment names of messages that already exist remain garbled. Moreover, when such an unencoded message is received by another mail client (such as Outlook), the attachment name is parsed correctly. This shows that a client can decode the name even when it has not been encoded and no encoding format has been specified. Despite experimenting, the author has not yet worked out how this implicit encoding/decoding is done. If any reader knows how it works, please let me know!
Below is an attachment name sent through the Android Email client as "吕晶晶jj9.jpg", followed by the name as received. How is it encoded and decoded?
UTF-8 name sent
E59095 E699B6 E699B6 6A6A392E6A7067
吕 晶 晶 j j 9 . j p g
Received name (what kind of encoding is this? The hexadecimal bytes below were captured from the attachment name of the received message. If anyone knows the principle behind this encoding, please enlighten me!)
C3A5C290C295 C3A6C299C2B6 C3A6C299C2B6 6A6A392E6A7067
吕 晶 晶 j j 9 . j p g
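One reading that is consistent with the bytes above is double encoding: each byte of the UTF-8 name is treated as an ISO-8859-1 character and then encoded as UTF-8 a second time. For example, the byte E5 is read as U+00E5, whose UTF-8 form is C3 A5, and 90 is read as U+0090, whose UTF-8 form is C2 90. The sketch below reproduces the received bytes from the sent ones under that assumption; it does not show that the Email client actually works this way, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Demonstrates the "UTF-8 bytes read as ISO-8859-1, then encoded as UTF-8 again" hypothesis.
final class DoubleEncodingSketch {
    static byte[] doubleEncode(byte[] utf8) {
        // Each byte becomes one character U+0000..U+00FF, then is written out as UTF-8.
        return new String(utf8, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.UTF_8);
    }

    static String recover(byte[] received) {
        // Reverse the two steps: UTF-8 -> characters -> original bytes -> UTF-8 text.
        byte[] original = new String(received, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] sent = "\u5415\u6676\u6676jj9.jpg".getBytes(StandardCharsets.UTF_8);
        byte[] received = doubleEncode(sent);
        // Prints C3A5C290C295C3A6C299C2B6C3A6C299C2B66A6A392E6A7067
        System.out.println(toHex(received));
        System.out.println(recover(received).equals("\u5415\u6676\u6676jj9.jpg")); // true
    }
}
```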