Base64 and Quoted-Printable encoding in Android mail

2012-07-31

When email is transmitted over the network, MIME (Multipurpose Internet Mail Extensions) is used. Mail transport can only carry US-ASCII characters, so any other characters in a message must be converted with some encoding before being sent. For mail whose Subject and/or attachment names contain Chinese characters, some mail systems omit the encoding information (character encoding and transfer encoding), which results in garbled text. This article analyzes the two transfer encodings used by the Email application in Android: Base64 and Quoted-Printable.

The subject and attachment name of a mail indicate the transfer encoding and the character encoding in a short format. The character encoding can be UTF-8, GB2312, and so on; the commonly used transfer encodings are Base64 and Quoted-Printable. This article focuses on the transfer encoding. For the Unicode family of character encodings, see "Unicode encoding and its implementation: UTF-16, UTF-8, and more".

1. Base64 encoding

Base64 encoding is widely used in network transmission today. It converts arbitrary content into printable characters: the 64 characters 'A'~'Z', 'a'~'z', '0'~'9', '+' and '/', plus '=' for padding.

Character table (64 characters; an index needs only 6 bits, so the maximum index is 0x3F):

Index	Character	Index	Character	Index	Character	Index	Character
0	A	17	R	34	i	51	z
1	B	18	S	35	j	52	0
2	C	19	T	36	k	53	1
3	D	20	U	37	l	54	2
4	E	21	V	38	m	55	3
5	F	22	W	39	n	56	4
6	G	23	X	40	o	57	5
7	H	24	Y	41	p	58	6
8	I	25	Z	42	q	59	7
9	J	26	a	43	r	60	8
10	K	27	b	44	s	61	9
11	L	28	c	45	t	62	+
12	M	29	d	46	u	63	/
13	N	30	e	47	v
14	O	31	f	48	w
15	P	32	g	49	x
16	Q	33	h	50	y

The conversion rules are:

1. Every 3 bytes are converted into 4 characters.

Three 8-bit bytes hold 24 bits; every 6 bits form an index into the Base64 character table, and the encoded character is looked up by that index.

That is, a7..a0 b7..b0 c7..c0 -> A7..A2 A1A0B7..B4 B3..B0C7C6 C5..C0 (upper-case A, B, C denote the same bits as a, b, c)

A7..A2 is the index of the first character in the character table;

A1A0B7..B4 is the index of the second character;

B3..B0C7C6 is the index of the third character;

C5..C0 is the index of the fourth character.

2. After conversion, a line break is inserted after every 76 characters.

3. A final group of fewer than 3 bytes is handled specially.

3.1 If two bytes remain:

The two remaining bytes are combined with 0x00 to form one group, which yields the indexes of three characters; the fourth character is '='.

That is, a7..a0 b7..b0 0..0 -> A7..A2 A1A0B7..B4 B3..B0 00

A7..A2 is the index of the first character in the character table;

A1A0B7..B4 is the index of the second character;

B3..B0 00 is the index of the third character;

The fourth character is '='.

3.2 If one byte remains:

The remaining byte is combined with 0x0000 to form one group, which yields the indexes of two characters; the last two characters are both '='.

That is, a7..a0 0..0 0..0 -> A7..A2 A1A0 0000

A7..A2 is the index of the first character in the character table;

A1A0 0000 is the index of the second character;

The third and fourth characters are '=' and '='.
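
Rules 1 and 3 can be sketched in a few lines of Java. This is only an illustration of the rules as stated above, not the code used by Android or mime4j; the class and method names are made up for this example, and rule 2 (the line break every 76 characters) is left out.

```java
import java.nio.charset.StandardCharsets;

class Base64Sketch {
    // The 64-character table from the section above, in index order.
    private static final String TABLE =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    /** Encodes bytes with rules 1 and 3; line breaks (rule 2) are omitted. */
    static String encode(byte[] in) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length; i += 3) {
            int remaining = Math.min(3, in.length - i);
            // Pack up to three bytes into 24 bits; missing bytes count as 0x00.
            int b0 = in[i] & 0xFF;
            int b1 = remaining > 1 ? in[i + 1] & 0xFF : 0;
            int b2 = remaining > 2 ? in[i + 2] & 0xFF : 0;
            int bits = (b0 << 16) | (b1 << 8) | b2;
            // Each 6-bit slice is an index into the table.
            out.append(TABLE.charAt((bits >> 18) & 0x3F));
            out.append(TABLE.charAt((bits >> 12) & 0x3F));
            // Positions that hold no real data are written as '='.
            out.append(remaining > 1 ? TABLE.charAt((bits >> 6) & 0x3F) : '=');
            out.append(remaining > 2 ? TABLE.charAt(bits & 0x3F) : '=');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("jj9".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(encode("pg".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(encode("g".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

The three inputs in main exercise the full group, the two-byte remainder (3.1) and the one-byte remainder (3.2) respectively.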

2. Quoted-Printable encoding

Quoted-Printable encoding, in the form used for mail header fields (the "Q" encoding), is relatively simple. Scan the content to be encoded and process each byte:

If it is a space (0x20), replace it with '_';
If it is in the range [33, 127) and is not one of the special restricted characters = _ ? " # $ % & ' ( ) , . : ; < > @ [ \ ] ^ ` { | } ~, output the character unchanged;
Otherwise, replace it with '=' followed by the byte value as two hexadecimal digits (for example, the byte 0xE5 becomes =E5).
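
The three rules can be sketched as follows. This is a minimal illustration of the rules as listed, not the mime4j source; the class name, the method name and the idea of passing the restricted set as a parameter are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

class QEncodingSketch {
    // The restricted set listed above, written as a Java string literal.
    static final String RESTRICTED = "=_?\"#$%&'(),.:;<>@[\\]^`{|}~";

    /** Applies the three rules to each byte; 'restricted' lists characters that must be escaped. */
    static String encodeQ(byte[] bytes, String restricted) {
        StringBuilder out = new StringBuilder();
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (v == 0x20) {
                out.append('_');                      // rule 1: space becomes '_'
            } else if (v >= 33 && v < 127 && restricted.indexOf(v) < 0) {
                out.append((char) v);                 // rule 2: safe printable, copied as-is
            } else {
                out.append('=').append(String.format("%02X", v)); // rule 3: =XX
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] sample = "a b=c".getBytes(StandardCharsets.US_ASCII);
        System.out.println(encodeQ(sample, RESTRICTED));
    }
}
```

Passing a shorter set such as "=_?" leaves characters like '.' unescaped, which makes it easy to compare the effect of a stricter or looser restricted set.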

3. Format of the Email Subject and attachment name

With both Base64 and Quoted-Printable available, a fixed format is needed to indicate which transfer encoding was used, and the character encoding of the encoded text must be specified as well.

The format of the Email Subject and attachment name is: <prefix><charset>?<encodeMode>?<encodedContent><suffix>

where:

<prefix> is fixed as "=?";
<charset> is the character encoding;
<encodeMode> is the transfer encoding: B stands for Base64; Q stands for Quoted-Printable;
<encodedContent> is the string, first converted to bytes in charset and then encoded with encodeMode;
<suffix> is fixed as "?=".

For example, suppose "吕晶晶jj9.jpg" is to be sent by Email as the subject or as an attachment name. The encoding process is as follows:

3.1. UTF-8 encoding

E59095 E699B6 E699B6 6A6A392E6A7067
吕     晶     晶     j j 9 . j p g

3.2. Base64 encoding

E59095 E699B6 E699B6 6A6A39 2E6A70 67    3-byte groups
E59095 -> 111001011001000010010101     binary
       -> 111001 011001 000010 010101  6-bit groups (binary)
       -> 57     25     2      21      index (decimal)
       -> '5'    'Z'    'C'    'V'     encoded characters
E699B6 -> 111001101001100110110110     binary
       -> 111001 101001 100110 110110  6-bit groups (binary)
       -> 57     41     38     54      index (decimal)
       -> '5'    'p'    'm'    '2'     encoded characters
E699B6 -> 111001101001100110110110     binary
       -> 111001 101001 100110 110110  6-bit groups (binary)
       -> 57     41     38     54      index (decimal)
       -> '5'    'p'    'm'    '2'     encoded characters
6A6A39 -> 011010100110101000111001     binary
       -> 011010 100110 101000 111001  6-bit groups (binary)
       -> 26     38     40     57      index (decimal)
       -> 'a'    'm'    'o'    '5'     encoded characters
2E6A70 -> 001011100110101001110000     binary
       -> 001011 100110 101001 110000  6-bit groups (binary)
       -> 11     38     41     48      index (decimal)
       -> 'L'    'm'    'p'    'w'     encoded characters
670000 -> 011001110000000000000000     binary
       -> 011001 110000 000000 000000  6-bit groups (binary)
       -> 25     48                    index (decimal)
       -> 'Z'    'w'    '='    '='     encoded characters

Encoding process (Line#n refers to the n-th line of the listing in 3.2):

Group the content to be encoded (the UTF-8 bytes of "吕晶晶jj9.jpg") into groups of 3 bytes [Line#1];
Split each group into 6-bit pieces to get the indexes into the character table [Line#3&4; Line#7&8; Line#11&12; Line#15&16; Line#19&20];
Look up the table by index to get the encoded characters [Line#5; Line#9; Line#13; Line#17; Line#21];
Process the last byte [Line#22~#25].

So the Base64 encoding is [Line#5; Line#9; Line#13; Line#17; Line#21; Line#25]:

5ZCV5pm25pm2amo5LmpwZw==

3.3. The final Base64 encoding result

Then, following the format, add the prefix, the character encoding, the transfer encoding and the suffix to get:

=?UTF-8?B?5ZCV5pm25pm2amo5LmpwZw==?=
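
Steps 3.1 to 3.3 can be reproduced with the JDK's own java.util.Base64 (Java 8 and later), which is a convenient way to check the hand calculation. The class and method names below are illustrative only and are not part of Android Email or mime4j.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class EncodedWordBuilder {
    /** Upper-case hex dump of a byte array, used to check the UTF-8 step. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Builds =?UTF-8?B?...?= from a Java string. */
    static String encodeB(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return "=?UTF-8?B?" + Base64.getEncoder().encodeToString(utf8) + "?=";
    }

    public static void main(String[] args) {
        // "\u5415\u6676\u6676jj9.jpg" is the file name used in this section.
        String name = "\u5415\u6676\u6676jj9.jpg";
        System.out.println(toHex(name.getBytes(StandardCharsets.UTF_8)));
        System.out.println(encodeB(name));
    }
}
```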

3.4. Quoted-Printable encoding result

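If Quoted-Printable is used as the transfer encoding, the result is:

=?UTF-8?Q?=E5=90=95=E6=99=B6=E6=99=B6jj9.jpg?=

The encoding process is relatively simple; readers can work through it themselves using the Quoted-Printable rules in section 2. Note that '.' is left unencoded in this result even though it appears in the restricted set listed in section 2; an encoder applying that full set would write it as =2E.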

4. The implementation of Email in Android

In the native Android Email implementation, Base64 and Quoted-Printable encoding and decoding are handled by the third-party open source package mime4j. Specifically, every Base64 / Quoted-Printable encoded field can be decoded, but when mail is sent only the Subject is encoded; the attachment name is not. This is what causes garbled Chinese attachment names.

Transfer encoding and decoding are reached through com.android.email.mail.internet.MimeUtility, which calls org.apache.james.mime4j.decoder.DecoderUtil or org.apache.james.mime4j.codec.EncoderUtil.

4.1 Decoding

com.android.email.mail.internet.MimeUtility has several static methods related to decoding:

public static String unfoldAndDecode(String s);
public static String unfold(String s);
public static String decode(String s);

unfoldAndDecode combines the two operations unfold and decode. unfold removes the CRLF from the encoded content; decode is the actual decoding implementation.

decode calls org.apache.james.mime4j.decoder.DecoderUtil#decodeEncodedWords().

Depending on the transfer encoding, decodeEncodedWords() selects decodeB() for Base64 decoding or decodeQ() for Quoted-Printable decoding.
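
The decoding path can be sketched as below. This is a simplified stand-in written for this article, not the Android or mime4j source: it handles only a string that consists of exactly one encoded word, the class is hypothetical, and the method names merely mirror the ones mentioned above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EncodedWordDecoderSketch {
    // =?<charset>?<encodeMode>?<encodedContent>?=
    private static final Pattern ENCODED_WORD =
        Pattern.compile("=\\?([^?]+)\\?([BbQq])\\?([^?]*)\\?=");

    /** Removes CR and LF characters, as described for unfold above. */
    static String unfold(String s) {
        return s.replaceAll("\r|\n", "");
    }

    /** Reverses the rules of section 2: '_' is a space, =XX is one byte. */
    static byte[] decodeQ(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '_') {
                out.write(0x20);
            } else if (c == '=' && i + 2 < text.length()) {
                // Malformed hex digits would throw NumberFormatException here.
                out.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    /** Dispatches on the transfer encoding; non-matching input is returned unchanged. */
    static String decode(String s) {
        Matcher m = ENCODED_WORD.matcher(s);
        if (!m.matches()) {
            return s;
        }
        byte[] raw = m.group(2).equalsIgnoreCase("B")
                ? Base64.getDecoder().decode(m.group(3))
                : decodeQ(m.group(3));
        return new String(raw, Charset.forName(m.group(1)));
    }

    static String unfoldAndDecode(String s) {
        return decode(unfold(s));
    }

    public static void main(String[] args) {
        System.out.println(unfoldAndDecode("=?UTF-8?B?5ZCV5pm25pm2amo5LmpwZw==?="));
    }
}
```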

4.2 Encoding

com.android.email.mail.internet.MimeUtility has several static methods related to encoding:

public static String foldAndEncode(String s);
public static String foldAndEncode2(String s, int usedCharacters);
public static String fold(String s, int usedCharacters);

foldAndEncode performs no operation; foldAndEncode2 is the one that actually encodes. foldAndEncode2 is implemented through org.apache.james.mime4j.codec.EncoderUtil#encodeIfNecessary.

4.2.1 Is encoding needed?

Encoding makes the string longer, and it is not mandatory. EncoderUtil#hasToBeEncoded() analyzes the original string to decide whether it must be encoded.

If the string contains only ordinary printable characters, encoding is not necessary;
If the string contains control characters or characters greater than 127, it must be encoded.
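
The rule can be sketched as a single scan over the string. This is a simplified reading of the two conditions above, not the body of EncoderUtil#hasToBeEncoded(), which may take further conditions into account; the class and method names are hypothetical, and counting 0x7F (DEL) as a control character is an assumption.

```java
class EncodeCheckSketch {
    /** Returns true if any character is a control character or lies above 127. */
    static boolean needsEncoding(String text) {
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            // ch < 32 covers the control characters; ch >= 127 covers DEL
            // (assumed to count as a control character) and everything above it.
            if (ch < 32 || ch >= 127) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(needsEncoding("jj9.jpg"));
        System.out.println(needsEncoding("\u5415\u6676\u6676jj9.jpg"));
    }
}
```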

4.2.2 Choice of encoding

The choice of encoding includes the choice of character encoding and the choice of transmission encoding.

The character encoding is selected by EncoderUtil#determineCharset():

If any character in the string to be encoded has a Unicode code point greater than 0xFF, UTF-8 is used;
Otherwise, if any character has a Unicode code point greater than 0x7F, ISO-8859-1 is used;
Otherwise, US-ASCII is used.
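
The three-way choice can be sketched as follows. It follows the rules as listed rather than the actual determineCharset() source, and the class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetChoiceSketch {
    /** Picks the smallest of US-ASCII, ISO-8859-1 and UTF-8 that can hold the text. */
    static Charset chooseCharset(String text) {
        boolean needsLatin1 = false;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch > 0xFF) {
                return StandardCharsets.UTF_8;   // one such character decides it
            }
            if (ch > 0x7F) {
                needsLatin1 = true;              // keep scanning for anything > 0xFF
            }
        }
        return needsLatin1 ? StandardCharsets.ISO_8859_1 : StandardCharsets.US_ASCII;
    }

    public static void main(String[] args) {
        System.out.println(chooseCharset("jj9.jpg"));
        System.out.println(chooseCharset("caf\u00e9.jpg"));
        System.out.println(chooseCharset("\u5415\u6676\u6676jj9.jpg"));
    }
}
```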

The transfer encoding is selected by EncoderUtil#determineEncoding().

determineEncoding looks at the proportion of bytes in the string that would need Quoted-Printable escaping. Quoted-Printable is used only when that proportion is less than 30%; otherwise Base64 is used.
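
The 30% rule can be sketched like this. It is an illustration of the rule as described, not the determineEncoding() source; the names are hypothetical, and two details are assumptions: spaces are not counted as escaped (they become '_'), and the set of restricted characters is passed in by the caller.

```java
import java.nio.charset.StandardCharsets;

class TransferEncodingChoiceSketch {
    /** Returns 'Q' when fewer than 30% of the bytes need =XX escaping, otherwise 'B'. */
    static char chooseEncoding(byte[] bytes, String restricted) {
        if (bytes.length == 0) {
            return 'Q';
        }
        int escaped = 0;
        for (byte b : bytes) {
            int v = b & 0xFF;
            boolean literal = v >= 33 && v < 127 && restricted.indexOf(v) < 0;
            // Assumption: a space is written as '_' and does not count as escaped.
            if (v != 0x20 && !literal) {
                escaped++;
            }
        }
        return escaped < 0.30 * bytes.length ? 'Q' : 'B';
    }

    public static void main(String[] args) {
        byte[] name = "\u5415\u6676\u6676jj9.jpg".getBytes(StandardCharsets.UTF_8);
        // 9 of the 16 bytes are non-ASCII, so Base64 is chosen.
        System.out.println(chooseEncoding(name, "=_?"));
    }
}
```

For mostly Chinese text nearly every byte needs escaping, so Base64 is the shorter result; for mostly ASCII text Quoted-Printable keeps the header readable.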

4.2.3 Implementation of encoding

Base64 encoding is performed by encodeB(); Quoted-Printable encoding is performed by encodeQ().

4.3 Solving the problem by encoding the attachment name

In the Android Email implementation:

The Subject, attachment names and other fields of received mail are decoded;
When a message is sent or saved, the Subject is encoded but the attachment name is not.

Therefore, when a mail with a Chinese attachment name sent by the Android Email client is received, the attachment name is garbled. The solution is to encode the attachment name, as described in section 4.2, at the point where the mail is sent or saved.

5. Issues still outstanding

The solution in 4.3 fixes newly sent mail, but the attachment names of mail already stored remain garbled. Moreover, when the unencoded mail is received by another mail client (such as Outlook), the attachment name is parsed correctly. This shows that a client can decode the name even when it is not encoded and no encoding format is specified. Despite experimenting, I still do not understand how this implicit encoding/decoding works. If any reader knows how it is done, please let me know!

Below is an attachment named "吕晶晶jj9.jpg" as sent through the Android Email client, followed by the attachment name as received. I do not know how the received name was encoded or how to decode it.

UTF-8 name sent

E59095 E699B6 E699B6 6A6A392E6A7067
吕     晶     晶     j j 9 . j p g

Received name (what kind of encoding is this? The hexadecimal bytes below were captured from the attachment name of the received mail. If anyone knows the encoding principle, I would appreciate an explanation!)

C3A5C290C295 C3A6C299C2B6 C3A6C299C2B6 6A6A392E6A7067
吕           晶           晶           j j 9 . j p g
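
One reading that matches the captured bytes exactly is double encoding: each byte of the UTF-8 name is treated as a single ISO-8859-1 character and the resulting characters are encoded as UTF-8 a second time. The check below only shows that this reproduces the received bytes; it does not establish where in the sending or receiving path the conversion happens, and the class name is made up for the example.

```java
import java.nio.charset.StandardCharsets;

class DoubleEncodingCheck {
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** UTF-8 bytes misread as ISO-8859-1 characters, then encoded as UTF-8 again. */
    static byte[] reencode(String name) {
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    /** Reverses reencode: the received bytes are turned back into the original name. */
    static String recover(byte[] received) {
        String misread = new String(received, StandardCharsets.UTF_8);
        return new String(misread.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u5415\u6676\u6676jj9.jpg" is the attachment name from this section.
        byte[] received = reencode("\u5415\u6676\u6676jj9.jpg");
        System.out.println(toHex(received));
        System.out.println(recover(received));
    }
}
```

Under this reading, E5 becomes C3A5, 90 becomes C290 and 95 becomes C295, while the ASCII part jj9.jpg is unchanged, which is the pattern seen in the received name.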
