URL encoding and decoding in Javascript (detailed)

Summary

This article discusses URI encoding and decoding. It describes in detail which characters in a Url need to be encoded and why, and then compares and analyzes the three pairs of encoding/decoding functions in Javascript: escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent.

Preliminary knowledge

   foo://example.com:8042/over/there?name=ferret#nose
   \_/   \______________/\_________/ \_________/ \__/
    |           |            |            |        |
 scheme     authority       path        query   fragment

URI stands for Uniform Resource Identifier; what we usually call a Url is just one type of URI. The format of a typical Url is shown above. Strictly speaking, the Url encoding discussed below should be called URI encoding.

Why Url Encoding is Needed

Usually, if something needs to be encoded, it means it is not suitable for transmission in its current form. The reasons vary: the data may be too large, or it may contain private information. For Urls, the reason for encoding is that some characters in a Url would cause ambiguity.

For example, the Url parameter string uses key=value pairs to pass parameters, with pairs separated by the & sign, as in /s?q=abc&ie=utf-8. If a value itself contains = or &, the server receiving the Url will inevitably parse it incorrectly, so the ambiguous & and = characters must be escaped, that is, encoded.
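As a quick illustration (a minimal sketch; the parameter names are made up), encoding the value before building the query string removes the ambiguity:

```javascript
// A value that itself contains "=" and "&" would confuse naive
// query-string parsing, so it is encoded before being inserted.
const value = "a=b&c";
const encoded = encodeURIComponent(value); // "a%3Db%26c"
const url = "/s?q=" + encoded + "&ie=utf-8";

console.log(url); // "/s?q=a%3Db%26c&ie=utf-8"
```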

For another example, the encoding of a Url is ASCII, not Unicode, which means a Url cannot contain any non-ASCII characters such as Chinese. Otherwise, Chinese characters may cause problems if the character sets supported by the client and the server differ.

The principle of Url encoding is to use safe characters (printable characters with no special purpose or special meaning) to represent those unsafe characters.

Which characters need to be encoded

The RFC3986 document stipulates that only English letters (a-zA-Z), digits (0-9), the four special characters -_.~, and all reserved characters may appear unencoded in a Url.

The RFC3986 document has made detailed suggestions on the encoding and decoding of Url, pointed out which characters need to be encoded so as not to cause the change of Url semantics, and explained why these characters need to be encoded.

There is no corresponding printable character in the US-ASCII character set

Only printable characters are allowed in Urls. The bytes 0x00-0x1F and 0x7F in US-ASCII represent control characters, and none of them may appear directly in a Url. Likewise, the bytes 0x80-0xFF (e.g. ISO-8859-1) fall outside the range defined by US-ASCII and cannot be placed in a Url either.

reserved characters

A Url can be divided into several components: scheme, host, path, and so on. Some characters (:/?#[]@) are used to separate components; for example, the colon separates the scheme from the host, / separates the host from the path, and ? separates the path from the query parameters. Other characters (!$&'()*+,;=) are used to separate subcomponents within a component; for example, = expresses key-value pairs in query parameters, and & separates multiple key-value pairs. When ordinary data within a component contains these special characters, it must be encoded.

The following characters are designated as reserved characters in RFC3986:

! * ' ( ) ; : @ & = + $ , / ? # [ ]

unsafe characters

There are also some characters that may cause ambiguity to the parser when they are placed directly in the Url. These characters are considered unsafe characters for a number of reasons.

space: During Url transmission, user typesetting, or processing by text-handling programs, insignificant spaces may be introduced or meaningful spaces removed.
" and <>: Quotes and angle brackets are often used to delimit Urls in ordinary text.
#: Often used to mark bookmarks (anchors).
%: The percent sign itself serves as the escape character when encoding unsafe characters, so it must be encoded as well.
{}|\^[]`~: Some gateways or transport agents tamper with these characters.

It should be noted that for legal characters in a Url, the encoded and unencoded forms are equivalent, but for the characters listed above, leaving them unencoded may change the semantics of the Url. Therefore, only ordinary English letters and digits, the special characters $-_.+!*'(), and reserved characters may appear unencoded in a Url; all other characters must be encoded before they appear.

However, for historical reasons, there are still some non-standard encoding implementations. For example, although the RFC3986 document stipulates that the tilde ~ does not require Url encoding, there are still many old gateways or transmission agents that encode it anyway.

How to encode illegal characters in Url

Url encoding is also commonly known as percent-encoding, because its encoding method is very simple: a percent sign followed by two hexadecimal digits (0-9, A-F) representing one byte. The default character set used by Url encoding is US-ASCII. For example, the byte for a in US-ASCII is 0x61, so its Url encoding is %61; entering http://g.cn/search?q=%61%62%63 in the address bar is in fact equivalent to searching for abc on Google. Likewise, the byte for the @ symbol in ASCII is 0x40, which becomes %40 after Url encoding.
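The mechanism can be verified directly in Javascript with the standard functions:

```javascript
// "@" is byte 0x40 in US-ASCII, so its percent-encoding is "%40".
console.log(encodeURIComponent("@")); // "%40"

// Decoding reverses the process: "%61%62%63" is just the bytes
// 0x61 0x62 0x63, i.e. "abc".
console.log(decodeURIComponent("%61%62%63")); // "abc"
```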

Url encoding list of common characters:

Url encoding of reserved characters
character   !    *    "    '    (    )    ;    :    @    &
encoding    %21  %2A  %22  %27  %28  %29  %3B  %3A  %40  %26

character   =    +    $    ,    /    ?    %    #    [    ]
encoding    %3D  %2B  %24  %2C  %2F  %3F  %25  %23  %5B  %5D

For non-ASCII characters, a superset of the ASCII character set is used to convert the character into bytes, and then each byte is percent-encoded. For Unicode characters, the RFC documents recommend using UTF-8 to obtain the corresponding bytes and then percent-encoding each byte. For example, encoding "中文" ("Chinese") with UTF-8 yields the bytes 0xE4 0xB8 0xAD 0xE6 0x96 0x87, which after Url encoding become "%E4%B8%AD%E6%96%87".
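This behavior is easy to confirm with the built-in functions:

```javascript
// encodeURI first converts "中文" to its UTF-8 bytes
// 0xE4 0xB8 0xAD 0xE6 0x96 0x87, then percent-encodes each byte.
console.log(encodeURI("中文")); // "%E4%B8%AD%E6%96%87"

// decodeURI reverses both steps.
console.log(decodeURI("%E4%B8%AD%E6%96%87")); // "中文"
```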

If a byte corresponds to an unreserved character in the ASCII character set, it need not be percent-encoded. For example, for the string "Url编码" ("Url encoding"), the bytes obtained with UTF-8 are 0x55 0x72 0x6C 0xE7 0xBC 0x96 0xE7 0xA0 0x81. Because the first three bytes correspond to the unreserved characters "Url" in ASCII, they can be represented by "Url" itself, so the final Url encoding can be simplified to "Url%E7%BC%96%E7%A0%81"; of course, "%55%72%6C%E7%BC%96%E7%A0%81" is also valid.

For historical reasons, there are some Url encoding implementations that do not fully follow this principle, which will be mentioned below.

The difference between escape, encodeURI and encodeURIComponent in Javascript

Javascript provides three pairs of functions to encode Url to get legal Url, they are escape / unescape, encodeURI / decodeURI and encodeURIComponent / decodeURIComponent. Since the process of decoding and encoding is reversible, only the encoding process is explained here.

These three encoding functions—escape, encodeURI, encodeURIComponent—are all used to convert unsafe and illegal Url characters into legal Url character representations. They have the following differences.

Different safe characters

The following table lists the safe characters for these three functions (i.e. the characters the function does not encode):

                                 safe characters
escape (69 characters)           * / @ + - . _ 0-9 a-z A-Z
encodeURI (82 characters)        ! # $ & ' ( ) * + , - . / : ; = ? @ _ ~ 0-9 a-z A-Z
encodeURIComponent (71 characters)  ! ' ( ) * - . _ ~ 0-9 a-z A-Z
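The difference in safe characters is easy to observe on an ASCII-only string (a small sketch; note that + is safe for escape and encodeURI but not for encodeURIComponent):

```javascript
const s = "?q=a+b&c";

console.log(escape(s));             // "%3Fq%3Da+b%26c"
console.log(encodeURI(s));          // "?q=a+b&c"  (?, =, &, + are all safe)
console.log(encodeURIComponent(s)); // "%3Fq%3Da%2Bb%26c"
```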

Compatibility is different

The escape function has existed since Javascript 1.0, and the other two functions were introduced in Javascript 1.5. But since Javascript 1.5 has become very popular, there is no compatibility problem when using encodeURI and encodeURIComponent.

Unicode characters are encoded differently

These three functions encode ASCII characters in the same way, using a percent sign plus two hexadecimal digits. But for Unicode characters, escape encodes them as %uxxxx, where xxxx is four hexadecimal digits representing the character's code point. This approach has been deprecated by the W3C, although escape's encoding syntax is still retained in the ECMA-262 standard. encodeURI and encodeURIComponent encode non-ASCII characters with UTF-8 and then percent-encode the resulting bytes, which is what the RFCs recommend. Therefore, prefer these two functions over escape whenever possible.
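The two schemes can be compared side by side ("中" is U+4E2D and "文" is U+6587):

```javascript
// escape emits the deprecated %uxxxx form based on the code point.
console.log(escape("中文")); // "%u4E2D%u6587"

// encodeURI / encodeURIComponent emit UTF-8 bytes, percent-encoded.
console.log(encodeURIComponent("中文")); // "%E4%B8%AD%E6%96%87"
```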

Applicable to different occasions

encodeURI is used to encode a complete URI, and encodeURIComponent is used to encode a component of the URI.

From the safe-character table above, we can see that encodeURIComponent encodes a larger range of characters than encodeURI. As mentioned above, reserved characters are generally used to separate URI components (a URI can be divided into multiple components; see the Preliminary knowledge section) or subcomponents (such as the separators between query parameters): for example, : separates the scheme from the host, and ? separates the path from the query. Since encodeURI operates on a complete URI, where these characters have a special purpose, it does not encode the reserved characters; otherwise the meaning of the URI would change.

Components have their own data representation formats, but that data must not contain the reserved characters that separate components; otherwise, the separation of components in the whole URI would become confused. That is why encodeURIComponent, which operates on a single component, encodes more characters.
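A short sketch of the two use cases (the example Url is made up):

```javascript
const base = "http://g.cn/search";
const q = "1+1=2"; // component data containing reserved characters

// encodeURI treats the string as a complete URI, so it leaves the
// reserved "+" and "=" alone -- the query stays ambiguous.
console.log(encodeURI(base + "?q=" + q)); // "http://g.cn/search?q=1+1=2"

// encodeURIComponent encodes the component's data, so the separators
// in the final Url are unambiguous.
console.log(base + "?q=" + encodeURIComponent(q)); // "http://g.cn/search?q=1%2B1%3D2"
```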

form submission

When an Html form is submitted, each form field is Url encoded before being sent. For historical reasons, the Url encoding implementation used by forms does not conform to the latest standard: for example, a space is encoded not as %20 but as the + sign. If the form is submitted with the Post method, we can see in the HTTP headers a Content-Type header with the value application/x-www-form-urlencoded. Most applications can handle this non-standard Url encoding, but in client-side Javascript there is no built-in function that decodes the + sign back into a space, so you have to write the conversion yourself. Also, for non-ASCII characters, the character set used for encoding depends on the character set of the current document. For example, suppose we add the following to the Html header:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

In this way, the browser will render the document using gb2312 (note that when this meta tag is absent, the browser will choose a character set automatically according to the user's preferences, and the user can also force the current website to use a specified character set). When the form is submitted, the character set used for Url encoding is gb2312.

Does document charset affect encodeURI?

I ran into a very confusing problem when using Aptana (why Aptana specifically will become clear below): when using encodeURI, the encoded result was very different from what I expected. Below is my sample code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
    </head>
    <body>
        <script type="text/javascript">
            document.write(encodeURI("中文"));
        </script>
    </body>
</html> 

The running result outputs %E6%B6%93%EE%85%9F%E6%9E%83. Obviously this is not the result of Url encoding with the UTF-8 character set (searching for "中文" on Google shows %E4%B8%AD%E6%96%87 in the Url).

So at the time I strongly suspected that encodeURI was somehow related to the page encoding, but I found that under normal circumstances, even Url encoding with gb2312 would not produce this result. Eventually I discovered that the problem was caused by an inconsistency between the character set used to store the page file and the character set declared in the Meta tag. Aptana's editor uses the UTF-8 character set by default, so the file was actually saved as UTF-8. But because gb2312 was specified in the Meta tag, the browser parsed the document as gb2312, and the string "中文" naturally came out wrong: "中文" encoded in UTF-8 is the bytes 0xE4 0xB8 0xAD 0xE6 0x96 0x87, and when the browser decodes those 6 bytes as gb2312 (in which one Chinese character occupies two bytes), it obtains three other characters, "涓枃". Passing those three characters to encodeURI produces %E6%B6%93%EE%85%9F%E6%9E%83. Therefore, encodeURI always uses UTF-8 and is not affected by the page character set.

Other issues related to Url encoding

Different browsers behave differently when handling Urls that contain Chinese. For example, in IE, if you check the advanced setting "Always send Urls as UTF-8", Chinese in the path part of the Url is Url encoded with UTF-8 before being sent to the server, while Chinese in the query parameters is Url encoded using the system default character set. To ensure maximum interoperability, it is recommended that all components placed in a Url explicitly specify a character set for Url encoding rather than relying on the browser's default behavior.

In addition, many HTTP monitoring tools and browser address bars automatically decode the Url once (using the UTF-8 character set) when displaying it, which is why, when you search for Chinese on Google in Firefox, the Url shown in the address bar contains Chinese characters. But the original Url actually sent to the server is still encoded. You can see this by reading location.href with Javascript in the address bar. Don't be fooled by these illusions when studying Url encoding and decoding.

