Why URL Encoding

The article is reproduced from http://www.cnblogs.com/jerrysion/p/5522673.html

 


 

We all know that HTTP passes parameters as "key=value" pairs. To pass multiple parameters, the pairs are joined with the "&" character, as in "?name1=value1&name2=value2". When the server receives such a string, it splits the parameters on "&" and then splits each parameter into key and value on "=".

 

For "name1=value1&name2=value2", here is a conceptual view of how it is parsed on its way from client to server: 
  the string above is represented in ASCII as: 
  6E616D6531 3D 76616C756531 26 6E616D6532 3D 76616C756532. 
   6E616D6531: name1 
   3D: = 
   76616C756531: value1 
   26: &
   6E616D6532: name2 
   3D: = 
   76616C756532: value2
 
   After receiving the data, the server can walk the byte stream one byte at a time. When it reaches the byte 3D, it knows that the bytes consumed so far form a key. It then keeps reading; when it encounters 26, it knows that the value of that key is the run of bytes between the 3D and the 26 just seen. Continuing in this way, every parameter passed by the client can be parsed.
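The parsing loop described above can be sketched in a few lines of JavaScript (a minimal illustration only, not how real servers are implemented; the function name parseQuery is our own):

```javascript
// Minimal sketch of the parsing described above: split the parameters
// on "&", then split each parameter into key and value on the first "=".
function parseQuery(query) {
  const params = {};
  for (const pair of query.split("&")) {
    const eq = pair.indexOf("=");      // position of the first "="
    const key = pair.slice(0, eq);
    const value = pair.slice(eq + 1);
    params[key] = value;
  }
  return params;
}

console.log(parseQuery("name1=value1&name2=value2"));
// → { name1: "value1", name2: "value2" }
```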

   Now a question arises: what if a parameter value itself contains special characters such as = or &? 
For example, take "name1=value1" where value1 is the string "va&lu=e1". The transmitted string then becomes "name1=va&lu=e1". We intended a single key-value pair, but the server will parse it as two key-value pairs: an ambiguity.

How do we resolve this ambiguity? The solution is to URL-encode the parameters. 
   URL encoding simply replaces each byte of a special character with "%" followed by that byte's hexadecimal value. For example, URL-encoding the troublesome value above yields "name1=va%26lu%3De1". The server then treats the bytes following "%" as ordinary data, not as separators between parameters or between key and value.
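In JavaScript, this is exactly what encodeURIComponent does for a single value (a small sketch):

```javascript
// Encode a value that contains the separators "&" and "=", then rebuild
// the key-value pair so the server sees exactly one parameter.
const value = "va&lu=e1";
const encoded = encodeURIComponent(value);

console.log(encoded);            // → "va%26lu%3De1"
console.log("name1=" + encoded); // → "name1=va%26lu%3De1"
```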


Another question: why transmit in ASCII at all; could another encoding be used? 
    Of course it could. You could even define your own encoding and parse it yourself, just as most countries have their own language. But how do countries communicate with each other? They use English, because English is the most widely adopted. Likewise, ASCII is simply the common convention.

 

 

Generally, if something needs to be encoded, it is not suitable for transmission as-is. The reasons vary: the data may be too large, or it may contain private information. For Urls, the reason for encoding is that some characters in a Url would cause ambiguity.

  For example, the Url parameter string passes parameters as key=value pairs, with pairs separated by the & character, as in /s?q=abc&ie=utf-8. If a value string itself contains = or &, the server receiving the Url will inevitably parse it incorrectly, so the ambiguous & and = characters must be escaped, that is, encoded.

  For another example, the encoding of a Url is ASCII, not Unicode, which means a Url cannot directly contain any non-ASCII characters, such as Chinese. Otherwise, if the character sets supported by the client browser and the server differ, Chinese characters may cause problems.

The principle of Url encoding is to use safe characters (printable characters with no special purpose or special meaning) to represent those unsafe characters.

  Preliminary knowledge: URI stands for Uniform Resource Identifier; what we usually call a URL is just one kind of URI. The format of a typical URI is shown below. The "Url encoding" discussed here should really be called URI encoding.

  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment

   Which characters need to be encoded

  RFC 3986 stipulates that only English letters (a-zA-Z), digits (0-9), the four special characters -_.~, and the reserved characters may appear unencoded in a Url. The document gives detailed guidance on encoding and decoding Urls, points out which characters must be encoded to avoid changing a Url's semantics, and explains why those characters need encoding.

  Characters with no corresponding printable character in US-ASCII: only printable characters are allowed in Urls. The bytes 00-1F and 7F in US-ASCII represent control characters, none of which may appear directly in a Url. Likewise, the bytes 80-FF (ISO-8859-1) fall outside the range defined by US-ASCII and cannot be placed in a Url either.

  Reserved characters: a Url can be divided into several components: scheme, host, path, and so on. Some characters (:/?#[]@) separate components; for example, the colon separates scheme and host, / separates host and path, and ? separates path and query. Other characters (!$&'()*+,;=) separate sub-components; for example, = expresses key-value pairs within the query, and & separates multiple key-value pairs. When ordinary data inside a component contains these special characters, it must be encoded.

  The following characters are designated as reserved characters in RFC3986: ! * ' ( ) ; : @ & = + $ , / ? # [ ]

  Unsafe characters: There are also characters that, when placed directly in the Url, may cause ambiguity to the parser. These characters are considered unsafe characters for a number of reasons.

  • Spaces: while a Url is being passed around, typed by a user, or handled by a text-processing program, insignificant spaces may be introduced, or significant spaces removed.
  • Quotes and <>: quotation marks and angle brackets are often used to delimit Urls in ordinary text.
  • #: Usually used to represent bookmarks or anchors
  • %: The percent sign itself is used as a special character when encoding unsafe characters, so it needs to be encoded by itself
  • {}|\^[]`~: Some gateways or transport agents will tamper with these characters

  Note that for legal characters in a Url, the encoded and unencoded forms are equivalent, but for the characters listed above, leaving them unencoded may change the Url's semantics. Therefore, only ordinary English letters and digits, the special characters $-_.+!*'(), and the reserved characters may appear unencoded in a Url; all other characters must be encoded before appearing in a Url.

  However, for historical reasons, some non-standard encoding implementations still exist. For example, although RFC 3986 states that the tilde ~ does not require Url encoding, many old gateways and transport agents encode it anyway.

  How to encode illegal characters in Url

  Url encoding is commonly known as percent-encoding because its scheme is very simple: a percent sign followed by two hexadecimal characters (0123456789ABCDEF) representing one byte. The default character set for Url encoding is US-ASCII. For example, the byte for "a" in US-ASCII is 0x61, so its Url encoding is %61; entering http://g.cn/search?q=%61%62%63 in the address bar is equivalent to searching Google for abc. Similarly, the byte for the @ symbol in ASCII is 0x40, which becomes %40 after Url encoding.
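The mechanics can be sketched in a few lines of JavaScript (an illustration only; the helper name percentEncodeAll is our own, and unlike real encoders it encodes every byte, including unreserved ones):

```javascript
// Percent-encode every byte of a string: "%" plus two uppercase hex digits.
// TextEncoder yields UTF-8 bytes; for pure ASCII input these are identical
// to the US-ASCII bytes discussed above.
function percentEncodeAll(str) {
  const bytes = new TextEncoder().encode(str);
  return Array.from(bytes)
    .map(b => "%" + b.toString(16).toUpperCase().padStart(2, "0"))
    .join("");
}

console.log(percentEncodeAll("abc")); // → "%61%62%63"
console.log(percentEncodeAll("@"));   // → "%40"
```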

  For non-ASCII characters, a superset of the ASCII character set must be used to obtain the corresponding bytes, and then each byte is percent-encoded. For Unicode characters, the RFC recommends encoding them as UTF-8 and percent-encoding each resulting byte. For example, the string "中文" ("Chinese") encoded in UTF-8 yields the bytes 0xE4 0xB8 0xAD 0xE6 0x96 0x87, which after Url encoding become "%E4%B8%AD%E6%96%87".

  If a byte corresponds to an unreserved character in the ASCII character set, it need not be percent-encoded. For example, "Url编码" ("Url encoding") in UTF-8 gives the bytes 0x55 0x72 0x6C 0xE7 0xBC 0x96 0xE7 0xA0 0x81; since the first three bytes correspond to the unreserved characters "Url" in ASCII, they can be left as the literal characters "Url". The final encoding can thus be simplified to "Url%E7%BC%96%E7%A0%81", although the fully encoded form "%55%72%6C%E7%BC%96%E7%A0%81" is equally valid.
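JavaScript's encodeURIComponent applies exactly this recommendation: UTF-8 bytes, percent-encoded, with unreserved characters left alone:

```javascript
// Non-ASCII characters are encoded as UTF-8 bytes, each percent-encoded;
// unreserved ASCII characters such as "U", "r", "l" pass through untouched.
console.log(encodeURIComponent("中文"));    // → "%E4%B8%AD%E6%96%87"
console.log(encodeURIComponent("Url编码")); // → "Url%E7%BC%96%E7%A0%81"
```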

  For historical reasons, there are some Url encoding implementations that do not fully follow this principle, which will be mentioned below.

  The difference between escape, encodeURI and encodeURIComponent in Javascript

  JavaScript provides three pairs of functions for encoding a Url into a legal Url: escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent. Since decoding is simply the reverse of encoding, only the encoding functions are discussed here.

  These three encoding functions—escape, encodeURI, encodeURIComponent—are all used to convert unsafe and illegal Url characters into legal Url character representations. They have the following differences.

  The safe characters are different:

  The safe characters for these three functions are listed below (i.e. the function does not encode these characters)

  • escape (69 characters): */@+-._0-9a-zA-Z
  • encodeURI (82 characters): !#$&'()*+,/:;=?@-._~0-9a-zA-Z
  • encodeURIComponent (71 characters): !'()*-._~0-9a-zA-Z

  Compatibility differs: escape has existed since JavaScript 1.0, while the other two functions were introduced in JavaScript 1.5. Since JavaScript 1.5 is now ubiquitous, encodeURI and encodeURIComponent have no compatibility problems in practice.

  Unicode characters are encoded differently: the three functions encode ASCII characters identically, as a percent sign plus two hexadecimal digits. For Unicode characters, however, escape produces %uxxxx, where xxxx is the 4-digit hexadecimal code of the character. This form has been deprecated by the W3C, although its syntax is still retained in the ECMA-262 standard. encodeURI and encodeURIComponent encode non-ASCII characters as UTF-8 and then percent-encode the bytes, as the RFC recommends. Therefore, prefer these two functions over escape whenever possible.

  Their applications differ: encodeURI encodes a complete URI, while encodeURIComponent encodes a single URI component. From the safe-character table above, encodeURIComponent encodes a larger range of characters than encodeURI. As noted earlier, reserved characters separate URI components (a URI can be divided into multiple components; see the preliminary-knowledge section) or sub-components (such as the separators between query parameters); for example, : separates the scheme from the host, and ? separates the path from the query. Since encodeURI operates on a complete URI, in which these characters have special purposes, it leaves the reserved characters unencoded; otherwise the URI's meaning would change.

  Each component has its own data format, but that data must not contain the reserved characters that separate components; otherwise, the division of the whole URI into components would be confused. That is why encodeURIComponent, which operates on a single component, must encode more characters.
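A typical pattern, then, is to encode each component's data with encodeURIComponent before assembling the URI (a small sketch; the example Url is ours):

```javascript
// Encode only the data inside the component; the separators "?", "=", "&"
// are added afterwards and keep their special meaning.
const q = "a&b=c"; // data that happens to contain reserved characters
const url = "http://g.cn/search?q=" + encodeURIComponent(q);

console.log(url); // → "http://g.cn/search?q=a%26b%3Dc"
```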

  Form submission

  When an HTML form is submitted, each form field is Url-encoded before being sent. For historical reasons, the Url encoding used by forms does not follow the latest standard: for example, a space is encoded not as %20 but as the + sign. If the form is submitted with the POST method, the HTTP headers include a Content-Type header with the value application/x-www-form-urlencoded. Most applications can handle this non-standard Url encoding, but client-side JavaScript has no built-in function that decodes the + sign back into a space, so you have to write the conversion yourself. Also, for non-ASCII characters, the character set used for encoding depends on the character set of the current document. For example, suppose we add the following to the HTML header:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

  The browser will then render the document as gb2312. (Note that when this meta tag is absent, the browser chooses a character set automatically based on the current user's preferences, and the user can also force a specific character set for the current site.) When the form is submitted, the character set used for Url encoding is gb2312.
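As mentioned above, client-side JavaScript has no built-in decoder for the form-style + sign, but the conversion is short to write yourself (the helper name decodeFormComponent is our own):

```javascript
// application/x-www-form-urlencoded uses "+" for spaces; convert "+" back
// to a space before handing the rest to decodeURIComponent.
function decodeFormComponent(s) {
  return decodeURIComponent(s.replace(/\+/g, " "));
}

console.log(decodeFormComponent("a+b%26c")); // → "a b&c"
```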

  I once ran into a very confusing problem when using Aptana (why Aptana specifically will become clear below): with encodeURI, the encoded result was quite different from what I expected. Here is my sample code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
    </head>
    <body>
        <script type="text/javascript">
            document.write(encodeURI("中文"));
        </script>
    </body>
</html>

  The output is %E6%B6%93%EE%85%9F%E6%9E%83. This is clearly not the result of Url-encoding with the UTF-8 character set (searching Google for "中文" shows %E4%B8%AD%E6%96%87 in the Url).

  So at the time I wondered whether encodeURI also depends on the page encoding. Yet normally, even Url-encoding with gb2312 would not produce this result. I eventually discovered the cause: the character set used to store the page file did not match the character set declared in the meta tag. Aptana's editor uses UTF-8 by default, so the file was actually stored as UTF-8. But because the meta tag declared gb2312, the browser parsed the document as gb2312, and the string "中文" naturally went wrong there: "中文" encoded in UTF-8 is the six bytes 0xE4 0xB8 0xAD 0xE6 0x96 0x87, and when the browser decodes those six bytes as gb2312 it gets three other characters, "涓枃" (in GBK a Chinese character occupies two bytes). Passing those three characters to encodeURI yields %E6%B6%93%EE%85%9F%E6%9E%83. So encodeURI does use UTF-8 and is not affected by the page's character set.

  Different browsers handle Urls containing Chinese differently. In IE, for example, if the advanced option "Always send URLs as UTF-8" is checked, the Chinese in the path portion of the Url is Url-encoded as UTF-8 before being sent to the server, while the Chinese in the query parameters is Url-encoded with the system default character set. For maximum interoperability, it is recommended to explicitly Url-encode every component placed in a Url with a chosen character set, rather than relying on the browser's default behavior.

  In addition, many HTTP monitoring tools and browser address bars automatically decode the Url once (using the UTF-8 character set) when displaying it, which is why, when you search Google for Chinese in Firefox, the address bar shows a Url containing Chinese. But the original Url sent to the server is still encoded, as you can see by evaluating location.href in JavaScript from the address bar. Don't be misled by these illusions when studying Url encoding and decoding.
