URL encoding

URL encoding is also known as percent-encoding. The method is simple: write a single byte as two hexadecimal digits and put a percent sign in front of it.
For example, the string "abc", converted to its ASCII byte sequence, is written in hexadecimal as:
        61 62 63
Percent-encoding it means adding "%" before each byte, which gives:
        %61%62%63
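
The same transformation can be sketched in a few lines of Python. The helper name percent_encode_all is made up for this illustration; it encodes every byte without exception, which real URL encoders do not do.

```python
def percent_encode_all(text: str) -> str:
    """Percent-encode every byte of an ASCII string, with no exemptions."""
    # Each byte becomes "%" followed by two uppercase hexadecimal digits.
    return "".join(f"%{byte:02X}" for byte in text.encode("ascii"))


print(percent_encode_all("abc"))  # %61%62%63
```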

Not every character in a URL needs to be percent-encoded. RFC 3986 (the URI specification) stipulates that reserved characters and unreserved characters may appear without encoding, and that all other characters must be percent-encoded.
RFC 1738 (the URL specification) also specifies that reserved and unreserved characters can be used directly in URLs.

Reserved characters:
       ! * ' ( ) ; : @ & = + $ , / ? # [ ]
Unreserved characters:
       a-z A-Z 0-9 - _ . ~

In a URL, a reserved character has a special meaning in a specific context. If such a character is to be used as ordinary data in that context, it must be percent-encoded.
For example, "/" is the path separator in a URL. If a path segment itself contains this character as data, the character must be percent-encoded; otherwise it cannot be told apart from a real path separator.
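
As a small illustration, Python's urllib.parse.quote treats "/" as safe by default, so passing safe="" is what forces it to be encoded. The segment value and the "/files/" prefix are hypothetical.

```python
from urllib.parse import quote

# Hypothetical single path segment that contains "/" as data.
segment = "reports/2024"

# safe="" removes "/" from the characters quote() leaves unencoded.
encoded_segment = quote(segment, safe="")
url_path = "/files/" + encoded_segment

print(url_path)  # /files/reports%2F2024
```

With the default safe="/", the segment would come out unchanged and the path would appear to have three segments instead of two.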

The other group that must be percent-encoded is "other characters": everything that is neither reserved nor unreserved, such as non-printable ASCII characters and Chinese characters.

As described above, before a character can be percent-encoded it must first be turned into bytes, which means converting it to a byte stream under some character encoding. RFC 1738 does not say which character encoding to use (GBK, UTF-8, and so on), so each program (a browser, for example) went its own way. In January 2005, RFC 3986 settled the matter by specifying that "other characters" should first be converted to a UTF-8 byte sequence, and that the resulting byte values are then percent-encoded.
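
To see why the choice of character encoding matters, the sketch below encodes the same character under UTF-8 and GBK using Python's standard library. The helper hex_bytes is a made-up name used only for display.

```python
from urllib.parse import quote


def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated uppercase hex pairs (display helper)."""
    return " ".join(f"{b:02X}" for b in data)


ch = "中"
print(hex_bytes(ch.encode("utf-8")))  # E4 B8 AD
print(hex_bytes(ch.encode("gbk")))    # D6 D0

# The percent-encoded form depends entirely on which bytes were produced.
print(quote(ch, encoding="utf-8"))  # %E4%B8%AD
print(quote(ch, encoding="gbk"))    # %D6%D0
```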

The process of percent-encoding the string "a中" is roughly as follows:
   1) Convert the string into a byte stream using UTF-8 encoding: 0x61 E4 B8 AD
   2) Take the next byte in sequence: is it a character that may be left unencoded (an unreserved character, or a reserved character used in its special role)?
   3) Yes: the byte does not need to be encoded; output the ASCII character it represents directly
   4) No: the byte needs to be encoded; output "%" followed by the byte's two-digit uppercase hexadecimal form
   5) If there is a next byte, go back to step 2), and repeat this cycle until the encoding is complete
   6) The final result is "a%E4%B8%AD"
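
The steps above can be sketched as follows. This is a minimal illustration, not a full encoder: it assumes the input is pure data, so only unreserved characters are passed through and every reserved character is encoded. The names UNRESERVED and percent_encode are chosen for this example.

```python
# Byte values of the unreserved characters: a-z A-Z 0-9 - _ . ~
UNRESERVED = frozenset(
    b"abcdefghijklmnopqrstuvwxyz"
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"0123456789-_.~"
)


def percent_encode(text: str) -> str:
    """Percent-encode text as data: only unreserved bytes pass through."""
    out = []
    for byte in text.encode("utf-8"):   # step 1: string -> UTF-8 bytes
        if byte in UNRESERVED:          # steps 2-3: output as-is
            out.append(chr(byte))
        else:                           # step 4: "%" + uppercase hex
            out.append(f"%{byte:02X}")
    return "".join(out)                 # step 5 is the loop itself


print(percent_encode("a中"))  # a%E4%B8%AD
```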

The process of decoding the final result "a%E4%B8%AD" back to the string "a中" is as follows:
   1) Convert the string into a byte stream
   2) Take the next byte in sequence, call its position i, and check whether the byte is '%'
   3) No: output the byte directly
   4) Yes: convert the hex digits at positions (i+1) and (i+2) to their numeric values, shift the first left by 4 bits, add the second, and output the resulting byte
   5) Skip the two hex digits; if there is a next byte, go back to step 2)
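
A minimal sketch of this decoding loop, under the assumption that the input is well formed (every "%" is followed by two hex digits); malformed input is not handled here and would raise an exception. The function name percent_decode is chosen for this example.

```python
def percent_decode(encoded: str) -> str:
    """Decode a well-formed percent-encoded ASCII string into UTF-8 text."""
    data = encoded.encode("ascii")      # step 1: string -> bytes
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ord("%"):         # step 2: is this byte '%'?
            # step 4: hex digit values, high nibble shifted left by 4 bits
            high = int(chr(data[i + 1]), 16)
            low = int(chr(data[i + 2]), 16)
            out.append((high << 4) | low)
            i += 3                      # step 5: skip "%" and both digits
        else:
            out.append(data[i])         # step 3: output directly
            i += 1
    # The collected bytes are interpreted as UTF-8 to recover the text.
    return out.decode("utf-8")


print(percent_decode("a%E4%B8%AD"))  # a中
```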

Repeating this cycle until no bytes remain completes the decoding; the output bytes are then read as UTF-8.

With the above knowledge, let's see whether the browsers' encoding of URLs matches the specification.
First, the composition of a URL:
   {http://www.jd.com[/app/China]} ? (name=val)
   {}: the URL (absolute URI)
   []: the URI path (a relative URI; its meaning depends on the specific environment)
   (): the query string
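
For reference, the same example URL can be split into these parts with Python's urllib.parse.urlsplit. This only shows where the path and the query string begin and end; it does not decode anything.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.jd.com/app/China?name=val")

print(parts.scheme)  # http
print(parts.netloc)  # www.jd.com
print(parts.path)    # /app/China
print(parts.query)   # name=val
```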

For URLs typed directly into the address bar, IE8, Chrome, and Firefox all percent-encode the path part using UTF-8.
For the query string, however, IE8 sends the raw GBK bytes without percent-encoding them (it may be using the operating system's encoding), while Chrome and Firefox percent-encode using UTF-8.

For URLs embedded in web pages, IE8, Chrome, and Firefox all percent-encode the path part using UTF-8.
For the query string part, these three browsers use the character encoding specified in the HTTP response header, for example
  Content-Type: text/html; charset=gbk
and if the header does not specify one, they use the encoding specified in the page itself, for example
  <meta http-equiv="Content-Type" content="text/html;charset=gbk">
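
One consequence is that whoever receives such a query string has to decode it with the same charset the page declared. The sketch below assumes a value that was percent-encoded from GBK bytes; the variable names are made up for this illustration.

```python
from urllib.parse import unquote

# "中" percent-encoded from its GBK bytes, as a GBK page would send it.
raw_query_value = "%D6%D0"

# Decoding with the charset the page declared recovers the character.
as_gbk = unquote(raw_query_value, encoding="gbk")
print(as_gbk)  # 中

# Decoding as UTF-8 does not: these bytes are not valid UTF-8, so
# errors="replace" substitutes replacement characters instead.
as_utf8 = unquote(raw_query_value, encoding="utf-8", errors="replace")
print(as_utf8 == "中")  # False
```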

From the above, we can see that every browser encodes the path part of a URL consistently with the specification, while the handling of the query string part varies.

In addition, let's look at JavaScript's encodeURI() method, which does not encode reserved characters. For example, the following characters are not encoded:
  a-z A-Z 0-9 - _ . ! ~ * ' ( ) ; / ? : @ & = + $ , #
So if the data part of a URL contains reserved characters, then after encoding the data with this method it may be impossible to tell whether a character is part of the data or part of the URL structure (such as a path separator).
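
The ambiguity can be illustrated in Python by approximating encodeURI() with urllib.parse.quote and a matching safe list. This is an emulation for demonstration only, not JavaScript's implementation; encode_uri_like, the example.com URL, and the sample value are all hypothetical.

```python
from urllib.parse import quote

# Characters the text lists as left alone by encodeURI(), besides letters,
# digits and "-_.", which quote() never encodes.
ENCODE_URI_SAFE = "!~*'();/?:@&=+$,#"


def encode_uri_like(text: str) -> str:
    """Rough Python approximation of JavaScript's encodeURI()."""
    return quote(text, safe=ENCODE_URI_SAFE, encoding="utf-8")


# Hypothetical value meant to be ONE query parameter.
data = "a&b=c"
url = "http://example.com/s?q=" + encode_uri_like(data)

print(encode_uri_like(data))  # a&b=c
print(url)                    # http://example.com/s?q=a&b=c
```

The resulting query string now looks like two parameters, q=a and b=c, even though only one value was intended.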

For this reason JavaScript also has the encodeURIComponent() method. As the name suggests, it encodes a "component" of a URL; after encoding, it is clear whether a character belongs to the component or is a special separator of the URL.
This method does not encode unreserved characters, such as:
   a-z A-Z 0-9 - _ . ~
It also leaves ! * ' ( ) unencoded. All other characters are encoded.
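
A comparable approximation of encodeURIComponent() in Python narrows the safe list so that separators such as "&", "=" and "/" are encoded. As before, this is an emulation under stated assumptions; encode_uri_component_like and the example.com URL are hypothetical.

```python
from urllib.parse import quote


def encode_uri_component_like(text: str) -> str:
    """Rough Python approximation of JavaScript's encodeURIComponent()."""
    # Letters, digits and "-_." are never encoded by quote(); the safe
    # list adds the few extra characters encodeURIComponent() leaves alone.
    return quote(text, safe="!~*'()", encoding="utf-8")


# Hypothetical value meant to be ONE query parameter.
value = "a&b=c/d"
url = "http://example.com/s?q=" + encode_uri_component_like(value)

print(encode_uri_component_like(value))  # a%26b%3Dc%2Fd
print(url)                               # http://example.com/s?q=a%26b%3Dc%2Fd
```

Because the separators inside the value are now encoded, the query string contains exactly one parameter.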
