URL character encoding

Introduction to URLs

Overview

URL is an abbreviation of "Uniform Resource Locator", commonly called a "web address": the Internet address of a resource.

A resource can be understood simply as any file that can be accessed over the Internet, such as a web page, an image, an audio or video file, or a JavaScript script. A resource can only be retrieved if its URL is known.

As long as a resource is accessible via the Internet, it must have a corresponding URL. A URL corresponds to a resource, but the same resource may correspond to multiple URLs.

URLs are the foundation of the Internet. The Internet is "interconnected" because web pages can reference other URLs through links: with a click, users jump from one URL to another and move between websites.

URL characters

Characters that can be used directly in a URL

  • The 26 English letters (uppercase and lowercase)
  • Digits (0-9)
  • Hyphen (-)
  • Dot (.)
  • Underscore (_)

URL character encoding (escaping method)

  • ASCII
  • Unicode (UTF-8)
  • URL encoding (urlcode)

ASCII code

A character is escaped in a URL as a percent sign (%) followed by the character's two-digit hexadecimal ASCII code.

For example,

www.baidu.com

can be escaped as www.%62%61%69%64%75.com and is still recognized by the browser.
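
The escaping rule above can be sketched in a few lines of Python. The helper name `percent_escape` is made up for this illustration and assumes ASCII-only input; `unquote` is the standard-library function that reverses the escaping.

```python
from urllib.parse import unquote


def percent_escape(text: str) -> str:
    """Escape every character as '%' plus its two-digit hex code.

    Hypothetical helper for illustration; assumes ASCII-only input.
    """
    return "".join(f"%{ord(ch):02x}" for ch in text)


escaped_host = "www." + percent_escape("baidu") + ".com"
print(escaped_host)           # www.%62%61%69%64%75.com
print(unquote(escaped_host))  # www.baidu.com
```

Decoding the escaped form gives back the original host name, which is why the browser treats both spellings the same way.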

Unicode

Unicode assigns a unique number (a code point) to each letter, symbol, or ideograph. Each number represents a unique symbol used in at least one language.

Unicode is a very large collection: its code space has room for more than 1 million symbols. Each symbol has its own code point. For example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严. For the full table of symbols, see unicode.org or a dedicated table of Chinese characters.
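
As a quick check of the code points mentioned above, here is a small Python sketch using the built-in `ord` and `chr`. The variable names are chosen only for this illustration.

```python
# The three symbols from the text, written as escapes:
# A, Arabic letter Ain, and the Chinese character 严.
symbols = ["\u0041", "\u0639", "\u4e25"]

# ord() returns the Unicode code point of a one-character string.
labels = [f"U+{ord(s):04X}" for s in symbols]
print(labels)  # ['U+0041', 'U+0639', 'U+4E25']

# chr() goes the other way, from a code point to the character.
print(chr(0x4E25))  # 严
```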

UTF-8

UTF-8 is a variable-length encoding: the number of bytes used for a symbol depends on the size of its Unicode code point. Smaller numbers use fewer bytes and larger numbers use more bytes, from 1 to 4.
The encoding rules for UTF-8 are:

  • For a single-byte symbol, the first bit of the byte is 0 and the remaining 7 bits hold the Unicode code point of the symbol. For English letters, the UTF-8 encoding is therefore the same as the ASCII code.

  • For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0. Each following byte starts with the two bits 10. All remaining bits are filled with the Unicode code point of the symbol.

The relationship between Unicode code point ranges and UTF-8 binary format

Code point range (decimal)              UTF-8 binary format
0x00 - 0x7f (0 - 127)                   0xxxxxxx
0x80 - 0x7ff (128 - 2047)               110xxxxx 10xxxxxx
0x800 - 0xffff (2048 - 65535)           1110xxxx 10xxxxxx 10xxxxxx
0x10000 - 0x10ffff (65536 - 1114111)    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

for example

The Unicode code point of the character 中 is 4e2d (binary 0100 1110 0010 1101). In the table above, 4e2d falls in the third row (0x800 - 0xffff), so its 16 bits fill the x positions of 1110xxxx 10xxxxxx 10xxxxxx. The UTF-8 encoding of 中 is therefore 11100100 10111000 10101101, which is e4b8ad in hexadecimal.
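
The table lookup and bit filling can be sketched by hand in Python. The function name `utf8_encode` is hypothetical and exists only for this illustration (it does not reject surrogate code points); in practice the built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by following the range table by hand."""
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0x10FFFF:
        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11110000 | (code_point >> 18),
            0b10000000 | ((code_point >> 12) & 0b111111),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    raise ValueError("code point out of Unicode range")


encoded = utf8_encode(0x4E2D)
print(encoded.hex())  # e4b8ad
```

The result matches what Python's own encoder produces for the same character, `"中".encode("utf-8")`.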

In other words, wherever the character 中 appears in a URL, it must be written as %e4%b8%ad. For example, to access www.example.com/中国.html, the URL needs to be written as www.example.com/%e4%b8%ad%e5%9b%bd.html
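
For comparison, a minimal sketch with Python's standard `urllib.parse.quote`, which percent-encodes the UTF-8 bytes of non-ASCII characters. It emits uppercase hex digits, which are equivalent to the lowercase form used in the text.

```python
from urllib.parse import quote, unquote

path = "中国.html"

# quote() encodes the string as UTF-8 and escapes each non-ASCII byte.
encoded_path = quote(path)
print("www.example.com/" + encoded_path)  # www.example.com/%E4%B8%AD%E5%9B%BD.html

# unquote() reverses the escaping.
print(unquote(encoded_path))  # 中国.html
```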

URL encoding

Introduction to URL encoding
URL encoding (urlcode) is an encoding method applied to the URL string of an HTTP request so that the HTTP server can interpret it. It prevents garbled characters and misunderstandings between the HTTP client and the server.

Why use URL encoding?
When a string is sent to an HTTP server in a URL, Chinese characters and special characters (such as spaces and newlines) cannot appear in it directly, so the URL must be URL-encoded.

  • Every Chinese character has a URL-encoded form.
  • Every special character has a URL-encoded form.
  • English letters encode to themselves and do not change.
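
The behavior for special characters and English letters can be checked with a short Python sketch using the standard `urllib.parse.quote`. The sample strings are made up for this illustration.

```python
from urllib.parse import quote

with_space = quote("hello world")     # a space becomes %20
with_newline = quote("line1\nline2")  # a newline becomes %0A
plain_english = quote("English")      # letters are left unchanged

print(with_space)     # hello%20world
print(with_newline)   # line1%0Aline2
print(plain_english)  # English
```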

The principle of URL decoding
When the HTTP server receives the URL sent by the client, it first URL-decodes it to recover the original query, and then runs its application logic on the result.
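
The decoding step on the server side might look roughly like the following Python sketch. The request URL and parameter names are invented for this example, and a real server framework would normally do this parsing for you.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL as it might arrive from a client.
request_url = "http://www.example.com/search?q=%E4%B8%AD%E5%9B%BD&page=1"

# Split off the query string, then decode it into a dict of lists.
query_string = urlsplit(request_url).query
params = parse_qs(query_string)

print(query_string)  # q=%E4%B8%AD%E5%9B%BD&page=1
print(params)        # {'q': ['中国'], 'page': ['1']}
```

After this step the application logic works with the original text (中国) rather than the escaped bytes.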

Origin blog.csdn.net/bo1029/article/details/131904669