URL percent-encoding and character entities

1. Percent-encoding

Percent-encoding is a mechanism for encoding 8-bit characters that have a specific meaning in the context of URLs. It is sometimes called URL encoding. The encoding is a substitution: "%" followed by the hexadecimal representation of the ASCII value of the replaced character.

The special characters that need to be encoded are:

Special character    Percent-encoding
:                    %3A
/                    %2F
?                    %3F
@                    %40
$                    %24
&                    %26
=                    %3D
%                    %25
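
The table can be reproduced with a few lines of Java. This is only a minimal sketch of the rule "% followed by the hexadecimal value"; the class name PercentEncodeSketch and the method percentEncode are made up for illustration, and unlike a real URL encoder this helper encodes every byte it is given, including letters and digits.

```java
import java.nio.charset.StandardCharsets;

final class PercentEncodeSketch {
    // Hypothetical helper: encodes EVERY byte as "%" plus two uppercase hex digits.
    static String percentEncode(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            out.append('%').append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints one line per character from the table above, e.g. ": -> %3A"
        for (String c : new String[] {":", "/", "?", "@", "$", "&", "=", "%"}) {
            System.out.println(c + " -> " + percentEncode(c));
        }
    }
}
```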

For further reference, see: https://developer.mozilla.org/zh-CN/docs/Glossary/percent-encoding

2. Character entities

To display reserved characters correctly in HTML, we must replace them with character entities in the HTML source code. Some useful character entities in HTML:

Displayed result   Description            Entity name                    Entity number
(space)            Non-breaking space     &nbsp;                         &#160;
<                  Less-than sign         &lt;                           &#60;
>                  Greater-than sign      &gt;                           &#62;
&                  Ampersand              &amp;                          &#38;
"                  Quotation mark         &quot;                         &#34;
'                  Apostrophe             &apos; (not supported by IE)   &#39;
¥                  Yuan (yen)             &yen;                          &#165;
©                  Copyright              &copy;                         &#169;
®                  Registered trademark   &reg;                          &#174;
™                  Trademark              &trade;                        &#8482;
×                  Multiplication sign    &times;                        &#215;
÷                  Division sign          &divide;                       &#247;
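
As a rough illustration of how the reserved characters in this table are replaced, here is a small sketch using only the Java standard library. The class HtmlEntitySketch and its escapeHtml method are hypothetical names, not a library API; only the five characters that are reserved in HTML markup are handled, and the apostrophe uses the numeric form because &apos; is not supported everywhere.

```java
final class HtmlEntitySketch {
    // Hypothetical helper: replaces the five markup-reserved characters with entities.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break; // numeric form instead of &apos;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: &lt;b&gt;Tom &amp; Jerry&#39;s&lt;/b&gt;
        System.out.println(escapeHtml("<b>Tom & Jerry's</b>"));
    }
}
```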

For further reference, see: https://www.w3school.com.cn/tags/html_ref_entities.html

3. Character entities appearing in a URL

From the characters listed above, we can see that some special characters, such as "&", have both a percent-encoded form and a corresponding character entity. Consider the following scenario.

An HTML source document contains a URL, and the URL contains an "&" character. Should the "&" be encoded as %26 or as &amp;?

  • If the URL is only rendered on the page as text, it can be treated as plain HTML content, so it should be escaped according to the character-entity rules;
  • If the URL is placed in an <a> tag as a link target, then to ensure that the URL can be parsed correctly, its content (for example, a parameter value that contains "&") needs to be percent-encoded.
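
The two cases above can be sketched side by side. The class AmpersandContextSketch and both method names are invented for this illustration; forDisplay is deliberately minimal and handles only "&", while forQueryValue relies on the standard java.net.URLEncoder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class AmpersandContextSketch {
    // Case 1 (hypothetical helper): the text is shown on the page, so "&" becomes "&amp;".
    static String forDisplay(String text) {
        return text.replace("&", "&amp;");
    }

    // Case 2 (hypothetical helper): the text is a query-parameter value, so "&" becomes "%26".
    static String forQueryValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(forDisplay("a&b"));    // prints a&amp;b
        System.out.println(forQueryValue("a&b")); // prints a%26b
    }
}
```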

For example, suppose we process the HTML with the following code:

String htmlContext = "<body>http://localhost/path?age=123&name=ddd</body>";
String html = org.apache.commons.lang.StringEscapeUtils.escapeHtml(htmlContext);

Then the final HTML output is:

&lt;body&gt;http://localhost/path?age=123&amp;name=ddd&lt;/body&gt;

It can be seen that the "&" character in the URL is replaced by "&amp;", but this is probably not what we want. What we want is to percent-encode the "&" as "%26".

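If we instead use

String html = java.net.URLEncoder.encode(htmlContext, "UTF-8");

then every reserved character is percent-encoded, including the HTML tags and the URL's own delimiters, which is not what we want either: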

%3Cbody%3Ehttp%3A%2F%2Flocalhost%2Fpath%3Fage%3D123%26name%3Dddd%3C%2Fbody%3E

Of course, the correct approach is to percent-encode the hyperlink URL first, insert it into the HTML document, and only then apply character-entity escaping to the HTML document.
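
One possible way to follow this order is sketched below, using the http://localhost/path URL from the example above. The class LinkBuilderSketch and its methods are hypothetical, and the sketch assumes that only the value of the name parameter may contain reserved characters, so only that value is percent-encoded before the finished URL is entity-escaped.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class LinkBuilderSketch {
    // Step 1 helper: percent-encode a single parameter value, not the whole URL.
    static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Step 2 helper: minimal entity escaping; "&" must be replaced first.
    static String escapeHtml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    // Hypothetical builder: percent-encode first, then entity-escape the finished URL.
    static String buildLink(String name) {
        String url = "http://localhost/path?age=123&name=" + encodeValue(name);
        String escaped = escapeHtml(url);
        return "<a href=\"" + escaped + "\">" + escaped + "</a>";
    }

    public static void main(String[] args) {
        System.out.println(buildLink("d&d"));
    }
}
```

With the input d&d, the "&" inside the value becomes %26, while the "&" that separates the two parameters becomes &amp; in the HTML source, so the browser still sees two parameters after decoding the entity.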

Origin blog.51cto.com/dengshuangfu/2675342