5 Ways to Handle HTML Escape Characters in Python

Writing a crawler is a process of sending requests, extracting data, cleaning data, and storing data. In this process, the data formats returned by different data sources are different, including JSON format and XML documents, but most of them are HTML documents. HTML is often mixed with transfer characters. We need to escape these characters into real character of.

what is an escape character

In HTML,  <, >, & and other characters have special meanings (<, > are used in tags, & is used for escape), they cannot be used directly in HTML code, if you want to display these symbols in web pages, you need to use HTML conversion Escape Sequence, for example  < , the escape character is  &lt;, when the browser renders the HTML page, it will automatically replace the escape string with real characters.

The escape character (Escape Sequence) consists of three parts: the first part is an ampersand, the second part is the entity (Entity) name, and the third part is a semicolon. For example, to display the less than sign (<), you can write &lt; .

Python unescape string

There are many ways to deal with escaped strings in Python, and the way in py2 and py3 is different. In python2, the module for escaping escaped strings is  HTMLParser.

# python2
import HTMLParser
>>> HTMLParser().unescape('a=1&amp;b=2')
'a=1&b=2'

Python3 migrates the HTMLParser module to html.parser

# python3
>>> from html.parser import HTMLParser
>>> HTMLParser().unescape('a=1&amp;b=2')
'a=1&b=2'

After python3.4, the unescape method has been added to the html module.

# python3.4
>>> import html
>>> html.unescape('a=1&amp;b=2')
'a=1&b=2'

The last way of writing is recommended, because the HTMLParser.unescape method has been deprecated and deprecated in Python 3.4, which means that later versions may be completely removed.

In addition, the sax module also has functions that support escaping

>>> from xml.sax.saxutils import unescape
>>> unescape('a=1&amp;b=2')
'a=1&b=2'

Of course, you can completely implement your own anti-escaping function, which is not complicated. Of course, we advocate not reinventing the wheel.

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=326269947&siteId=291194637