html.parser --- Simple HTML and XHTML parser

Source code: Lib/html/parser.py

This module defines a class HTMLParser which serves as the basis for parsing text files formatted in HTML (HyperText Markup Language) and XHTML.

class html.parser.HTMLParser(*, convert_charrefs=True)

Create a parser instance able to parse invalid markup.

If convert_charrefs is True (the default), all character references (except the ones in script/style elements) are automatically converted to the corresponding Unicode characters.

An HTMLParser instance is fed HTML data and calls handler methods when start tags, end tags, text, comments, and other markup elements are encountered. To implement the desired behavior, subclass HTMLParser and override its methods.

This parser does not check that end tags match start tags, nor does it call the end-tag handler for elements that are closed implicitly by closing an outer element.

Changed in version 3.4: convert_charrefs keyword argument was added.

Changed in version 3.5: The default value for the convert_charrefs parameter is now True.
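
The effect of convert_charrefs can be seen with a small sketch. The RefLogger class, its events list and the run() helper are hypothetical names introduced here for illustration; only the handler methods themselves come from HTMLParser.

```python
from html.parser import HTMLParser


class RefLogger(HTMLParser):
    """Hypothetical helper that records every handler call as a tuple."""

    def __init__(self, *, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def run(convert_charrefs):
    """Parse the same snippet with the given setting and return the events."""
    parser = RefLogger(convert_charrefs=convert_charrefs)
    parser.feed("<p>a &lt; b &#62; c</p>")
    parser.close()
    return parser.events


converted = run(True)   # references arrive already decoded inside the text
raw = run(False)        # references are reported through their own handlers
print(converted)
print(raw)
```

With the default setting only handle_data() is involved and it receives the decoded text. With convert_charrefs=False the same input is split into text pieces plus one handle_entityref() call and one handle_charref() call.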

Example HTML Parser Application

As a basic example, below is a simple HTML parser that uses the HTMLParser class to print out start tags, end tags, and data as they are encountered.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

The output is:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html

HTMLParser Methods

The HTMLParser instance has the following methods:

HTMLParser.feed(data)

Feed some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or close() is called. data must be of type str.

HTMLParser.close()

Force processing of all buffered data as if it were followed by an end-of-file mark. This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call the HTMLParser base class method close().
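
A minimal sketch of feeding data in pieces and redefining close() follows. TextCollector, chunks and text are hypothetical names chosen for this example; the only requirement taken from the description above is that the override calls the base class close().

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical subclass that assembles the document text on close()."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.text = None  # filled in by close()

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Let the base class process whatever is still buffered first.
        super().close()
        self.text = "".join(self.chunks)


collector = TextCollector()
# The input is deliberately split in awkward places, including in the
# middle of the character reference "&amp;".
for piece in ["<p>Hello, ", "wor", "ld &amp", "; more</p>"]:
    collector.feed(piece)
collector.close()
print(collector.text)
```

The text is joined only in close() because handle_data() may deliver it in several pieces when the input arrives in chunks.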

HTMLParser.reset()

Reset the instance. All unprocessed data is lost. Called implicitly during the instantiation phase.

HTMLParser.getpos()

Returns the current line number and offset value.

HTMLParser.get_starttag_text()

Return the text of the most recently opened start tag. This should not normally be needed for structured processing, but may be useful in dealing with HTML "as deployed" or for re-generating input with minimal changes (whitespace between attributes can be preserved, etc.).

The following methods are called when data or markup elements are encountered, and they are meant to be overridden in a subclass. The base class implementations do nothing (except for handle_startendtag()):
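
As an illustration of getpos() and get_starttag_text(), the sketch below records where each start tag was found and how it was written in the source. TagLocator and its found list are hypothetical names used only for this example.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Hypothetical subclass recording where each start tag appeared."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives the position of the tag currently being handled.
        line, offset = self.getpos()
        # get_starttag_text() gives the tag exactly as written in the input.
        self.found.append((tag, line, offset, self.get_starttag_text()))


locator = TagLocator()
locator.feed('<P  Class="intro">\n  <b>bold</b></P>')
locator.close()
for tag, line, offset, source in locator.found:
    print(f"{tag!r} at line {line}, offset {offset}: {source}")
```

The tag name passed to the handler is lowercased, while the text returned by get_starttag_text() keeps the original case and spacing.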

HTMLParser.handle_starttag(tag, attrs)

This method is called to handle the start tag of an element (e.g. <div id="main">).

The tag argument is the name of the tag converted to lower case. The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag's <> brackets. The name is converted to lower case, quotes around the value are removed, and character and entity references are replaced.

For instance, for the tag <A HREF="https://www.cwi.nl/">, this method would be called as handle_starttag('a', [('href', 'https://www.cwi.nl/')]).

All entity references from html.entities are replaced in the attribute values.
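
A common use of the attrs list is collecting link targets. The following is a sketch under the assumption that only href attributes of a elements are of interest; LinkCollector and links are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical subclass that gathers the href of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; names are lowercased.
        href = dict(attrs).get("href")
        if href is not None:
            self.links.append(href)


collector = LinkCollector()
collector.feed('<A HREF="https://www.cwi.nl/">CWI</A>'
               '<a href="/search?q=python&amp;page=2">next</a>'
               '<a name="top">anchor without href</a>')
collector.close()
print(collector.links)
```

The second link shows the reference replacement described above: the &amp; in the source reaches the handler as a plain & character.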

HTMLParser.handle_endtag(tag)

This method is called to handle the end tag of an element (e.g. </div>).

The tag argument is the name of the tag converted to lower case.

HTMLParser.handle_startendtag(tag, attrs)

Similar to handle_starttag(), but called when the parser encounters an XHTML-style empty tag (<img ... />). This method may be overridden by subclasses which require this particular lexical information; the default implementation simply calls handle_starttag() and handle_endtag().
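
The sketch below overrides handle_startendtag() to note which tags were written in the empty-tag form, then defers to the default implementation so that the start and end handlers still run. EmptyTagSpotter, empty_tags and start_tags are hypothetical names.

```python
from html.parser import HTMLParser


class EmptyTagSpotter(HTMLParser):
    """Hypothetical subclass that remembers XHTML-style empty tags."""

    def __init__(self):
        super().__init__()
        self.empty_tags = []
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        self.empty_tags.append(tag)
        # Keep the default behavior: call handle_starttag() and
        # handle_endtag() for this tag as well.
        super().handle_startendtag(tag, attrs)


spotter = EmptyTagSpotter()
spotter.feed('<p>line one<br/>line two<img src="logo.png" /></p>')
spotter.close()
print(spotter.empty_tags)
print(spotter.start_tags)
```

Without the call to the base class method, handle_starttag() would not be invoked for br and img in this input.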

HTMLParser.handle_data(data)

This method is used to process arbitrary data (for example: text nodes and the content of <script>...</script> and <style>...</style> ).

HTMLParser.handle_entityref(name)

This method is called to process a named character reference of the form &name; (e.g. &gt;), where name is a general entity reference (e.g. 'gt'). This method is never called if convert_charrefs is True.

HTMLParser.handle_charref(name)

This method is called to process decimal and hexadecimal numeric character references of the form &#NNN; and &#xNNN;. For example, the decimal equivalent of &gt; is &#62;, whereas the hexadecimal equivalent is &#x3E;; in this case the method will receive '62' or 'x3E'. This method is never called if convert_charrefs is True.

HTMLParser.handle_comment(data)

This method is called when a comment is encountered (e.g. <!--comment-->).

For example, the comment <!-- comment --> will cause this method to be called with the argument ' comment '.

The content of Internet Explorer conditional comments (condcoms) is also sent to this method, so for <!--[if IE 9]>IE9-specific content<![endif]-->, this method will receive '[if IE 9]>IE9-specific content<![endif]'.

HTMLParser.handle_decl(decl)

This method is used to handle HTML doctype declarations (for example <!DOCTYPE html> ).

The decl parameter will be the entire contents of the declaration inside the <!...> markup (e.g. 'DOCTYPE html').

HTMLParser.handle_pi(data)

This method is called when a processing instruction is encountered. The data parameter will contain the entire processing instruction. For example, for the processing instruction <?proc color='red'>, this method would be called as handle_pi("proc color='red'"). It is intended to be overridden by a derived class; the base class implementation does nothing.

Note

The HTMLParser class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction using the trailing '?' will cause the '?' to be included in data.
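
The difference between the two processing-instruction styles can be checked with a short sketch. PICollector and instructions are hypothetical names; the stylesheet instruction is just sample input.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical subclass that stores every processing instruction."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


collector = PICollector()
collector.feed("<?proc color='red'>")              # SGML style, no trailing '?'
collector.feed('<?xml-stylesheet href="a.css"?>')  # XHTML style, trailing '?'
collector.close()
print(collector.instructions)

# One possible way to normalize both styles is to strip the trailing '?'.
normalized = [pi.rstrip("?") for pi in collector.instructions]
print(normalized)
```

Code that has to accept both styles can strip the trailing '?' as shown, since the parser itself passes it through unchanged.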

HTMLParser.unknown_decl(data)

This method is called when the parser reads an unrecognized declaration.

The data parameter will be the entire contents of the declaration inside the <![...]> markup. It is sometimes useful to override this method in a derived class. The base class implementation does nothing.

Examples

The following class implements a parser that will be used to illustrate more examples:

from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)

    def handle_endtag(self, tag):
        print("End tag  :", tag)

    def handle_data(self, data):
        print("Data     :", data)

    def handle_comment(self, data):
        print("Comment  :", data)

    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)

    def handle_decl(self, data):
        print("Decl     :", data)

parser = MyHTMLParser()

Parse a document type declaration:

>>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
...             '"http://www.w3.org/TR/html4/strict.dtd">')
Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title:

>>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
Start tag: img
     attr: ('src', 'python-logo.png')
     attr: ('alt', 'The Python logo')
>>>
>>> parser.feed('<h1>Python</h1>')
Start tag: h1
Data     : Python
End tag  : h1

The content of script and style elements is returned as is, without further parsing:

>>> parser.feed('<style type="text/css">#python { color: green }</style>')
Start tag: style
     attr: ('type', 'text/css')
Data     : #python { color: green }
End tag  : style

>>> parser.feed('<script type="text/javascript">'
...             'alert("<strong>hello!</strong>");</script>')
Start tag: script
     attr: ('type', 'text/javascript')
Data     : alert("<strong>hello!</strong>");
End tag  : script

Parsing comments:

>>> parser.feed('<!-- a comment -->'
...             '<!--[if IE 9]>IE-specific content<![endif]-->')
Comment  :  a comment
Comment  : [if IE 9]>IE-specific content<![endif]

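Parsing named and numeric character references and converting them to the correct characters (note: these 3 references are all equivalent to '>'). The reference handlers are only called when convert_charrefs is False, so the parser is recreated with that setting here:

>>> parser = MyHTMLParser(convert_charrefs=False)
>>> parser.feed('&gt;&#62;&#x3E;')
Named ent: >
Num ent  : >
Num ent  : >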

Feeding incomplete chunks to feed() works, but handle_data() might be called more than once (unless convert_charrefs is set to True):

>>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
...     parser.feed(chunk)
...
Start tag: span
Data     : buff
Data     : ered
Data     : text
End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works:

>>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
Start tag: p
Start tag: a
     attr: ('class', 'link')
     attr: ('href', '#main')
Data     : tag soup
End tag  : p
End tag  : a
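
Because the parser does not check that end tags match start tags, an application that needs such a check has to keep track of open elements itself. The following is only a sketch of one possible approach: TagBalanceChecker, open_tags, problems and the VOID_ELEMENTS set are hypothetical names, and the set is a deliberately partial list chosen for illustration.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (partial list, for illustration only).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical subclass that reports end tags closing the wrong element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of elements that are still open
        self.problems = []   # end tags that did not match the innermost element

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append(tag)


checker = TagBalanceChecker()
checker.feed("<p><a class=link href=#main>tag soup</p ></a>")
checker.close()
print("mismatched end tags:", checker.problems)
print("still open:", checker.open_tags)
```

For the tag soup above, the p end tag arrives while a is still the innermost open element, so it is reported as a problem and p is left on the stack. A real checker would also need rules for elements that HTML closes implicitly, which this sketch does not attempt.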