Working with HTML is a common programming task. This post shows how to parse HTML in Python 3.x with the HTMLParser class from the html.parser module.
The HTMLParser class and its common methods
Standard library definition
- HTMLParser is used to parse HTML text, including HTML that contains invalid markup.
- The convert_charrefs parameter controls whether character references are automatically converted to the corresponding Unicode characters (except inside script and style elements). Since Python 3.5 it defaults to True.
- You feed HTML content to an HTMLParser instance. As it encounters tags and text, it automatically calls the corresponding handler methods. To do anything useful, create your own subclass of HTMLParser and override the handlers you need.
- HTMLParser does not check whether start tags and end tags are properly paired.
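The points above can be illustrated with a minimal sketch. The class name EventLogger and the helper parse_events are hypothetical names introduced only for this illustration; the sketch records which handlers fire for a mismatched pair of tags and for a character reference, with convert_charrefs switched on and off.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical subclass that records each handler call as a tuple."""

    def __init__(self, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(html, convert_charrefs=True):
    """Parse html and return the list of recorded handler events."""
    parser = EventLogger(convert_charrefs=convert_charrefs)
    parser.feed(html)
    parser.close()
    return parser.events


# <b> is never closed and </i> was never opened; the parser reports
# both tags to the handlers without complaining.
mismatched = parse_events("<p><b>bold</i></p>")

# With convert_charrefs=True, "&amp;" reaches handle_data as "&".
converted = parse_events("<p>a &amp; b</p>")

# With convert_charrefs=False, the reference is not folded into the text
# passed to handle_data.
raw = parse_events("<p>a &amp; b</p>", convert_charrefs=False)

print(mismatched)
print(converted)
print(raw)
```

If tag pairing matters for your use case, the subclass has to track it itself, for example with a stack of open tag names.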
Common methods
- HTMLParser.feed(data): accepts HTML content as a string and parses it.
- HTMLParser.close(): forces processing of any buffered data as if it were followed by an end-of-file mark. A subclass may override this method, but the override should always call the base class HTMLParser.close().
- HTMLParser.reset(): resets the HTMLParser instance. Any unprocessed HTML content is lost.
- HTMLParser.getpos(): returns the current line number and offset.
- HTMLParser.handle_starttag(tag, attrs): called to handle a start tag. For example, for <div id="main"> the tag parameter is "div" and attrs is a list of (name, value) pairs.
- HTMLParser.handle_endtag(tag): called to handle an end tag. For example, for </div> the tag parameter is "div".
- HTMLParser.handle_data(data): called to handle the text between tags. For <tag>test</tag>, data is "test".
- HTMLParser.handle_comment(data): called to handle an HTML comment; data is the text of the comment.
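The methods listed above can be exercised together in a short sketch. TracingParser and the sample markup are hypothetical, chosen only to show what each handler receives, what getpos() reports inside a handler, and how an overridden close() delegates to the base class.

```python
from html.parser import HTMLParser


class TracingParser(HTMLParser):
    """Hypothetical subclass that logs every callback it receives."""

    def __init__(self):
        super().__init__()
        self.log = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; getpos() gives (line, offset).
        self.log.append(("start", tag, dict(attrs), self.getpos()))

    def handle_endtag(self, tag):
        self.log.append(("end", tag))

    def handle_data(self, data):
        self.log.append(("data", data))

    def handle_comment(self, data):
        self.log.append(("comment", data))

    def close(self):
        # Always call the base class close() first when overriding it.
        super().close()
        self.log.append(("closed",))


parser = TracingParser()
parser.feed('<div ID="main">\n<!-- note -->\n<p>test</p>\n</div>')
parser.close()

for entry in parser.log:
    print(entry)

# reset() puts the parser back in its initial state; the position restarts.
parser.reset()
print(parser.getpos())
```

Note that the attribute name written as ID in the markup arrives in attrs as "id", because the parser lowercases tag and attribute names.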
Example application
- File to parse: http://files.cnblogs.com/files/AlwinXu/Scan_TFS.zip
- Code
__author__ = 'xua'

import json
# For Python 3.x
from html.parser import HTMLParser


# Define a subclass of HTMLParser in order to override its handler methods
class MyHTMLParser(HTMLParser):
    # Constructor: self.data is a list that stores the text found in the HTML
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = []

    # Override handle_starttag; some printing could be done here
    def handle_starttag(self, tag, attrs):
        pass
        # print("Start Tag: ", tag)
        # for attr in attrs:
        #     print(attr)

    # Override handle_endtag
    def handle_endtag(self, tag):
        pass

    # Override handle_data to process the text extracted from the HTML;
    # here, text without a newline is saved in self.data
    def handle_data(self, data):
        if data.count('\n') == 0:
            self.data.append(data)


# Read a local HTML file. (urlopen from urllib.request could also be used
# to fetch and read a web page; that is not covered here.)
with open(r"/Users/xualvin/Downloads/TFS.htm", 'r') as htmlFile:
    content = htmlFile.read()

# Create an instance of the subclass
parser = MyHTMLParser()
# Pass the HTML to the parser for parsing
parser.feed(content)

# Process the parsed data and print the results
for item in parser.data:
    if item.startswith("{\"columns\""):
        payloadDict = json.loads(item)
        rows = payloadDict["payload"]["rows"]
        for backlog in rows:
            if backlog[1] == "Product Backlog Item" or backlog[1] == "Bug":
                print(backlog[2], " Point: ", backlog[3])
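The script above depends on a local copy of TFS.htm. To try the same idea without that file, here is a self-contained sketch that builds a small HTML string in memory. The sample payload, its column names, and the names DataCollector and extract_backlog are all assumptions made for illustration; only the JSON shape (a top-level "columns" key, rows under payload.rows, with type, title and points at indexes 1, 2 and 3) is inferred from the code above, not taken from the real file.

```python
import json
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Collects text nodes that contain no newline, as in the example above."""

    def __init__(self):
        super().__init__()
        self.data = []

    def handle_data(self, data):
        if "\n" not in data:
            self.data.append(data)


def extract_backlog(html):
    """Return (title, points) for each Product Backlog Item or Bug row."""
    parser = DataCollector()
    parser.feed(html)
    parser.close()
    results = []
    for item in parser.data:
        if item.startswith('{"columns"'):
            rows = json.loads(item)["payload"]["rows"]
            for row in rows:
                if row[1] in ("Product Backlog Item", "Bug"):
                    results.append((row[2], row[3]))
    return results


# Hypothetical payload; the real TFS page may use different columns.
sample_payload = {
    "columns": ["ID", "Work Item Type", "Title", "Effort"],
    "payload": {
        "rows": [
            [1, "Product Backlog Item", "Login page", 5],
            [2, "Task", "Write tests", 2],
            [3, "Bug", "Crash on save", 3],
        ]
    },
}
sample_html = (
    "<html><body><script type='application/json'>"
    + json.dumps(sample_payload)
    + "</script></body></html>"
)

backlog = extract_backlog(sample_html)
for title, points in backlog:
    print(title, " Point: ", points)
```

Returning the matches from a function instead of printing inside the loop makes the filtering step easy to check on its own.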