Parsing HTML in Python with HTMLParser

Parsing HTML is a common programming task. This post uses the HTMLParser class from the html.parser module in Python 3.x to parse HTML.

The HTMLParser class: definition and commonly used methods

  • Standard library definitions

    1. HTMLParser is mainly used to parse HTML files, including files that contain invalid markup
    2. The convert_charrefs parameter controls whether all character references are automatically converted to the corresponding Unicode characters; since Python 3.5 the default is True
    3. HTMLParser is fed HTML content and, as it encounters tags and other markup, automatically calls the corresponding handler method. To act on these events, create your own subclass of HTMLParser and override the handler methods you need
    4. HTMLParser does not check whether start tags and end tags are paired
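
Notes 2 and 4 above can be checked with a small experiment. The following is a minimal sketch, not part of the original post: EventLogger and parse_events are hypothetical names introduced only for illustration, and the HTML snippets are made up.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records every parser event as a tuple."""

    def __init__(self, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Only called when convert_charrefs is False.
        self.events.append(("entityref", name))


def parse_events(html, convert_charrefs=True):
    """Feed html to a fresh EventLogger and return the recorded events."""
    parser = EventLogger(convert_charrefs=convert_charrefs)
    parser.feed(html)
    parser.close()
    return parser.events


# Mismatched tags are reported as they appear; no error is raised.
unpaired = parse_events("<p><b>bold</p>")

# With convert_charrefs=True the reference arrives decoded inside the data.
decoded = parse_events("<p>a &amp; b</p>")

# With convert_charrefs=False the reference is reported as a separate event.
raw = parse_events("<p>a &amp; b</p>", convert_charrefs=False)

print(unpaired)
print(decoded)
print(raw)
```

In the first call the <b> tag is never closed, yet the parser simply reports the events in order. Any pairing check therefore has to be done by the subclass itself, for example with a stack of open tags.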
  • Common methods

    1. HTMLParser.feed(data): takes HTML content as a string and parses it
    2. HTMLParser.close(): forces processing of all buffered data as if it were followed by an end-of-file mark. If you override this method in a subclass, call the HTMLParser base class close() from your override
    3. HTMLParser.reset(): resets the HTMLParser instance; any unprocessed HTML content is lost
    4. HTMLParser.getpos(): returns the current line number and offset
    5. HTMLParser.handle_starttag(tag, attrs): handles a start tag. For example, for <div id="main">, tag is "div" and attrs is a list of (name, value) pairs
    6. HTMLParser.handle_endtag(tag): handles an end tag. For example, for </div>, tag is "div"
    7. HTMLParser.handle_data(data): handles the text between tags. For <tag>test</tag>, data is "test"
    8. HTMLParser.handle_comment(data): handles HTML comments
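
To see what each handler actually receives, the methods listed above can be exercised on a tiny document. This is only an illustrative sketch: TagTracer is a hypothetical subclass name and the HTML string is invented for the demonstration.

```python
from html.parser import HTMLParser


class TagTracer(HTMLParser):
    """Hypothetical subclass that records what each handler receives."""

    def __init__(self):
        super().__init__()
        self.trace = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; attribute names are lowercased.
        # getpos() gives the (line, offset) where this tag starts.
        self.trace.append(("start", tag, dict(attrs), self.getpos()))

    def handle_endtag(self, tag):
        self.trace.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.trace.append(("data", data.strip()))

    def handle_comment(self, data):
        self.trace.append(("comment", data.strip()))


tracer = TagTracer()
tracer.feed('<div ID="main">\n<!-- note -->\n<p>test</p>\n</div>')
tracer.close()

for event in tracer.trace:
    print(event)
```

Note that the attribute name ID reaches handle_starttag as "id", and that getpos() counts lines from 1 and offsets from 0.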

Example application

  • File to parse: http://files.cnblogs.com/files/AlwinXu/Scan_TFS.zip
  • Code
__author__ = 'xua'

import json

# For Python 3.x
from html.parser import HTMLParser


# Subclass HTMLParser so that its handler methods can be overridden
class MyHTMLParser(HTMLParser):

    # Constructor: define the data list that stores the text found in the HTML
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = []

    # Override handle_starttag; it could print the tag and its attributes
    def handle_starttag(self, tag, attrs):
        pass
        # print("Start Tag: ", tag)
        # for attr in attrs:
        #     print(attr)

    # Override handle_endtag
    def handle_endtag(self, tag):
        pass

    # Override handle_data to process the text from the HTML; here, text
    # that contains no newline is saved in the data list
    def handle_data(self, data):
        if data.count('\n') == 0:
            self.data.append(data)


# Read a local HTML file. (urlopen from urllib.request could also be used to
# fetch and read a web page; that is not covered here.)
with open(r"/Users/xualvin/Downloads/TFS.htm", 'r') as htmlFile:
    content = htmlFile.read()

# Create an instance of the subclass
parser = MyHTMLParser()

# Pass the HTML to the parser
parser.feed(content)

# Process the collected data and print the results
for item in parser.data:
    if item.startswith("{\"columns\""):
        payloadDict = json.loads(item)
        rows = payloadDict["payload"]["rows"]
        for backlog in rows:
            if backlog[1] == "Product Backlog Item" or backlog[1] == "Bug":
                print(backlog[2], "       Point: ", backlog[3])
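
The script above depends on a local copy of the TFS page. To try the same technique without that file, here is a self-contained sketch that runs on an inline string. SAMPLE_HTML, DataCollector and extract_backlog are hypothetical names, and the sample data is invented: the column layout (type at index 1, title at index 2, points at index 3) is only inferred from the indices the original code uses, and the real page may embed its JSON differently.

```python
import json
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Collects text nodes that contain no newline, as the script above does."""

    def __init__(self):
        super().__init__()
        self.data = []

    def handle_data(self, data):
        if "\n" not in data:
            self.data.append(data)


# Hypothetical stand-in for the TFS page; the real file's layout may differ.
SAMPLE_HTML = (
    "<html><body>\n"
    '<script type="application/json">'
    '{"columns": ["ID", "Type", "Title", "Points"], '
    '"payload": {"rows": [[1, "Bug", "Fix login", 3], '
    '[2, "Task", "Write docs", 1], '
    '[3, "Product Backlog Item", "Add search", 5]]}}'
    "</script>\n"
    "</body></html>"
)


def extract_backlog(html):
    """Return (title, points) pairs for backlog items and bugs."""
    parser = DataCollector()
    parser.feed(html)
    parser.close()
    results = []
    for item in parser.data:
        if item.startswith('{"columns"'):
            rows = json.loads(item)["payload"]["rows"]
            results.extend(
                (row[2], row[3])
                for row in rows
                if row[1] in ("Product Backlog Item", "Bug")
            )
    return results


backlog = extract_backlog(SAMPLE_HTML)
for title, points in backlog:
    print(title, "       Point: ", points)
```

Returning the pairs from a function instead of printing them inside the loop makes the filtering step easier to check against other inputs.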

Origin www.cnblogs.com/jessitommy/p/11076353.html