Python urllib, XML and HTMLParser

Reference Links: https://www.liaoxuefeng.com/wiki/1016959663602400/1019223241745024

The built-in modules urllib Python provides a range of methods for operating a url

Get

　　The request urllib can easily crawl the content URL, the GET request is sent to a specified page, and then returns the HTTP response

　　May also mimic the browser sends a GET request, the request object required, by adding the HTTP request object to the request header, the request can be disguised as a browser

Post

　　We need to pass parameters to bytes type

Handler

　　If you require more complex control, adding to the need to visit the Web site through a proxy, you need proxyhandler module

urllib functionality is provided by a variety of procedures are completed HTTP request, if needed to mimic the browser to complete a specific function, you need to request disguised as a browser request, camouflage method is to monitor the request sent by the browser, and then depending on the browser head camouflage's request, User-Agent header is used to identify a browser

XML

　　Reference Links: https://www.liaoxuefeng.com/wiki/1016959663602400/1017784095418144

　　There are two ways to manipulate XML: DOM and SAX. The entire XML DOM will read into memory, parsed into a tree, so they use large memory, analytical slow, the advantage of any node can traverse the tree. SAX is a streaming mode, while being read resolution, small memory, fast parsing, the disadvantage is that we need to deal with their own events.

　　Under normal circumstances, give priority to SAX, DOM because it takes up too much memory.

　　Note that this value using attrs

　　When a node SAX parser reads:

<a href="/">python</a>

　　It will have three events:

start_element events, reading <a href="/">time;
char_data events, reading pythontime;
end_element events, reading </a>time.

　　What event is it?

Import ParserCreate xml.parsers.expat from 
class DefaultSaxHandler (Object): 
    DEF START_ELEMENT (Self, name, attrs): 
        Print ( 'SAX: START_ELEMENT:% S, attrs: S%'% (name, STR (attrs))) # here you can write when the event to take place when reading this 
    DEF END_ELEMENT (Self, name): 
        Print ( 'SAX: END_ELEMENT:% S'% name) # here you can write event when read this to happen 
    def char_data (self , text): 
        Print ( 'SAX: char_data:% S'% text) # here you can write when the event to take place when reading this 
xml = r '' '<? xml Version = "1.0"?> 
<OL> 
    < Li> <a href="/python"> the Python </a> </ Li> 
    <Li> <a href="/ruby"> the Ruby </a> </ Li> 
</ OL> 
'' '
handler=DefaultSaxHandler()
parser=ParserCreate()
parser.StartElementHandler=handler.start_element
parser.EndElementHandler=handler.end_element
parser.CharacterDataHandler=handler.char_data
parser.Parse(xml)
#输出
sax:start_element:ol,attrs:{}
sax:char_data:

sax:char_data:
sax:start_element:li,attrs:{}
sax:start_element:a,attrs:{'href': '/python'}
sax:char_data:Python
sax:end_element:a
sax:end_element:li
sax:char_data:

sax:char_data:
sax:start_element:li,attrs:{}
sax:start_element:a,attrs:{'href': '/ruby'}
sax:char_data:Ruby
sax:end_element:a
sax:end_element:li
sax:char_data:

sax:end_element:ol

　　Note that when reading a large segment of string, CharacterDataHandlermay be called multiple times, so it is necessary to save themselves up in the EndElementHandlerrecombined inside.

　　In addition to parsing XML, the XML how to generate it? Need to generate XML structure are very simple case of 99%, therefore, the simplest and most effective method is to generate an XML string concatenation:

　　If you want to generate complex XML it? I suggest you do not use XML, into JSON.

　　When parsing XML, pay attention to identify the node of interest, response to an event, the saved node data. After the parsing is complete, the data can be processed.

　　For example, you can parse XML data obtained related to the city's weather information (omitted)

HTMLParser

　　Reference Links: https://www.liaoxuefeng.com/wiki/1016959663602400/1017784593019776

　　When we crawl the web, the next step is to parse the HTML page to see what's inside, in the end is a picture, video or text.

　　Use HTMLParser, can parse out the page of text, images and so on.

　　XML is a subset of HTML on nature, but no XML syntax requirements less stringent, it is not through the standard DOM or SAX to parse HTML

　　Fortunately, Python provides HTMLParser to easily parse HTML, simply a few lines of code:

　　Note that this value using attrs

from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        print('<%s>' % tag)

    def handle_endtag(self, tag):
        print('</%s>' % tag)

    def handle_startendtag(self, tag, attrs):
        print('<%s/>' % tag)#如img标签<img src="",alt=""/>

    def handle_data(self, data):
        print(data)

    def handle_comment(self, data):
        print('<!--', data, '-->')

    def handle_entityref(self, name):#解析特殊字符
        print('&%s;' % name)

    def handle_charref(self, name):# Analytical special character 
        print ( '& #% s;

parser = MyHTMLParser()
parser.feed('''<html>
<head></head>
<body>
<!-- test html parser -->
    <p>Some <a href=\"#\">html</a> HTML tutorial...<br>END</p>
</body></html>''')
#输出
(sort) λ python fortest.py
<html>


<head>
</head>


<body>


<!--  test html parser  -->


<p>
Some
<a>
html
</a>
 HTML tutorial...
<br>
END
</p>


</body>
</html>

　　feed()Method can be called multiple times, which is not necessarily a whole string of HTML are to go into, you can go into a part of a part.

　　But this is how to identify specific id label it? , Pay attention to this value using attrs

def find_id(self,id_name,attrs):
        for i in attrs:
            if id_name in i:
                return True
        return False

    def handle_starttag(self, tag, attrs):
        if self.find_id('test1',attrs):
            print('<%s%s>' %(tag,str(attrs)))
    pass
    
    parser.feed('''<html>
<head></head>
<body>
<!-- test html parser -->
    <p class='test' id='test1'>Some <a href=\"#\">html</a> HTML tutorial...<br>END</p>
</body></html>''')
   
#输出
pass
<p[('class', 'test'), ('id', 'test1')]>
pass

　　Special characters, there are two, one is the English representation of  a digital representation of Ӓthese two characters can be parsed by Parser.