Reference Links: https://www.liaoxuefeng.com/wiki/1016959663602400/1019223241745024
Python's built-in urllib module provides a range of methods for operating on URLs.
Get
With urllib's request module it is easy to fetch the content of a URL: send a GET request to a specified page and receive the HTTP response back.
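A minimal sketch of a GET request with urllib.request. To keep the example runnable without network access, it fetches a local file:// URL; with a real site you would pass an http(s):// URL instead (the file name and content here are just illustrative):

```python
import os
import tempfile
from urllib import request

# Write a small local file so this sketch runs offline; in practice
# you would pass an http(s):// URL such as 'https://example.com/'.
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.html', delete=False)
tmp.write('<html><body>hello</body></html>')
tmp.close()

url = 'file://' + tmp.name  # stand-in for a real web URL
with request.urlopen(url) as f:
    data = f.read().decode('utf-8')  # the response body as a string
print('Data:', data)
os.unlink(tmp.name)
```

For an HTTP response you could additionally inspect `f.status` and `f.getheaders()` before reading the body.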
We can also mimic a browser sending a GET request. This requires a Request object: by adding HTTP headers to the Request object, we can disguise the request as coming from a browser.
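A sketch of disguising a request with a browser-like User-Agent header; the URL and the User-Agent string are only illustrative examples:

```python
from urllib import request

# Build a Request object and attach a browser-like User-Agent header
# (the header value below is an illustrative example, not a requirement).
req = request.Request('http://www.example.com/')
req.add_header('User-Agent',
               'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) '
               'AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Safari/8536.25')

# urllib stores header keys capitalized, hence 'User-agent' here.
print(req.get_header('User-agent'))
# request.urlopen(req) would now send the disguised request
```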
Post
To send a POST request, we need to pass the parameters as bytes.
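A sketch of preparing POST data: URL-encode the form fields, then convert the string to bytes. The URL and field names are hypothetical:

```python
from urllib import parse, request

# URL-encode the form parameters, then convert str -> bytes,
# which is what urlopen requires for POST data.
params = parse.urlencode({'username': 'bob', 'password': '123456'})
data = params.encode('utf-8')

req = request.Request('http://www.example.com/login')
req.add_header('Content-Type', 'application/x-www-form-urlencoded')
print(data)
# request.urlopen(req, data=data) would perform the POST
```

Passing a `data` argument to `urlopen()` is what switches the request method from GET to POST.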
Handler
If more complex control is needed, for example accessing a website through a proxy, we need the ProxyHandler.
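A sketch of routing requests through a proxy with ProxyHandler; the proxy address is a placeholder assumption:

```python
from urllib import request

# Route HTTP requests through a proxy (address is a placeholder).
proxy_handler = request.ProxyHandler({'http': 'http://127.0.0.1:8080/'})
opener = request.build_opener(proxy_handler)

# opener.open('http://www.example.com/') would go through the proxy;
# request.install_opener(opener) would make it the default for urlopen().
print(type(opener).__name__)
```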
The functionality urllib provides is to perform all kinds of HTTP requests programmatically. If you need to mimic a browser to accomplish a specific task, you must disguise the request as a browser request. The way to do this is to monitor the requests the browser sends and then forge the same request headers; the User-Agent header is what identifies the browser.
XML
Reference Links: https://www.liaoxuefeng.com/wiki/1016959663602400/1017784095418144
There are two ways to manipulate XML: DOM and SAX. DOM reads the entire XML document into memory and parses it into a tree, so it uses a lot of memory and parses slowly; the advantage is that you can traverse to any node of the tree. SAX is a streaming mode that parses while reading, so it uses little memory and parses quickly; the disadvantage is that you have to handle the events yourself.
Under normal circumstances, prefer SAX, because DOM takes up too much memory.
Note the use of the attrs value here.
When the SAX parser reads a node such as

<a href="/">python</a>

it generates three events:

- a start_element event, when reading <a href="/">;
- a char_data event, when reading python;
- an end_element event, when reading </a>.

Which events does the XML below produce?
```python
from xml.parsers.expat import ParserCreate

class DefaultSaxHandler(object):
    def start_element(self, name, attrs):
        # runs each time a start tag is read
        print('sax:start_element: %s, attrs: %s' % (name, str(attrs)))

    def end_element(self, name):
        # runs each time an end tag is read
        print('sax:end_element: %s' % name)

    def char_data(self, text):
        # runs each time text content is read
        print('sax:char_data: %s' % text)

xml = r'''<?xml version="1.0"?>
<ol>
    <li><a href="/python">Python</a></li>
    <li><a href="/ruby">Ruby</a></li>
</ol>
'''

handler = DefaultSaxHandler()
parser = ParserCreate()
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element
parser.CharacterDataHandler = handler.char_data
parser.Parse(xml)
```

Output:

```
sax:start_element: ol, attrs: {}
sax:char_data:
sax:char_data:
sax:start_element: li, attrs: {}
sax:start_element: a, attrs: {'href': '/python'}
sax:char_data: Python
sax:end_element: a
sax:end_element: li
sax:char_data:
sax:char_data:
sax:start_element: li, attrs: {}
sax:start_element: a, attrs: {'href': '/ruby'}
sax:char_data: Ruby
sax:end_element: a
sax:end_element: li
sax:char_data:
sax:end_element: ol
```
Note that when reading a large block of text, CharacterDataHandler may be called multiple times, so you need to accumulate the pieces yourself and recombine them in EndElementHandler.
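A minimal sketch of this accumulation pattern for flat elements (the class and attribute names are illustrative; nested elements would need a stack):

```python
from xml.parsers.expat import ParserCreate

class TextCollector(object):
    def __init__(self):
        self._buf = []   # pieces of the current element's text
        self.texts = []  # completed texts, one per element

    def start_element(self, name, attrs):
        self._buf = []  # reset the buffer for the new element

    def char_data(self, text):
        self._buf.append(text)  # may be called several times per element

    def end_element(self, name):
        self.texts.append(''.join(self._buf))  # recombine the pieces

handler = TextCollector()
parser = ParserCreate()
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element
parser.CharacterDataHandler = handler.char_data
parser.Parse(r'<p>hello world</p>', True)
print(handler.texts)
```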
In addition to parsing XML, how do we generate XML? In 99% of cases, the structure to generate is very simple, so the simplest and most effective method is to build the XML by string concatenation:
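A minimal sketch of generating simple XML this way; `escape()` is a tiny illustrative helper for escaping special characters, not a library function:

```python
# Escape the characters that are special in XML text content.
def escape(s):
    return s.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')

# Build the document as a list of fragments, then join once.
L = []
L.append('<?xml version="1.0"?>')
L.append('<root>')
L.append(escape('some & data'))
L.append('</root>')
xml = ''.join(L)
print(xml)
```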
What if you want to generate complex XML? In that case, I suggest you not use XML at all and switch to JSON instead.
When parsing XML, identify the nodes of interest, respond to the events, and save the node data. After parsing is complete, the data can be processed.
For example, you could parse XML data into weather information for a given city (example omitted).
HTMLParser
Reference Links: https://www.liaoxuefeng.com/wiki/1016959663602400/1017784593019776
After we crawl a web page, the next step is to parse the HTML page to see what is inside: images, video, or text.
Using HTMLParser, we can parse out the text, images, and so on from the page.
HTML is essentially a subset of XML, but HTML's syntax requirements are less strict than XML's, so it cannot be parsed with the standard DOM or SAX approaches.
Fortunately, Python provides HTMLParser to parse HTML easily, with just a few lines of code:
```python
from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print('<%s>' % tag)

    def handle_endtag(self, tag):
        print('</%s>' % tag)

    def handle_startendtag(self, tag, attrs):
        # self-closing tags such as <img src="" alt=""/>
        print('<%s/>' % tag)

    def handle_data(self, data):
        print(data)

    def handle_comment(self, data):
        print('<!--', data, '-->')

    def handle_entityref(self, name):
        # named character references such as &nbsp;
        print('&%s;' % name)

    def handle_charref(self, name):
        # numeric character references such as &#1234;
        print('&#%s;' % name)

parser = MyHTMLParser()
parser.feed('''<html>
<head></head>
<body>
<!-- test html parser -->
    <p>Some <a href=\"#\">html</a> HTML tutorial...<br>END</p>
</body></html>''')
```

Output (whitespace-only data lines omitted):

```
<html>
<head>
</head>
<body>
<!-- test html parser -->
<p>
Some
<a>
html
</a>
 HTML tutorial...
<br>
END
</p>
</body>
</html>
```
The feed() method can be called multiple times, so the HTML does not have to be passed in as one whole string; you can feed it in piece by piece.
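A sketch of this chunked feeding: the parser keeps internal state between calls, so HTML arriving in pieces (for example, while downloading) still parses correctly. The class and chunk boundaries are illustrative:

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []  # start tags seen, in order

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

parser = TagCollector()
# The chunks even split a tag in the middle; feed() buffers the
# incomplete part and continues where the last call stopped.
for chunk in ['<html><bo', 'dy><p>hi</p', '></body></html>']:
    parser.feed(chunk)
print(parser.tags)
```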
But how do we identify a tag with a specific id? Pay attention to the attrs value here.
```python
class MyHTMLParser(HTMLParser):
    def find_id(self, id_name, attrs):
        # attrs is a list of (name, value) tuples, e.g. [('id', 'test1')]
        for i in attrs:
            if id_name in i:
                return True
        return False

    def handle_starttag(self, tag, attrs):
        if self.find_id('test1', attrs):
            print('<%s%s>' % (tag, str(attrs)))

parser = MyHTMLParser()
parser.feed('''<html>
<head></head>
<body>
<!-- test html parser -->
    <p class='test' id='test1'>Some <a href=\"#\">html</a> HTML tutorial...<br>END</p>
</body></html>''')
```

Output:

```
<p[('class', 'test'), ('id', 'test1')]>
```
There are two kinds of special characters: named references written with an English name, such as &nbsp;, and numeric references, such as &#1234; (which displays as Ӓ). Both kinds can be parsed by HTMLParser.
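A sketch showing both kinds of references triggering their own handlers. Note one assumption worth flagging: in Python 3.5+, HTMLParser defaults to `convert_charrefs=True`, which converts references to text before `handle_entityref`/`handle_charref` can fire, so the sketch passes `convert_charrefs=False` explicitly:

```python
from html.parser import HTMLParser

class RefEcho(HTMLParser):
    def __init__(self):
        # convert_charrefs=False keeps references as separate events;
        # with the default True, the handlers below would never run.
        super().__init__(convert_charrefs=False)
        self.seen = []

    def handle_entityref(self, name):
        # named reference: name arrives without '&' and ';'
        self.seen.append('&%s;' % name)

    def handle_charref(self, name):
        # numeric reference: name arrives without '&#' and ';'
        self.seen.append('&#%s;' % name)

parser = RefEcho()
parser.feed('a&nbsp;b&#1234;c')
print(parser.seen)
```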