1 Crawler introduction
1.1 Crawler frameworks
Performance:
Concurrency schemes: asynchronous IO (gevent / Twisted / asyncio / aiohttp), custom asynchronous-IO modules, IO multiplexing (select); a minimal aiohttp sketch appears after this list
Scrapy framework
Introduction: Scrapy is a crawler framework built on Twisted (asynchronous IO), so using Scrapy means using Twisted underneath; reading the Scrapy source largely means reading Twisted-based code
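A minimal sketch, assuming aiohttp is installed, of the asynchronous-IO concurrency scheme listed above; the URLs are placeholders.

import asyncio
import aiohttp

async def fetch(session, url):
    # one non-blocking GET; control returns to the event loop while waiting
    async with session.get(url) as resp:
        return await resp.text()

async def main():
    urls = ['http://example.com', 'http://example.org']
    async with aiohttp.ClientSession() as session:
        # schedule all downloads concurrently instead of one after another
        pages = await asyncio.gather(*(fetch(session, u) for u in urls))
    for url, page in zip(urls, pages):
        print(url, len(page))

asyncio.run(main())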
1.2 Tornado framework (non-blocking, asynchronous)
Tornado basic usage (a minimal sketch follows this list)
Source analysis
Custom non-blocking asynchronous framework
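A minimal sketch of basic Tornado usage: a single non-blocking handler served on port 8888 (an arbitrary choice, not from the notes).

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        # handle GET / without blocking the IO loop
        self.write("Hello, Tornado")

app = tornado.web.Application([(r"/", MainHandler)])
app.listen(8888)
tornado.ioloop.IOLoop.current().start()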
2 Basic crawler operations
2.1 The requests module
import requests
from bs4 import BeautifulSoup

r = requests.get(url)  # url is assumed to be defined elsewhere
soup = BeautifulSoup(r.text, features="html.parser")
target = soup.find(id="xxx")
print(target)
* For more detail, see the course handout (Word document) on web crawlers and information acquisition.
2.2 The requests module in detail
2.2.1 Getting past anti-crawler checks
Manually set the User-Agent in the request headers, e.g.
'User-Agent': "Mozilla/5.0"
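A minimal sketch of spoofing the User-Agent so the request looks like it comes from a browser; the URL is a placeholder.

import requests

headers = {'User-Agent': 'Mozilla/5.0'}   # pretend to be a browser
r = requests.get('http://example.com', headers=headers)
print(r.status_code)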
2.2.2 Additional operations
- requests.post('url') / requests.get('url') accept the following parameters (a combined sketch follows this list):
url: the address to submit the request to
data: data sent in the request body as form-encoded key/value pairs; accepts a dict, string, or bytes, and dict values must be simple (non-nested) types; form-style data can be sent with either data or json
json: data sent in the request body serialized into a single JSON string; nested dicts can only be sent with json, not with data
params: parameters appended to the URL as a query string; the usual way to vary the parameters of a GET request
cookies: identify the user and track a session
headers: request headers; modify them to crawl sites with anti-crawler protection
files: upload files
auth: basic authentication; adds an encoded user name and password to the request headers
timeout: timeout for the request and the response
allow_redirects: whether to follow redirects, i.e. whether the crawler may end up at a URL other than the one it requested
proxies: send the request through a proxy
verify: whether to verify (or ignore) the SSL certificate
cert: client certificate file
requests.Session(): keeps client state (such as cookies) across requests
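A minimal sketch combining several of these parameters; the URL, fields, and credentials are placeholders.

import requests

session = requests.Session()                 # keeps cookies across requests
r = session.get(
    'http://example.com/search',
    params={'q': 'python'},                  # appended to the URL as ?q=python
    headers={'User-Agent': 'Mozilla/5.0'},   # spoofed request header
    timeout=5,                               # seconds before giving up
    allow_redirects=True,                    # follow redirects
    verify=True,                             # check the SSL certificate
)
r2 = session.post(
    'http://example.com/login',
    data={'user': 'xxx', 'pwd': 'xxx'},      # form-encoded request body
    # json={'user': 'xxx', 'pwd': 'xxx'},    # alternative: JSON request body
)
print(r.status_code, r2.status_code)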
- response = requests.get('url') / requests.post('url') return value (a short sketch follows this list):
response.text: the response body decoded as text
response.content: the raw response body (bytes, so it works for any content, not just text)
response.encoding: the encoding used to decode response.text
response.apparent_encoding: the encoding detected from the content itself; assign it to response.encoding to fix garbled (mojibake) text
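A minimal sketch of reading a response and fixing garbled text; the URL is a placeholder.

import requests

response = requests.get('http://example.com')
response.encoding = response.apparent_encoding   # use the detected encoding to avoid mojibake
print(response.encoding)
print(response.text[:200])      # decoded text
print(response.content[:200])   # raw bytes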
- Request header / request body
Referer header: records the page the request came from; some sites check it before returning data
- Two ways interactive data is submitted:
1. Sent directly as a message (e.g. an AJAX request); in the browser's Inspect - Network panel the page does not refresh;
2. Submitted through an HTML form; in the Inspect - Network panel the page refreshes after the data is uploaded.
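A minimal sketch of sending the same placeholder data both ways with requests: as an AJAX-style JSON message and as a form submission.

import requests

payload = {'user': 'xxx', 'pwd': 'xxx'}

# 1. message style: the data travels as JSON in the request body (like an AJAX call)
r1 = requests.post('http://example.com/api/login', json=payload)

# 2. form style: the data travels form-encoded, as an HTML form submission would
r2 = requests.post('http://example.com/login', data=payload)

print(r1.status_code, r2.status_code)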
2.3 Functionality provided by BeautifulSoup
The code below is annotated with single-line comments explaining each operation.
from bs4 import BeautifulSoup

# custom html used by all of the examples below
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<div><a href='http://www.cunzhang.com'>剥夺老师<p>asdf</p></a></div>
<a id='i1'>刘志超</a>
<div>
    <p>asdf</p>
</div>
<p>asdfffffffffff</p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, features="html.parser")

# unwrap: remove the tag itself but keep its contents
tag = soup.find('a')
v = tag.unwrap()
print(soup)

# wrap: enclose the current tag in a newly created tag, inserting it into the html
from bs4.element import Tag
obj1 = Tag(name='div', attrs={'id': 'it'})
obj1.string = '我是一个新来的'        # "I'm a newcomer"
tag = soup.find('a')
v = tag.wrap(obj1)
print(soup)

# append: move the <a> tag to the end of <body>'s contents, then print the result
tag = soup.find('body')
tag.append(soup.find('a'))
print(soup)

# insert_before / insert_after: insert a new tag before or after the current one
from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = '我是一个新来的'
tag = soup.find('body')
# tag.insert_before(obj)
tag.insert_after(obj)
print(soup)

# recursive: search all descendants (True, the default) or only direct children (False)
tag = soup.find('p', recursive=True)
print(tag)
tag = soup.find('body').find('p', recursive=False)
print(tag)

# get_text: extract the text of a tag
tag = soup.find('a')
v = tag.get_text()
print(v)

# attribute operations (re-parse the document so the first <a> still has its href)
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
tag.attrs['lover'] = '物理老师'       # "physics teacher"
del tag.attrs['href']
print(soup)

# children: direct children of <body>, both tags and text nodes
from bs4.element import Tag
tags = soup.find('body').children
for tag in tags:
    if type(tag) == Tag:
        print(tag, type(tag))
    else:
        print('text....')

# descendants: all descendants of <body>
tags = soup.find('body').descendants
print(list(tags))

# encode_contents: the contents of <body> as bytes (Chinese characters encoded to binary)
tag = soup.find('body')
print(tag.encode_contents())

# decode_contents: the contents of <body> decoded back to a string
print(tag.decode_contents())
# print(str(tag))