Web Crawler and Tornado

1 Crawler introduction

1.1 Crawler frameworks

Performance:

  Concurrency schemes: asynchronous IO (gevent / Twisted / asyncio / aiohttp), custom asynchronous IO modules, IO multiplexing (select) — sketched below.
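A minimal sketch of the asyncio + aiohttp concurrency scheme mentioned above; the URLs are placeholders, not from the original notes.

import asyncio
import aiohttp

async def fetch(session, url):
    # each request yields control to the event loop while waiting on IO
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = ["http://example.com", "http://example.org"]  # placeholder targets
    async with aiohttp.ClientSession() as session:
        # gather schedules all requests concurrently on one event loop
        pages = await asyncio.gather(*(fetch(session, u) for u in urls))
        for url, page in zip(urls, pages):
            print(url, len(page))

asyncio.run(main())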

Scrapy framework

  Scrapy is a crawler framework built on Twisted (asynchronous IO); using Scrapy means using Twisted underneath, so reading the Scrapy source is also an introduction to Twisted.
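A minimal Scrapy spider sketch for orientation; the spider name and the practice site quotes.toscrape.com are illustrative assumptions, not from the original notes.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]  # placeholder practice site

    def parse(self, response):
        # response.css is Scrapy's selector API; Twisted drives the IO underneath
        for quote in response.css("div.quote span.text::text"):
            yield {"text": quote.get()}

Run it with: scrapy runspider quotes_spider.py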

1.2 Tornado framework (asynchronous, non-blocking)

Basic Tornado usage

Source code analysis

Writing a custom non-blocking asynchronous framework
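A minimal sketch of basic Tornado usage; the handler name and port 8888 are illustrative.

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello, Tornado")  # respond without blocking the loop

def make_app():
    return tornado.web.Application([(r"/", MainHandler)])

if __name__ == "__main__":
    app = make_app()
    app.listen(8888)                         # start listening, non-blocking
    tornado.ioloop.IOLoop.current().start()  # single-threaded event loop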

 

2. Basic crawler operations

2.1 The requests module

import requests
from bs4 import BeautifulSoup

response = requests.get(url)    # url is the page to crawl

soup = BeautifulSoup(response.text, features="html.parser")

target = soup.find(id="xxx")    # find the tag with id="xxx"

print(target)

* For more on crawler frameworks, see the courseware Word document "Web crawlers and information access."

 

2.2 requests in detail

2.2.1 Getting past the anti-crawler firewall

Manually set the User-Agent in the headers, for example:

'User-Agent': "Mozilla/5.0"
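A minimal sketch, assuming the target site only checks the User-Agent; the URL is a placeholder.

import requests

headers = {"User-Agent": "Mozilla/5.0"}   # pretend to be a regular browser
response = requests.get("http://example.com", headers=headers)
print(response.status_code)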

 

2.2.2 Additional operations

- requests.post('url') / requests.get('url') accept the following parameters (a combined sketch follows this list):

url: the address to submit the request to

data: data carried in the request body, sent as individual key/value pairs; the body can be a dict or a string. A flat dict of key/value pairs can be sent with either data or json.

json: data carried in the request body, serialized into a single string; nested dicts (a dict inside a dict) can only be sent with json.

params: parameters passed on the URL as the query string.

cookies: identify the user; used for session tracking.

headers: the request headers; modify them to crawl sites behind an anti-crawler firewall.

files: file uploads.

auth: adds an encoded username and password to the headers (basic authentication).

timeout: timeout for the request and the response.

allow_redirects: whether to follow redirects, i.e. crawling beyond a single target URL.

proxies: proxy servers.

verify: whether to verify (or ignore) the SSL certificate.

cert: the certificate file.

requests.Session(): preserves the client's access history (cookies etc.) across requests.
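A sketch exercising several of the parameters listed above; the URL and query values are placeholders.

import requests

session = requests.Session()      # Session keeps cookies across requests
response = session.get(
    "http://example.com/search",
    params={"q": "python"},               # appended to the URL: ?q=python
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=5,                            # seconds to wait before giving up
    allow_redirects=True,                 # follow redirects to the final page
    verify=True,                          # verify the SSL certificate
)
print(response.url)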

 

- response = requests.post() return value

response.url: the address that was requested

response.text: the response body decoded as text

response.content: the raw response body in bytes (works for any content, not just text)

response.encoding: the encoding used to decode the body

response.apparent_encoding: the encoding detected from the content itself; use it to fix garbled text
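A minimal sketch of fixing garbled text with apparent_encoding; the URL is a placeholder.

import requests

response = requests.get("http://example.com")
response.encoding = response.apparent_encoding  # switch to the detected encoding
print(response.text)                            # now decodes correctly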

 

- Referer in the request headers

The Referer header saves the address of the previous request; some sites check it before serving data.
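A minimal sketch, assuming the site checks the Referer header; both URLs are placeholders.

import requests

headers = {
    "Referer": "http://example.com/login",   # the page we claim to come from
    "User-Agent": "Mozilla/5.0",
}
response = requests.get("http://example.com/data", headers=headers)
print(response.status_code)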

 

- Submitting interactive data

1. Sent directly as a message payload (AJAX style): in the browser's developer tools (Network tab) you can watch the data being sent while the page does not refresh;

2. Submitted through an HTML form: in the Network tab you can see the page refresh after the data is uploaded.
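A sketch contrasting the two submission styles above; the URL is a placeholder.

import requests

# 1. AJAX style: a JSON body, sent with Content-Type: application/json
requests.post("http://example.com/api", json={"user": {"name": "alex"}})

# 2. Form style: key/value pairs, sent with Content-Type: application/x-www-form-urlencoded
requests.post("http://example.com/form", data={"name": "alex", "pwd": "123"})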

 

2.3 Functionality provided by BeautifulSoup

 To aid understanding, the code below is annotated with single-line comments.

from bs4 import BeautifulSoup

# custom HTML
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
    <div><a href='http://www.cunzhang.com'>剥夺老师<p>asdf</p></a></div>
    <a id='i1'>刘志超</a>
    <div>
        <p>asdf</p>
    </div>
    <p>asdfffffffffff</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="html.parser")

tag = soup.find('a')
v = tag.unwrap()    # unwrap: remove the <a> tag itself but keep its contents
print(soup)

from bs4.element import Tag
obj1 = Tag(name='div', attrs={'id': 'it'})
obj1.string = '我是一个新来的'

tag = soup.find('a')
v = tag.wrap(obj1)   # wrap: enclose the matched <a> inside the new <div>
print(soup)


tag = soup.find('body')
tag.append(soup.find('a'))   # append: move the matched <a> to the end of <body>
print(soup)

from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = '我是一个新来的'
tag = soup.find('body')
# tag.insert_before(obj)   # insert the new tag before <body>
tag.insert_after(obj)      # insert the new tag after <body>
print(soup)


tag = soup.find('p', recursive=True)    # recursive=True (default): search all descendants
print(tag)
tag = soup.find('body').find('p', recursive=False)   # recursive=False: search direct children only
print(tag)

tag = soup.find('a')
v = tag.get_text()   # get_text: the text content of the tag
print(v)

# attribute operations
tag = soup.find('a')
tag.attrs['lover'] = 'physics teacher'   # add an attribute
del tag.attrs['href']                    # delete an attribute
print(soup)

# children: direct child nodes
# both tags and text nodes are returned
from bs4.element import Tag
tags = soup.find('body').children
for tag in tags:
    if type(tag) == Tag:
        print(tag, type(tag))
    else:
        print('text....')

# descendants: all nodes in the subtree, recursively
tags = soup.find('body').descendants
print(list(tags))

tag = soup.find('body')
# serialize the tag's contents into bytes
print(tag.encode_contents())    # Chinese characters become encoded bytes; outputs everything inside <body>...</body> in one go
# serialize the tag's contents into a string
print(tag.decode_contents())    # decodes back to characters; outputs the contents of <body>...</body>
# print(str(tag))

 
