I. Introduction to Web Crawlers
What a crawler is:
A crawler is a program that imitates browser behavior: it sends requests to a server and extracts the data it wants from the responses.
The crawler metaphor:
The Internet is like a big net, the data is the prey caught in it, and the crawler is the spider.
The value of crawlers:
The value of the data they collect.
The crawler workflow:
Send a request → get the response → parse the data → store the data.
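The four steps above can be sketched as a minimal pipeline. This is only an illustration: the HTML string, the regex, and the in-memory list are stand-ins, and a real crawler would fetch the page with `requests.get(url).text` in the first step.

```python
import re

def fetch(url):
    # Step 1, send a request: a real crawler would do requests.get(url).text.
    # The HTML below is an invented stand-in so the sketch runs offline.
    return '<html><a href="/page1">one</a><a href="/page2">two</a></html>'

def parse(html):
    # Step 3, parse the data: pull every link out of the page.
    return re.findall(r'href="([^"]+)"', html)

def store(items, storage):
    # Step 4, store the data: here just a list in memory, in practice a database.
    storage.extend(items)

storage = []
store(parse(fetch('http://example.com')), storage)
print(storage)  # ['/page1', '/page2']
```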
II. The HTTP Protocol
Request:
    URL: where the request goes
    Request method:
        GET: parameters appended to the URL as a query string, e.g. url?field=value&field=value
        POST: parameters carried in the request body (form data, JSON, or files)
    Request headers:
        Cookie: saved state (mainly used to record the user's login status)
        User-Agent: identifies the client
        Referer: tells the server which page you came from
        Server-specific fields
Response:
    Status code:
        2xx: request succeeded
        3xx: redirection
        4xx: client error
        5xx: server error
    Response headers:
        Location: redirect URL
        Set-Cookie: sets a cookie
        Server-specific fields
    Response body:
        1. HTML code
        2. Binary data: pictures, video, audio
        3. JSON
        4. JSONP
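The GET query-string format (`url?field=value&field=value`) can be built and taken apart with the standard library; the field names and URL here are only examples:

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Build a query string from fields: GET parameters travel in the URL itself.
qs = urlencode({'name': 'mac', 'age': 20})
url = 'http://httpbin.org/get?' + qs
print(url)  # http://httpbin.org/get?name=mac&age=20

# The server side splits it back into fields and values.
print(parse_qs(urlparse(url).query))  # {'name': ['mac'], 'age': ['20']}
```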
III. The requests Library
1. If a parameter appears both in the URL query string and in params, the two coexist and the values are merged into a list;
2. If cookies appear both in the headers dict and in the cookies argument, the Cookie header takes precedence;
3. The json and data arguments cannot be used together;
4. To save cookies locally, import http.cookiejar as cookiejar and call the jar's save method;
5. r.history is a list of response objects recording the redirects that led to the final response.
```python
''' requests '''
import requests

# requests.get()
# requests.post()
# requests.request(method='post')

# GET request
url = 'http://httpbin.org/get?name=mac&age=20&xxx=1000&xxx=yyy'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
    'Cookie': 'aaa=123;bbb=234',
}
params = {
    'zzz': 'mmm',
}
cookies = {
    'ccc': '999',
}
# url = 'http://www.tmall.com'
# r = requests.get(url=url, headers=headers, params=params, cookies=cookies, allow_redirects=True)
# print(r.headers)

# POST request
# url = 'http://httpbin.org/post'
# data = {
#     'name': 'mac',
# }
# json = {
#     'age': 18
# }
# json = [1, True]
# files = {
#     'files': open('xxx.txt', 'rt', encoding='utf-8')
# }
# r = requests.post(url=url, files=files)
# print(r.text)

# Redirect history
# url = 'http://www.tmall.com'
# r = requests.get(url=url)
# print(r.history[0].url)
# print(r.url)

# A session carries cookies across requests automatically
# session = requests.session()
# r = session.get('http://www.baidu.com')
# print(r.cookies)
# r = session.get('http://httpbin.org/get')
# print(r.text)

# Persist cookies locally
# import http.cookiejar as cookiejar
# session = requests.session()
# session.cookies = cookiejar.LWPCookieJar()
# session.cookies.load('1.txt')
# print(session.cookies)
# r = session.get('http://www.baidu.com')
# session.cookies.save('1.txt')

r = requests.get(url='http://www.xiaohuar.com')
print(r.text)
```
**Installation:** pip install requests
**Usage:**
    **Request:**
        ① GET request: response = requests.get(...)
            Parameters: url, headers={}, cookies={}, params={},
            proxies={'http': 'http://ip:port'}, timeout=0.5, allow_redirects=False
        ② POST request: response = requests.post(...)
            Parameters: url, headers={}, cookies={}, data={}, json={},
            files={'file': open(..., 'rb')}, timeout=0.5, allow_redirects=False
    **Session that saves cookies automatically:**
        session = requests.session()
        r = session.get(...)
        r = session.post(...)
        Storing cookies locally:
            import http.cookiejar as cookiejar
            session.cookies = cookiejar.LWPCookieJar()
            session.cookies.save(filename='1.txt')
            session.cookies.load(filename='1.txt')
    **Response:**
        r.url                requested URL
        r.text               response body as text
        r.encoding = 'gbk'   set the text encoding
        r.content            response body as bytes
        r.json()             shortcut for json.loads(r.text)
        r.status_code        status code
        r.headers            response headers
        r.cookies            cookies set by the server
        r.history            [redirect response 1, redirect response 2, ...]
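The cookie-persistence part can be tried without any network traffic: a cookiejar can be filled by hand, saved, and loaded back in a later session. The cookie name, value, domain, and file name below are invented for illustration.

```python
import http.cookiejar as cookiejar

jar = cookiejar.LWPCookieJar()
# Hand-built cookie; in real use the server's Set-Cookie header fills the jar.
jar.set_cookie(cookiejar.Cookie(
    version=0, name='aaa', value='123', port=None, port_specified=False,
    domain='httpbin.org', domain_specified=True, domain_initial_dot=False,
    path='/', path_specified=True, secure=False, expires=2147483647,
    discard=False, comment=None, comment_url=None, rest={}))
jar.save('1.txt')                 # same save call the notes use

restored = cookiejar.LWPCookieJar()
restored.load('1.txt')            # reload in a later session
print([(c.name, c.value) for c in restored])  # [('aaa', '123')]
```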
IV. Parsing (CSS Selectors)
1. Class selector: .classname
2. ID selector: #idvalue
3. Tag selector: tagname
4. Descendant selector: selector1 selector2
5. Child selector: selector1 > selector2
6. Attribute selectors: [attr], [attr=value], [attr^=value], [attr$=value], [attr*=value]
7. Group selector (OR): selector1, selector2, ...
8. Compound selector (AND): selector1selector2
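The first three selector types are simple enough to sketch by hand. This toy matcher (class names and HTML invented for illustration) shows what a selector engine does with them; real engines such as the one in requests-html support all eight:

```python
from html.parser import HTMLParser

class TinySelector(HTMLParser):
    """Toy engine covering only .class, #id, and tag selectors."""

    def __init__(self):
        super().__init__()
        self.elements = []                      # (tag, attrs) per start tag

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs)))

    def select(self, selector):
        if selector.startswith('.'):            # 1. class selector
            return [e for e in self.elements
                    if selector[1:] in e[1].get('class', '').split()]
        if selector.startswith('#'):            # 2. id selector
            return [e for e in self.elements
                    if e[1].get('id') == selector[1:]]
        return [e for e in self.elements if e[0] == selector]  # 3. tag selector

page = '<div id="main"><a class="link hot" href="/a">x</a><a class="link" href="/b">y</a></div>'
p = TinySelector()
p.feed(page)
print(len(p.select('a')))          # 2
print(len(p.select('.hot')))       # 1
print(p.select('#main')[0][0])     # div
```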
**Installation:** pip install requests-html
**Usage:**
    **Request:**
        from requests_html import HTMLSession
        session = HTMLSession()
        Parameters: session.browser.args = ['--no-sandbox', '--user-agent=xxxxx']
        response = session.request(...)
        response = session.get(...)
        response = session.post(...)
    **The request parameters are the same as in the requests module.**
    **Response:**
        r.url    (attributes are the same as in the requests module)
    **Parsing:**
        **HTML object attributes:**
            r.html.absolute_links    absolute paths
            .links                   relative paths
            .base_url                root path
            .html                    the page's HTML, same as r.text
            .text                    text content of all tags on the page
            .encoding = 'gbk'        (check the page encoding with document.charset)
            .raw_html                binary stream, same as r.content
            .pq                      hands off to the pyquery library
        **HTML object methods:**
            r.html.find('css selector')              → [element1, ...]
            .find('css selector', first=True)        → element1
            .xpath('xpath selector')
            .xpath('xpath selector', first=True)
            .search('template')                      ('xxx{}yyy{}')[0], ('xxx{name}yyy{pwd}')['name']
            .search_all('template')
            .render(...)                             afterwards r.html.html is the rendered HTML text
                Parameters:
                    script: """() => { js code }"""
                    scrolldown: n
                    sleep: n
                    keep_page: True/False            keeps the browser page from closing
        **Bypassing browser (webdriver) detection:**
            () => {
                Object.defineProperties(navigator, {
                    webdriver: {
                        get: () => undefined
                    }
                })
            }
        **Interacting with the browser: r.html.page.xxx**
            async def xxx():
                await r.html.page.xxx
            session.loop.run_until_complete(xxx())
            .screenshot({'path': path})
            .evaluate('''() => { js code }''')
            .cookies()
            .type('css selector', 'content', {'delay': 100})
            .click('css selector')
            .focus('css selector')
            .hover('css selector')
            .waitForSelector('css selector')
            .waitFor(1000)
        **Keyboard events: r.html.page.keyboard.xxx**
            .down('Shift')
            .up('Shift')
            .press('ArrowLeft')
            .type('i like you', {'delay': 100})
        **Mouse events: r.html.page.mouse.xxx**
            .click(x, y, {'button': 'left', 'clickCount': 1, 'delay': 0})
            .down({'button': 'left'})
            .up({'button': 'left'})
            .move(x, y, {'steps': 1})
Element attributes:

```python
a = r.html.find('[class="special"] dd a', first=True)
print(a.absolute_links)       # absolute links inside the element
print(a.links)                # relative links
print(a.text)                 # text content
print(a.html)                 # the element's HTML
print(a.attrs.get('href'))    # value of one attribute
```
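The .xpath() calls above take XPath expressions. The standard library's ElementTree supports a small XPath subset, which is enough to sketch the idea offline; the HTML fragment below is invented, and on a real page the same queries would run through r.html.xpath():

```python
import xml.etree.ElementTree as ET

fragment = '''<dl class="special">
    <dd><a href="/girl/1">one</a></dd>
    <dd><a href="/girl/2">two</a></dd>
</dl>'''
root = ET.fromstring(fragment)

# './/a' selects every <a> descendant, at any depth.
links = root.findall('.//a')
print([a.get('href') for a in links])   # ['/girl/1', '/girl/2']
print(links[0].text)                    # one

# Attribute predicate, analogous to [class="special"] in a CSS selector.
print(root.findall('.[@class="special"]')[0].tag)   # dl
```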
V. Common Databases
### MongoDB 4.0 (if you don't want it on the C drive, delete the last line in the configuration file):
Download: https://www.mongodb.com/
Installation: (omitted)
Note: before use, edit the configuration file mongodb.cfg in the bin directory and delete its last line (the 'mp' field).
#### 1. Start and stop the service
net start mongodb
net stop mongodb
#### 2. Create an administrator user
mongo
use admin
db.createUser({user:"yxp",pwd:"997997",roles:["root"]})
#### 3. Connect to MongoDB with the account and password
mongo -u adminUserName -p userPassword --authenticationDatabase admin
#### 4. Databases
View databases:
show dbs                      list all databases
Switch database:
use db_name                   switch to db_name
Create a database:
db.table1.insert({'a': 1})    creates the database (switch to it first; inserting data creates the table and the database)
Delete a database:
db.dropDatabase()             deletes the current database (switch to it before deleting)
#### 5. Tables
Switch to the target database before use.
View tables:
show tables                   list all tables
Create a table:
db.table1.insert({'b': 2})    inserting creates the table if it does not exist
Delete a table:
db.table1.drop()              delete the table
#### 6. Data
db.test.insert(user0)                                       insert one document
db.user.insertMany([user1, user2, user3, user4, user5])     insert many documents
db.user.find({'name': 'alex'})                              find where field == value
db.user.find({'name': {'$ne': 'alex'}})                     find where field != value
db.user.find({'_id': {'$gt': 2}})                           find where field > value
db.user.find({'_id': {'$gte': 2}})                          find where field >= value
db.user.find({'_id': {'$lt': 3}})                           find where field < value
db.user.find({'_id': {'$lte': 2}})                          find where field <= value
db.user.update({'_id': 2}, {'$set': {'name': 'WXX'}})       modify data
db.user.deleteOne({'age': 8})                               delete the first match
db.user.deleteMany({'addr.country': 'china'})               delete all matches
db.user.deleteMany({})                                      delete everything
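The comparison operators above ($ne, $gt, $gte, $lt, $lte) map directly onto Python's comparison operators. The tiny matcher below is only a sketch of the query semantics, not MongoDB's implementation; the sample documents are invented:

```python
import operator

# Each Mongo comparison operator corresponds to a Python comparison.
OPS = {'$ne': operator.ne, '$gt': operator.gt, '$gte': operator.ge,
       '$lt': operator.lt, '$lte': operator.le}

def matches(doc, query):
    # A document matches when every field condition holds:
    # either an exact value, or a {'$op': value} comparison dict.
    for field, cond in query.items():
        if isinstance(cond, dict):
            if not all(OPS[op](doc.get(field), val) for op, val in cond.items()):
                return False
        elif doc.get(field) != cond:
            return False
    return True

users = [{'_id': 1, 'name': 'alex'},
         {'_id': 2, 'name': 'egon'},
         {'_id': 3, 'name': 'wxx'}]
print([u['_id'] for u in users if matches(u, {'name': 'alex'})])          # [1]
print([u['_id'] for u in users if matches(u, {'_id': {'$gt': 2}})])       # [3]
print([u['_id'] for u in users if matches(u, {'name': {'$ne': 'alex'}})]) # [2, 3]
```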
#### pymongo

```python
import pymongo

conn = pymongo.MongoClient(host=host, port=port, username=username, password=password)
db = conn['db_name']    # switch database
table = db['table']     # get a collection (table)
table.insert({})        # insert data   (insert_one/insert_many in modern pymongo)
table.remove({})        # delete data   (delete_one/delete_many in modern pymongo)
table.update({'_id': 2}, {'$set': {'name': 'WXX'}})  # modify data (update_one)
table.find({})          # query data
```