Web Crawlers 2 (1)

I. Introduction to crawlers

What a crawler is:
    a crawler imitates browser behavior, sending requests to a server and fetching the application's data

Crawler metaphor:
    the Internet is like a big net, the data is the prey caught in it, and the crawler is the spider

Value of crawlers:
    the value of the data they collect

Crawler workflow:
    send a request - get the data - parse the data - store the data
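
As a concrete illustration of the workflow above, here is a minimal sketch; httpbin.org and the output filename are placeholders chosen for the example, not part of the original notes.

```
import json

import requests

# 1. send a request (httpbin.org is a public echo service, used as a stand-in target)
r = requests.get('http://httpbin.org/get', params={'q': 'demo'})

# 2. get the data
raw = r.text

# 3. parse the data (this body happens to be JSON, so json.loads is enough)
parsed = json.loads(raw)

# 4. store the data
with open('result.json', 'w', encoding='utf-8') as f:
    json.dump(parsed, f, ensure_ascii=False, indent=2)
```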

II. HTTP protocol basics

Request:
    Request URL: where the request goes
    Request method:
        GET:
            parameters travel in the query string: ?field=value&field=value
        POST:
            parameters travel in the request body:
                form data
                json
                files
    Request headers:
        Cookie: stores information (mainly: records the user's login state)
        User-Agent: identifies the client
        Referer: tells the server which page you came from
        server-specific fields

Response:
    Status code:
        2xx:
            request succeeded
        3xx:
            redirection
        4xx:
            client error
        5xx:
            server error
    Response headers:
        Location: redirect URL
        Set-Cookie: sets cookies
        server-specific fields
    Response body:
        1. html source
        2. binary: pictures, video, audio
        3. json
        4. jsonp
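
A small sketch tying the request anatomy to the response anatomy; httpbin.org is a stand-in echo server and the header values are illustrative only.

```
import requests

# request side: method GET, query string, and the headers described above
r = requests.get(
    'http://httpbin.org/get?field=value',
    headers={
        'User-Agent': 'Mozilla/5.0',       # client identity
        'Referer': 'http://example.com',   # where we claim to come from
        'Cookie': 'aaa=123',               # login state would live here
    },
)

# response side: status code, headers, body
print(r.status_code)                 # 2xx success / 3xx redirect / 4xx / 5xx
print(r.headers.get('Set-Cookie'))   # cookie settings, if any were sent
print(r.text[:200])                  # body (html / binary / json / jsonp)
```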

III. The requests library

1. If a parameter spliced onto the URL repeats one given in params, both values coexist (they show up as a list; see the sketch below);

2. If the cookies argument repeats the Cookie field in headers, the cookie in headers takes precedence;

3. json cannot coexist with data;

4. To save cookies locally, import http.cookiejar as cookiejar and persist them with the save method;

5. history is a list of response objects recording the redirects.
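
A quick sketch of points 1 and 3, assuming httpbin.org is reachable; the field names are arbitrary.

```
import requests

# point 1: 'xxx' appears both in the URL and in params, so both values coexist
r = requests.get('http://httpbin.org/get?xxx=1', params={'xxx': '2'})
print(r.json()['args'])    # {'xxx': ['1', '2']}

# point 3: json= and data= cannot be combined, so send one or the other
r = requests.post('http://httpbin.org/post', json={'age': 18})
print(r.json()['json'])    # {'age': 18}
```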

'''
requests
'''
import requests

# requests.get()
# requests.post()
# requests.request(method='post')


# GET request
url = 'http://httpbin.org/get?name=mac&age=20&xxx=1000&xxx=yyy'

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
    'Cookie':'aaa=123;bbb=234'
}

params = {
    'zzz':'mmm',
}

cookies = {
    'ccc':'999'
}

# url = 'http://www.tmall.com'
#
# r = requests.get(url=url,headers=headers,params=params,cookies=cookies,allow_redirects=True)
# print(r.headers)


# POST request
# url = 'http://httpbin.org/post'
#
# data = {
#     'name':'mac',
# }
#
# json = {
#     'age':18
# }
#
# json = [1,True]
#
# files = {
#     'files':open('xxx.txt','rb')  # open in binary mode for upload
# }
#
# r = requests.post(url=url,files=files)
# print(r.text)



# url = 'http://www.tmall.com'
# r = requests.get(url=url)
# print(r.history[0].url)
# print(r.url)


# session = requests.session()
# r = session.get('http://www.baidu.com')
# print(r.cookies)
# r = session.get('http://httpbin.org/get')
# print(r.text)

# import http.cookiejar as cookiejar
#
# session = requests.session()
# session.cookies = cookiejar.LWPCookieJar()
#
# session.cookies.load('1.txt')
# print(session.cookies)
# r = session.get('http://www.baidu.com')
# session.cookies.save('1.txt')


r = requests.get(url='http://www.xiaohuar.com')
print(r.text)
**Installation**: pip install requests

**Usage:**

**Request:**

**① GET request:**

```
response object = requests.get(......)

parameters:

url

headers = {}

cookies = {}

params = {}

proxies = {'http': 'http://ip:port'}

timeout = 0.5

allow_redirects = False
```


**② POST request:**

```
response object = requests.post(......)

parameters:

url

headers = {}

cookies = {}

data = {}

json = {}

files = {'file': open(..., 'rb')}

timeout = 0.5

allow_redirects = False
```
**Session requests (cookies saved automatically):**

```
session = requests.session()

r = session.get(......)

r = session.post(......)

supplement (storing cookies locally):

    import http.cookiejar as cookielib
    session.cookies = cookielib.LWPCookieJar()
    session.cookies.save(filename='1.txt')

    session.cookies.load(filename='1.txt')
```
**Response:**

```
r.url                 the requested URL

r.text                response body as text

r.encoding = 'gbk'    set the charset used to decode r.text

r.content             response body as binary

r.json()              same as json.loads(r.text)

r.status_code         status code

r.headers             response headers

r.cookies             cookies returned by the server

r.history             [response object 1, response object 2, ...]
```
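
A short illustration of these response attributes; again httpbin.org stands in for a real target.

```
import requests

r = requests.get('http://httpbin.org/get', cookies={'ccc': '999'})

print(r.url)                       # final URL, query string included
print(r.status_code)               # e.g. 200
print(r.headers['Content-Type'])   # response headers
print(r.json())                    # shorthand for json.loads(r.text)
print(r.history)                   # empty list here: no redirect happened
```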

IV. Parsing (CSS selectors)

1. Class selector
    .classname
2. ID selector
    #idvalue
3. Tag selector
    tagname
4. Descendant selector
    selector1 selector2
5. Child selector
    selector1 > selector2
6. Attribute selector
    [attr]
    [attr=value]
    [attr^=value]
    [attr$=value]
    [attr*=value]
7. Group selector (or)
    selector1, selector2, ...
8. Multi-condition selector (and)
    selector1selector2

(each of these is demonstrated in the sketch below)
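
A demonstration using the HTML class from requests-html (introduced in the next section) on an inline document written for this example:

```
from requests_html import HTML

doc = HTML(html="""
<div id="top" class="box">
    <a class="special" href="/a">first</a>
    <span>plain</span>
    <a href="/b" title="hint">second</a>
</div>
""")

print(doc.find('.special'))      # 1. class selector
print(doc.find('#top'))          # 2. id selector
print(doc.find('a'))             # 3. tag selector
print(doc.find('div a'))         # 4. descendant selector
print(doc.find('div > a'))       # 5. child selector
print(doc.find('a[title]'))      # 6. attribute selector
print(doc.find('a, span'))       # 7. group selector (or)
print(doc.find('a.special'))     # 8. multi-condition selector (and)
```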

V. requests-html

**Installation:** pip install requests-html

**Usage:**

**Request:**

```
from requests_html import HTMLSession

session = HTMLSession()

parameters:

browser.args = ['--no-sandbox', '--user-agent=xxxxx']

response object = session.request(......)

response object = session.get(......)

response object = session.post(......)
```
**The parameters are the same as in the requests module.**

**Response:**

```
r.url    the attributes are the same as in the requests module
```

**Parsing:**

**HTML object properties:**

```
r.html.absolute_links    absolute paths

.links                   relative paths

.base_url                root path

.html                    corresponds to r.text

.text                    text content of all the page's tags

.encoding = 'gbk'        (document.charset shows the page's encoding)

.raw_html                corresponds to r.html as a binary stream

.pq                      hands the document to the pyquery library
```
**HTML object methods:**

```
r.html.find('css selector')           [element1, .......]

.find('css selector', first=True)     element1

.xpath('xpath selector')

.xpath('xpath selector', first=True)

.search('template')

    ('xxx{}yyy{}')[0]

    ('xxx{name}yyy{pwd}')['name']

.search_all('template')

.render(.....)    after rendering, r.html.html holds the rendered html text
```

**Parameters:**

```
script: """() => {

    js code

    js code

}

"""

scrolldown: n

sleep: n

keep_page: True/False    keeps the browser page open so it can be interacted with
```
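
A sketch of a render call with these parameters; the URL and the scroll/sleep values are illustrative only.

```
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://httpbin.org/html')

# execute the page's JS in a headless browser, scroll 3 times,
# pause 1 second, and keep the page open for later interaction
r.html.render(scrolldown=3, sleep=1, keep_page=True)

print(r.html.html[:200])   # the rendered page source
```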


Bypassing browser detection:

```
() => {
    Object.defineProperties(navigator, {
        webdriver: {
            get: () => undefined
        }
    })
}
```
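
One way to wire this script in is render()'s script parameter, continuing with the r from the previous sketch:

```
# the detection-bypass script above, passed through render()'s script parameter
stealth_js = """
() => {
    Object.defineProperties(navigator, {
        webdriver: {
            get: () => undefined
        }
    })
}
"""

r.html.render(script=stealth_js, keep_page=True)
```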
**Interacting with the browser: r.html.page.xxx**

```
async def xxx():

    await r.html.page.xxx

session.loop.run_until_complete(xxx())

.screenshot({'path': path})

.evaluate('''() => {js code}''')

.cookies()

.type('css selector', 'content', {'delay': 100})

.click('css selector')

.focus('css selector')

.hover('css selector')

.waitForSelector('css selector')

.waitFor(1000)
```
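
Putting the await pattern together; this assumes a prior render(..., keep_page=True) left the page open, and the 'input' selector is a placeholder:

```
async def interact():
    page = r.html.page
    await page.screenshot({'path': 'page.png'})        # save a screenshot
    await page.waitForSelector('input')                # wait until the element exists
    await page.type('input', 'hello', {'delay': 100})  # type with a 100 ms key delay

session.loop.run_until_complete(interact())
```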

**Keyboard events: r.html.page.keyboard.xxx**

```
.down('Shift')

.up('Shift')

.press('ArrowLeft')

.type('some text', {'delay': 100})
```

**Mouse events: r.html.page.mouse.xxx**

```
.click(x, y, {
    'button': 'left',
    'clickCount': 1,
    'delay': 0
})

.down({'button': 'left'})

.up({'button': 'left'})

.move(x, y, {'steps': 1})
```
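
The same await pattern drives the keyboard and mouse; the key and coordinates below are placeholders:

```
async def press_and_click():
    await r.html.page.keyboard.press('ArrowLeft')      # single key press
    await r.html.page.mouse.click(100, 200, {'button': 'left', 'clickCount': 1})

session.loop.run_until_complete(press_and_click())
```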

Element attributes:

    a = r.html.find('[class="special"] dd a', first=True)
    print(a.absolute_links)
    print(a.links)
    print(a.text)
    print(a.html)
    print(a.attrs.get('href'))

VI. Common databases

### MongoDB 4.0 (if you do not want it on the C drive, delete the last line of the configuration file):

Download: https://www.mongodb.com/

Installation: omitted

Note: before first use, edit the mongodb.cfg file in the bin directory and delete the 'mp' field on its last line

#### 1. Starting and stopping the service

net start mongodb

net stop mongodb

#### 2. Create an administrator user

mongo

use admin

db.createUser({user:"yxp",pwd:"997997",roles:["root"]})

#### 3. Connect to MongoDB with the account and password

mongo -u adminUserName -p userPassword --authenticationDatabase admin

#### 4. Databases

View databases:
    show dbs
Switch databases:
    use db_name
Create a database:
    db.table1.insert({'a': 1})    creates the database (switch to it, then insert data into a table)
Delete a database:
    db.dropDatabase()    deletes the current database (switch to it first)

#### 5. Tables

Switch to the database before use.
View tables:
    show tables    lists all the tables
Create a table:
    db.table1.insert({'b': 2})    creates the table if it does not exist
Delete a table:
    db.table1.drop()

#### 6. Data

db.test.insert(user0)                                      insert one
db.user.insertMany([user1, user2, user3, user4, user5])    insert many
db.user.find({'name': 'alex'})                             query xx == xx
db.user.find({'name': {'$ne': 'alex'}})                    query xx != xx
db.user.find({'_id': {'$gt': 2}})                          query xx > xx
db.user.find({'_id': {'$gte': 2}})                         query xx >= xx
db.user.find({'_id': {'$lt': 3}})                          query xx < xx
db.user.find({'_id': {'$lte': 2}})                         query xx <= xx
db.user.update({'_id': 2}, {'$set': {'name': 'WXX'}})      modify data
db.user.deleteOne({'age': 8})                              delete the first match
db.user.deleteMany({'addr.country': 'china'})              delete all matches
db.user.deleteMany({})                                     delete everything

### pymongo

conn = pymongo.MongoClient(host=host, port=port, username=username, password=password)
db = conn['db_name']                                       switch database
table = db['table']
table.insert({})                                           insert data
table.remove({})                                           delete data
table.update({'_id': 2}, {'$set': {'name': 'WXX'}})        modify data
table.find({})                                             query data
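
insert/remove/update above are the legacy pymongo method names; current pymongo spells them insert_one/insert_many, delete_one/delete_many, and update_one/update_many. A minimal runnable sketch, assuming an unauthenticated local MongoDB on the default port:

```
import pymongo

conn = pymongo.MongoClient(host='localhost', port=27017)
db = conn['db_name']
table = db['user']

table.insert_one({'_id': 2, 'name': 'alex', 'age': 8})     # insert data
table.update_one({'_id': 2}, {'$set': {'name': 'WXX'}})    # modify data
print(list(table.find({})))                                # query data
table.delete_many({})                                      # delete data
```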

 

 
