Python Crawler: the requests Library in Detail, Cookies, and Hands-On Practice


requests is a third-party library built on top of urllib. It is more powerful than urllib and very well suited for writing crawlers.

Installation: pip install requests

A simple example: crawling the Baidu home page:
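
A minimal sketch of such a request (the original snippet is not shown above, so this reconstruction is an assumption):

import requests

# Fetch the Baidu home page
response = requests.get("https://www.baidu.com")
print(response.status_code)  # 200 on success
print(response.text)         # the decoded page body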

 

 

The difference between response.text and response.content:

  • response.text is the body decoded into a string using the encoding requests guesses, so it can easily come out garbled.
  • response.content is the raw, undecoded binary body (bytes), suitable for text, images, and audio alike. For text, decode it explicitly with response.content.decode('utf-8'); see the sketch after this list.
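
A short sketch of the difference, assuming a UTF-8 page:

import requests

response = requests.get("https://www.baidu.com")
print(type(response.text))     # <class 'str'>: decoded with the guessed encoding
print(type(response.content))  # <class 'bytes'>: the raw body
print(response.content.decode('utf-8'))  # decode explicitly to avoid garbled text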

Request methods supported by the requests library:

import requests

requests.get("http://xxxx.com/")
requests.post("http://xxxx.com/post", data = {'key':'value'})
requests.put("http://xxxx.com/put", data = {'key':'value'})
requests.delete("http://xxxx.com/delete")
requests.head("http://xxxx.com/get")
requests.options("http://xxxx.com/get")

Passing parameters with a GET request:

  Set the params argument to a dictionary in the get method; requests will automatically splice the parameters onto the URL.

import requests

params = {
    "wd": "python", "pn": 10,
}

response = requests.get('https://www.baidu.com/s', params=params)
print(response.url)
print(response.text)
'''
A request header needs to be set here; otherwise Baidu's anti-crawling verification will block the request.
'''

Sending data with a POST request:

Just set the data argument in the post method. raise_for_status() will raise an exception if the request failed, so it indicates success or failure.

import requests


post_data = {'username': 'value1', 'password': 'value2'}

response = requests.post("http://xxx.com/login/", data=post_data)
response.raise_for_status()

Example of POSTing a file:

>>> import requests
>>> url = 'http://httpbin.org/post'
>>> files = {'file': open('report.xls', 'rb')}
>>> r = requests.post(url, files=files)

Setting request headers (headers):

Many sites have anti-crawling mechanisms; a request that does not carry request headers is likely to be banned.

import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/"
                  "537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

response1 = requests.get("https://www.baidu.com", headers=headers)
response2 = requests.post("https://www.xxxx.com", data={"key": "value"},
                          headers=headers)

print(response1.headers)
print(response1.headers['Content-Type'])
print(response2.text)

Setting a proxy (Proxy):

As an anti-crawling mechanism, some sites limit the number of requests from the same IP per unit of time; we can deal with this by setting IP proxies.

import requests

proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)

Obtaining and adding cookies:

Sometimes the pages we want to crawl are only accessible after logging in, in which case we need cookies to simulate login and keep the session alive.

When a user sends a request for the first time, the server typically generates a small piece of data and includes it in the response. If this small piece of data is stored on the client (in the browser or on disk), we call it a cookie. If it is stored on the server side, we call it a session. When the user next sends a request to a different page, the request automatically carries the cookie, so the server can tell that this user has already logged in or visited before.

You can inspect the cookies by printing response.cookies, which tells you whether the server generated a cookie after the first request.
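
For example, a minimal sketch:

import requests

response = requests.get("https://www.baidu.com")
print(response.cookies)  # a RequestsCookieJar; non-empty if the server set cookies
for name, value in response.cookies.items():
    print(name, '=', value)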

 

 

Adding cookies when sending a request:

  1. Set the cookies parameter:
    import requests
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/"
                     "537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    }
    
    cookies = {"cookie_name": "cookie_value", }
    response = requests.get("https://www.baidu.com", headers=headers, cookies=cookies)
  2. Instantiate a RequestsCookieJar, set values into it, and pass it as the cookies parameter to the get or post method, as sketched below.
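
A minimal sketch of option 2 (the domain and path values are illustrative):

    import requests

    # Build a cookie jar and set a cookie on it
    jar = requests.cookies.RequestsCookieJar()
    jar.set("cookie_name", "cookie_value", domain="www.baidu.com", path="/")

    response = requests.get("https://www.baidu.com", cookies=jar)
    print(response.status_code)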

     

Maintaining a session with Session:

A session differs from a cookie in that a session is generally stored on the server side. A Session object can persist certain parameters for us across requests, and it also keeps cookies across all requests made from the same Session instance.

To keep a conversation going, the best approach is to first create a Session object and use it to open URLs, rather than opening them directly with requests.get.

Each time we open another URL with this Session object, the request headers carry the cookies produced on the first request, continuing the session.
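
A minimal sketch (httpbin.org is used here just for illustration):

import requests

# Cookies set by one response are automatically sent on later requests
session = requests.Session()
session.get("http://httpbin.org/cookies/set/sessioncookie/123456789")
response = session.get("http://httpbin.org/cookies")
print(response.text)  # shows that the session kept the cookie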

 

Example:

Crawl the first 20 Baidu search results. (The results are still a bit off: there are too many redirects, so some of the hits are not the corresponding main entries.)

#coding: utf-8
'''
Crawl the titles and links of the first 20 Baidu search result pages
'''
import requests
import re
from bs4 import BeautifulSoup as bs

headers = {
    'Accept': 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}

def main(keyword):
    file_name = "{}.txt".format(keyword)
    # Create (or truncate) the output file
    open(file_name, 'w+', encoding='utf-8').close()
    # Two result pages of 10 entries each: pn=0 and pn=10
    for pn in range(0, 20, 10):
        params = {'wd': keyword, 'pn': pn}
        response = requests.get("https://www.baidu.com/s", params=params, headers=headers)
        soup = bs(response.content, 'html.parser')
        # Collect every <a> tag that has a non-empty href
        urls = soup.find_all(name='a', attrs={"href": re.compile('.')})
        for i in urls:
            # Baidu wraps result links in a redirect URL; follow it to the real page
            if 'http://www.baidu.com/link?url=' in i.get('href'):
                a = requests.get(url=i.get('href'), headers=headers)
                print(i.get('href'))
                soup1 = bs(a.content, 'html.parser')
                # Guard against pages without a <title>
                title = soup1.title.string if soup1.title and soup1.title.string else '(no title)'
                # Skip results that have already been written to the file
                with open(file_name, 'r', encoding='utf-8') as f:
                    seen = f.read()
                if a.url not in seen:
                    with open(file_name, 'a', encoding='utf-8') as f:
                        f.write(title + '\n')
                        f.write(a.url + '\n')

if __name__ == '__main__':
    keyword = 'Django'
    main(keyword)
    print("Download complete")

 
