Python crawler basics

Basic network knowledge

JSON (a lightweight data-interchange format; the methods whose names end in 's', dumps and loads, do not involve file operations)

1. json.dumps(): converts data in a specific format (e.g., a dict or list) into a string

For example, a list or dictionary can be converted to a string and then written to a file; writing data to a JSON file this way requires the dumps step first.
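A minimal sketch of the dumps step (the dictionary here is a made-up example):

import json

data = {'name': 'fishc', 'tags': ['python', 'crawler']}

json_str = json.dumps(data)       # dict -> JSON string; no file involved
print(type(json_str), json_str)   # <class 'str'> {"name": "fishc", ...}

# writing a JSON file "by hand": dumps first, then write the string
with open('data.json', 'w', encoding='utf-8') as f:
    f.write(json_str)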
  
 

2. json.dump(): converts the data into a str and writes it to a JSON file in a single step

json.dump(data, fp, ensure_ascii=False) takes up to three notable arguments; the third, ensure_ascii=False, prevents non-ASCII characters from being garbled when written.
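A sketch of dump with ensure_ascii (the data is a made-up example containing non-ASCII text):

import json

data = {'title': 'python 爬虫'}   # contains non-ASCII characters

# dump writes straight to the file object in one step;
# ensure_ascii=False keeps the characters readable instead of \uXXXX escapes
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)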


 

3. json.loads(): converts string data back into the original data structure, such as a dictionary or list



4. json.load(): reads data from a JSON file; json.load(fp) restores the original data structure stored in the file, such as a list or dictionary. When opening the file, it is best to pass encoding='utf-8' so the data comes back exactly as it was, without distortion.
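A sketch of both read directions (assumes data.json was written as above):

import json

# loads: JSON string -> original structure; no file involved
d = json.loads('{"key1": "value1"}')
print(d['key1'])   # value1

# load: read and restore directly from a file opened with utf-8
with open('data.json', encoding='utf-8') as f:
    data = json.load(f)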




URL (Uniform Resource Locator): a network address

URL format: protocol://hostname[:port]/path/[;parameters][?query]#fragment
The parts in brackets are optional.

The first part is the protocol: http, https, ftp, file

The second part is the domain name or IP address of the server hosting the resource (sometimes including a port number; each transmission protocol has a default port, e.g., 80 for HTTP)

The third part is the specific location of the resource (such as a directory or file name)
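A quick way to see these parts is urllib.parse.urlparse (the URL here is a made-up example):

from urllib.parse import urlparse

parts = urlparse('http://www.example.com:80/path/page.html?query=1#top')
print(parts.scheme)   # http                 (first part: protocol)
print(parts.netloc)   # www.example.com:80   (second part: host and port)
print(parts.path)     # /path/page.html      (third part: resource location)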



urllib package (url + lib; the parse and request modules are the ones mainly used)

parse module

urlencode(): converts a dictionary into URL query-string format
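A one-line sketch (the query parameters are made up):

from urllib.parse import urlencode

print(urlencode({'wd': 'python', 'pn': 2}))   # wd=python&pn=2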



request module

urlopen(url, data=None): opens the URL and returns a response (an http.client.HTTPResponse object)

The url parameter can be a URL string or a Request object.

When data is None the request is a GET; assign data when submitting a POST.

The response object roughly comprises the methods read(), readinto(), getheader(), getheaders(), fileno(), and the attributes msg, version, status, reason, debuglevel, and closed.

read() alone returns the page content undecoded (a byte stream, e.g., for images).

read() combined with decode() decodes the content with the given codec and returns the corresponding object (e.g., decode('utf-8') returns a string).


Other members: geturl(), getcode() (gets the response status code), info().
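A minimal sketch tying these members together (http://www.fishc.com is reused from the example later in these notes):

import urllib.request

response = urllib.request.urlopen('http://www.fishc.com')
print(response.status, response.reason)   # e.g. 200 OK
print(response.getheaders())              # list of (name, value) pairs
html = response.read().decode('utf-8')    # bytes -> str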



Request: builds a request object carrying richer information, including the headers (request header) data
request = urllib.request.Request(url=url, data=data,
                                 headers=headers, method='POST')

The data parameter must be of bytes (byte-stream) type; if it is a dictionary, encode it first with urllib.parse.urlencode().
If data is empty, method defaults to GET; if data is not empty, method defaults to POST.

data = urllib.parse.urlencode(a_dict).encode('utf-8')
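Putting the pieces together, a runnable POST sketch (httpbin.org, used in the requests examples below, serves as a stand-in test server):

import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
data = urllib.parse.urlencode({'key': 'value'}).encode('utf-8')   # dict -> bytes
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}

req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))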


Inspecting a site (browser developer tools)

Viewing the request method (in the Network panel, right-click the status column header to display the Method column, then click a request):


POST (submit): submits processed data to the specified server

headers:

1) General:
remote address (with port number), request URL, request method (e.g., POST)

2) Request Headers:
the User-Agent tells the server whether the visit comes from a browser (or from code)

3) Form Data: the main content submitted by POST


GET: requests data from the server



Disguising the crawler (modifying headers)

1) Modify via the headers parameter of Request

2) Modify via the Request.add_header(key, value) method

e.g., modifying a request's User-Agent.

Disguise, first version:

# disguise: the User-Agent is what tells the server a browser is visiting
header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"}

# wrap it up: create a Request object
req = request.Request(url=base_url, headers=header, data=data_str)

# open it: read() gives the page bytes, decode() returns the str content
html = request.urlopen(req).read().decode("utf-8")


Web crawling with proxies

Non-decoded version:

import urllib.request as r
response = r.urlopen('http://www.fishc.com')
response.read()

Proxies (servers with multiple IPs that access other web pages and fetch data on your behalf)

1. Build a parameter dictionary {'type': 'proxy ip:port'}
key: the proxy type (e.g., 'http'); value: the corresponding IP and port

proxy_support = urllib.request.ProxyHandler({})


2. Create a customised opener
opener = urllib.request.build_opener(proxy_support)

Ordinary access uses the default opener; here we customise one so pages are visited through the proxy IP.


3. Install the opener (makes the proxy permanent): urllib.request.install_opener(opener)

4. Or call the opener directly (use the special opener for a single page):
opener.open(url)


Proxy, first version:

import urllib.request as r

url='https://www.kuaidaili.com/free/'

proxy_support = r.ProxyHandler({'http':'58.22.177.200:9999'})

opener = r.build_opener(proxy_support)

r.install_opener(opener)

response = r.urlopen(url)

html = response.read().decode('utf-8')

print(html)


Proxy + disguise, first version:

opener.addheaders = [(key, value)]

e.g., opener.addheaders = [('User-Agent', '*************')]
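Combining the two, a sketch that routes through the sample proxy above while disguising the User-Agent (both values are examples):

import urllib.request as r

proxy_support = r.ProxyHandler({'http': '58.22.177.200:9999'})   # sample proxy from above
opener = r.build_opener(proxy_support)
# disguise the customised opener as a browser
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)')]
r.install_opener(opener)
html = r.urlopen('http://www.fishc.com').read().decode('utf-8')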



Download web content

Image download, first version:

import urllib.request as r

req =r.Request('http://placekitten.com/g/500/600')

response = r.urlopen(req)

cat_img = response.read()

with open('cat_500_600.jpg', 'wb') as f:
    f.write(cat_img)


# fp = open(filename) is roughly equivalent to with open(filename) as fp (the with form closes the file automatically)



requests module (a third-party Python library; it makes handling URL resources much more convenient)

https://www.cnblogs.com/lei0213/p/6957508.html

requests is an Apache2-licensed open-source HTTP library written in Python on top of urllib; it is more convenient to use than urllib.

requests supports HTTP keep-alive and connection pooling, session cookie persistence, file upload, automatic decoding of response content, and automatic internationalised encoding of URLs and POST data.


requests.get()

requests.get(url, params=None, headers=None, cookies=None, auth=None, timeout=None)
Sends a GET request and returns a Response object.

Parameters:
url - the URL of the new Request object.
params - (optional) dictionary of parameters to send with the GET request (the query-string data).

headers - (optional) dictionary of HTTP headers to send.

e.g., headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'}


cookies - (optional) CookieJar object to send with the request.
auth - (optional) AuthObject to enable basic HTTP authentication.
timeout - (optional) float describing the request timeout in seconds.

Without parameters: requests.get(url)

With parameters: requests.get(url, params=a_dict)   # GET request with parameters
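A sketch of a parameterised GET (httpbin.org echoes the request back):

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('http://httpbin.org/get', params=payload)
print(response.url)   # http://httpbin.org/get?key1=value1&key2=value2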


Request methods

requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")

requests.post()

A: application/x-www-form-urlencoded == the most common way to POST data, submitted as form data

url = 'http://httpbin.org/post'
data = {'key1':'value1','key2':'value2'}
r = requests.post(url, data=data)

B: application/json == submit the data in JSON format

url_json = 'http://httpbin.org/post'
data_json = json.dumps({'key1': 'value1', 'key2': 'value2'})
# dumps: encodes a Python object as a JSON string

r_json = requests.post(url_json,data_json)


C: multipart/form-data == generally used to upload files (less common)

url = 'http://httpbin.org/post'
files = {'file':open('E://report.txt','rb')}
r = requests.post(url,files=files)



The Response object (requests.Response)

The object is a requests.Response.
Printing the response shows: <Response [200]>

response.content vs. response.text

response.text: returns text data (type str, unicode); usually the encoding needs to be set to utf-8 first, otherwise the text may come out garbled

response.encoding = 'utf-8'

response.content: returns binary data of type bytes (for images, video, files); it can be written to disk directly without decoding, or decoded to utf-8 if a string is needed.

response.content.decode() returns a utf-8 string (it decodes with utf-8 by default)
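A sketch contrasting the two (httpbin.org again serves as the test server):

import requests

response = requests.get('http://httpbin.org/get')
response.encoding = 'utf-8'     # set the encoding before reading .text
print(type(response.text))      # <class 'str'>
print(type(response.content))   # <class 'bytes'>
print(response.content.decode('utf-8') == response.text)   # True here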



response is the response object; a POST request returns the same kind of object with the same content attributes.



response.json()

Equivalent to json.loads(response.text) (turns a JSON string into a dictionary)
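A one-step sketch (httpbin.org/get returns JSON):

import requests

response = requests.get('http://httpbin.org/get')
d = response.json()   # same as json.loads(response.text)
print(d['url'])       # http://httpbin.org/get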


Intrinsic properties

Cookies are used for simulated login and for keeping a session alive (see the sketch below).
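A minimal session sketch, using httpbin.org's cookie endpoints (an assumption; any cookie-setting site works the same way):

import requests

s = requests.Session()
# the first request sets a cookie; the session carries it to the next one
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)   # {"cookies": {"sessioncookie": "123456789"}}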

# print the status (status code) of the requested page
print(type(response.status_code), response.status_code)

# print all the header information of the requested URL
print(type(response.headers), response.headers)

# print the cookies of the requested URL
print(type(response.cookies), response.cookies)

# print the address of the requested URL
print(type(response.url), response.url)

# print the request history (displayed as a list)
print(type(response.history), response.history)

# the encoding used for decoding
response.encoding

The normal status code is 200:

# if the returned status code is not normal, report a 404 error
if response.status_code != requests.codes.ok:
    print('404')


# if the page returns status code 200, print it
response = requests.get('http://www.jianshu.com')
if response.status_code == 200:
    print('200')




proxy

1. Ordinary proxy settings

import requests
 
proxies = {
  "http": "http://127.0.0.1:9743",
  "https": "https://127.0.0.1:9743",
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)

2. Proxy with a username and password
import requests
 
proxies = {
    "http": "http://user:[email protected]:9743/",
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)

3. SOCKS proxy settings

Install the socks module: pip3 install 'requests[socks]'

import requests
 
proxies = {
    'http': 'socks5://127.0.0.1:9742',
    'https': 'socks5://127.0.0.1:9742'
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)



Timeout settings

The timeout parameter sets how long to wait before timing out:
import requests
from requests.exceptions import ReadTimeout
 
try:
    # require a response within 500 ms, otherwise a ReadTimeout is raised
    response = requests.get("http://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')

Authentication Settings

If you hit a site that requires authentication, it can be handled with the requests.auth module.

import requests
from requests.auth import HTTPBasicAuth
# method 1
r = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user', '123'))
# method 2
r = requests.get('http://120.27.34.24:9001', auth=('user', '123'))
print(r.status_code)





certificate

When requesting an HTTPS site, requests verifies the SSL certificate and throws an exception if verification fails.

Disabling certificate verification:

import requests
# disable verification; a certificate warning will still be printed
response = requests.get('https://www.12306.cn',verify=False)
print(response.status_code)


Suppressing the certificate-verification warning:

from requests.packages import urllib3
import requests
 
urllib3.disable_warnings()
response = requests.get('https://www.12306.cn',verify=False)
print(response.status_code)


Manually set the certificate:
response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)

