Python web crawler
- Basic network knowledge
- JSON (a lightweight data-interchange format; dumps/loads work on strings, not files)
- URL (Uniform Resource Locator): a network address
- urllib package (url + lib; mainly the parse and request modules)
- Site inspection
- Hiding the crawler (modifying headers)
- Web crawling with proxies
- Downloading web content
- requests module (a third-party Python library; handles URL resources more conveniently)
Basic network knowledge
JSON (a lightweight data-interchange format)
1. json.dumps(obj): serializes data into a JSON-formatted string
For example, a list or dictionary can be turned into a string this way and then written to a .json file; dumps itself performs no file operation, so writing the json file is a separate step after dumps.
2. json.dump(obj, fp): converts the data to a JSON string and writes it to a file in one step
json.dump(data, file_object, ensure_ascii=False) takes these three arguments; the third, ensure_ascii=False, keeps non-ASCII (unicode) text from being garbled when written.
3. json.loads(s): converts a JSON string back into the original data structure, such as a dictionary or list
4. json.load(fp): reads JSON data from a file; json.load(file_object) restores the original data structure stored in the json file, such as a list or dictionary. Open the file with encoding='utf-8' so the data comes back as the original, without garbling.
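The four functions above can be sketched together in one round trip; the file name data.json is just an example:

```python
import json

data = {'name': 'fishc', 'tags': ['python', '爬虫']}

# dumps: object -> JSON string (no file involved)
s = json.dumps(data, ensure_ascii=False)

# loads: JSON string -> original object
assert json.loads(s) == data

# dump: object -> JSON written straight into a file
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)

# load: read the JSON file back into the original object
with open('data.json', encoding='utf-8') as f:
    restored = json.load(f)

print(restored == data)  # True
```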
URL (Uniform Resource Locator): a network address
URL format: protocol://hostname[:port]/path[;parameters][?query]#fragment
the parts in brackets are optional
The first part is the protocol: http, https, ftp, file
The second part is the domain name or IP address of the server storing the resource (sometimes including a port number; each transmission protocol has a default port, e.g. 80 for http)
The third part is the specific location of the resource (such as a directory or file name)
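urllib.parse.urlparse can take such a URL apart into these pieces; the URL below is only an illustration:

```python
from urllib.parse import urlparse

parts = urlparse('http://www.example.com:80/path/index.html?key=value#top')
print(parts.scheme)    # protocol: http
print(parts.hostname)  # domain: www.example.com
print(parts.port)      # port: 80
print(parts.path)      # resource location: /path/index.html
print(parts.query)     # query string: key=value
print(parts.fragment)  # fragment: top
```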
urllib package (url + lib; mainly the parse and request modules)
parse module
urlencode(): converts a dictionary into a URL-encoded query string
request module
urlopen(url, data=None): opens the URL and returns a response (an http.client.HTTPResponse object)
the url parameter may be a URL string or a Request object
when data is None the request is a GET; assign data to submit a POST
The response object roughly comprises the read(), readinto(), getheader(), getheaders() and fileno() functions plus the msg, version, status, reason, debuglevel and closed attributes
read() returns the page content undecoded (a byte stream, e.g. for images)
read() combined with decode() and the matching encoding returns the decoded content (e.g. decode('utf-8') returns a string)
geturl() gets the URL, getcode() gets the status code, info() gets the response header info
Request: a richer request object that can carry headers (request header) information
request = urllib.request.Request(url=url, data=data,
                                 headers=headers, method='POST')
the data parameter must be of bytes (byte stream) type; if it is a dictionary, encode it first with urllib.parse.urlencode()
if data is empty, method defaults to GET; if data is not empty, method defaults to POST
data = urllib.parse.urlencode(a_dict).encode('utf-8')
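Putting urlencode and Request together (httpbin.org and the form field are placeholder examples; the network call itself is left commented out):

```python
import urllib.parse
import urllib.request

# encode the dictionary into a byte stream suitable for the data parameter
data = urllib.parse.urlencode({'kw': 'python'}).encode('utf-8')

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
req = urllib.request.Request(url='http://httpbin.org/post',
                             data=data, headers=headers)

# because data is not empty, the method defaults to POST
print(req.get_method())  # POST
# html = urllib.request.urlopen(req).read().decode('utf-8')
```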
Site inspection
method (in the browser's developer tools, open the Network tab and click on a request to see its method):
post (Submit): submits the processed data to the specified server
headers:
1) General:
remote address (with port number), request URL, request method (e.g. POST)
2) Request Headers:
the server can use User-Agent to decide whether the visitor is a browser (or code)
3) Form Data: the main content submitted by a POST
get (Get): requests data from the server
Hiding the crawler (modifying headers)
1) modify the headers through the Request's headers parameter
2) modify them through the Request.add_header(key, value) method
i.e. modify the User-Agent of a request
First version of hiding:
# disguise: the User-Agent value is taken from the browser
header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"}
# wrap url, headers and data into a Request object
req = request.Request(url=base_url, headers=header, data=data_str)
# open the request object, read() the page content, decode() it into a string
html = request.urlopen(req).read().decode("utf-8")
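The second way, Request.add_header(key, value), can be sketched like this (the URL is a placeholder; nothing is actually sent):

```python
import urllib.request

req = urllib.request.Request('http://httpbin.org/get')
# add the User-Agent after the Request has been created
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)')
# note: urllib normalizes header names to capitalized form internally
print(req.has_header('User-agent'))  # True
```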
Web crawling with proxies
Non-decoded version:
import urllib.request as r
response = r.urlopen('http://www.fishc.com')
response.read()
Proxies (servers with other IPs that visit the web pages and fetch the data for you)
1. build a dictionary {'type': 'proxy ip:port'}
key: the proxy type (e.g. http); value: the corresponding ip:port
proxy_support = urllib.request.ProxyHandler({})
2. create a customized opener
opener = urllib.request.build_opener(proxy_support)
access normally goes through the default opener; here we build a customized opener that uses the proxy IP to visit web pages
3. install the opener (use the proxy permanently): urllib.request.install_opener(opener)
4. or call the opener directly (use this special opener only for the pages it opens)
opener.open(url)
First version with a proxy:
import urllib.request as r
url='https://www.kuaidaili.com/free/'
proxy_support = r.ProxyHandler({'http':'58.22.177.200:9999'})
opener = r.build_opener(proxy_support)
r.install_opener(opener)
response = r.urlopen(url)
html = response.read().decode('utf-8')
print(html)
First version with proxy + hiding
opener.addheaders = [(key, value)]
e.g.: opener.addheaders = [('User-Agent', '*************')]
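Combining the proxy opener with addheaders looks like this (the proxy address is the example one above and may well be dead, so the network call is left commented out):

```python
import urllib.request as r

# example proxy address; replace it with a working one
proxy_support = r.ProxyHandler({'http': '58.22.177.200:9999'})
opener = r.build_opener(proxy_support)
# note: addheaders is assigned a whole list, not indexed
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)')]
r.install_opener(opener)
# html = r.urlopen('https://www.kuaidaili.com/free/').read().decode('utf-8')
print(opener.addheaders[0][0])  # User-Agent
```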
Download web content
First version of an image download:
import urllib.request as r
req = r.Request('http://placekitten.com/g/500/600')
response = r.urlopen(req)
cat_img = response.read()
with open('cat_500_600.jpg', 'wb') as f:
    f.write(cat_img)
# fp = open(filename) is similar to with open(filename) as fp, except that with closes the file automatically
requests module (a third-party Python library; handles URL resources more conveniently)
https://www.cnblogs.com/lei0213/p/6957508.html
requests is written in Python on top of urllib; it is an Apache2-licensed open-source library for the HTTP protocol and is more convenient than urllib
requests supports HTTP keep-alive and connection pooling, keeping a session alive with cookies, file uploads, automatic decoding of response content, and internationalized automatic encoding of URLs and POST data.
requests.get()
requests.get(url, params=None, headers=None, cookies=None, auth=None, timeout=None)
Sends a GET request and returns a Response object.
Parameters:
url - the URL for the new Request object.
params - (optional) dictionary of query parameters sent with the GET request (e.g. the data the request needs)
headers - (optional) dictionary of HTTP headers to send.
Such as: headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '}
cookies - (optional) CookieJar object of cookies to send with the request.
auth - (optional) auth object to enable basic HTTP authentication.
timeout - (optional) float describing the request timeout in seconds.
Without parameters: requests.get(url)
With parameters: requests.get(url=..., params=a_dict)  # GET request with parameters
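How params ends up in the URL can be seen without sending anything, via a prepared request (httpbin.org is only an example host):

```python
import requests

req = requests.Request('GET', 'http://httpbin.org/get',
                       params={'key1': 'value1', 'key2': 'value2'})
prepared = req.prepare()
# the dictionary is encoded into the query string automatically
print(prepared.url)  # http://httpbin.org/get?key1=value1&key2=value2
```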
Other request methods
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")
requests.post()
A: application/x-www-form-urlencoded == the most common way to POST data; data is submitted as an HTML form
url = 'http://httpbin.org/post'
data = {'key1':'value1','key2':'value2'}
r = requests.post(url, data)
B: application/json == submit data in JSON format
url_json = 'http://httpbin.org/post'
data_json = json.dumps({'key1':'value1','key2':'value2'})
# dumps: encode the Python object into a JSON string
r_json = requests.post(url_json, data_json)
C: multipart/form-data == generally used for uploading files (less common)
url = 'http://httpbin.org/post'
files = {'file': open('E://report.txt', 'rb')}
r = requests.post(url, files=files)
Response module (requests.Response)
The returned object is a requests.Response
printing the response gives: <Response [200]>
response.content vs response.text
response.text: returns the text as a unicode str; usually set the encoding to utf-8 first, otherwise the text may come out garbled
response.encoding = "utf-8"
response.content: returns binary data of type bytes (for fetching images, video, files),
which can be written out directly without decoding; decode to utf-8 when text is needed.
response.content.decode()
returns a utf-8 string (decoded with the default encoding unless one is given)
response is the response object; a POST response likewise has content
response.json()
equivalent to json.loads(response.text) (converts a JSON string into a dictionary)
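The difference between text, content and json() can be shown by building a Response by hand (setting the private _content attribute purely for illustration; normally these objects come back from requests.get):

```python
import requests

resp = requests.Response()
resp.status_code = 200
resp._content = '{"msg": "你好"}'.encode('utf-8')  # pretend server body

resp.encoding = 'utf-8'
print(resp.content)  # raw bytes
print(resp.text)     # decoded str: {"msg": "你好"}
print(resp.json())   # parsed dict: {'msg': '你好'}
```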
Intrinsic properties
cookies are used for simulated login, i.e. for keeping a session alive
# print the status (status code) of the requested page
print(type(response.status_code), response.status_code)
# print all header information of the response
print(type(response.headers), response.headers)
# print the cookies of the requested site
print(type(response.cookies), response.cookies)
# print the final URL of the request
print(type(response.url), response.url)
# print the request history (shown as a list)
print(type(response.history), response.history)
# decoding method
response.encoding
Status code 200 means normal:
# if response returns an abnormal status code, print a 404 error
if response.status_code != requests.codes.ok:
    print('404')
# if the page returns status code 200, print it
response = requests.get('http://www.jianshu.com')
if response.status_code == 200:
    print('200')
proxy
1. ordinary proxy settings
import requests
proxies = {
    "http": "http://127.0.0.1:9743",
    "https": "https://127.0.0.1:9743",
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)
2. proxy settings with a user name and password
import requests
proxies = {
    "http": "http://user:[email protected]:9743/",
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)
SOCKS proxy settings
Install the socks support module: pip3 install 'requests[socks]'
import requests
proxies = {
    'http': 'socks5://127.0.0.1:9742',
    'https': 'socks5://127.0.0.1:9742'
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)
Timeout settings
The timeout parameter sets the timeout in seconds
import requests
from requests.exceptions import ReadTimeout
try:
    # require a response within 500 ms, otherwise a ReadTimeout exception is raised
    response = requests.get("http://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
Authentication Settings
If you hit a site that requires authentication, it can be handled through the requests.auth module
import requests
from requests.auth import HTTPBasicAuth
# method 1
r = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user', '123'))
# method 2
r = requests.get('http://120.27.34.24:9001', auth=('user', '123'))
print(r.status_code)
Certificates
When requesting an https site, requests verifies the SSL certificate; if verification fails, an exception is thrown
Disable certificate verification:
import requests
# disable verification; a certificate warning will still be printed
response = requests.get('https://www.12306.cn',verify=False)
print(response.status_code)
Silence the certificate-verification warning:
from requests.packages import urllib3
import requests
urllib3.disable_warnings()
response = requests.get('https://www.12306.cn',verify=False)
print(response.status_code)
Manually specify the certificate:
response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)