A crawler (also known as a web spider or web robot) is a program that simulates a client sending network requests and receiving the responses, automatically collecting information from the Internet according to certain rules.
In principle, anything a browser (client) can do, a crawler can do.
According to the scope of the websites crawled, crawlers are divided into:
General crawler: usually refers to the crawler of a search engine
Focused crawler: a crawler targeting specific websites
Robots protocol: through the Robots protocol (robots.txt), a website tells search engines which pages may be crawled and which may not; however, it is only a moral constraint, not a technical one.
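A polite crawler can check these rules before fetching a page. Below is a minimal sketch using Python's standard-library urllib.robotparser; the robots.txt rules and the example.com URLs are made up for illustration.

```python
import urllib.robotparser

# A hypothetical robots.txt: allow everything except /admin/
robots_lines = [
    "User-agent: *",
    "Disallow: /admin/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_lines)

# Ask whether a given user agent may fetch a given URL
print(rp.can_fetch("*", "http://example.com/index.html"))  # True
print(rp.can_fetch("*", "http://example.com/admin/login"))  # False
```

In practice you would call rp.set_url("http://example.com/robots.txt") followed by rp.read() to load the real file; remember that nothing technically forces a crawler to obey it.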
A browser actively requests js, css, and other resources; the js may modify the page content and may also send further requests. What the browser finally renders in its Elements panel is the combined result of many responses: css, images, js, and the responses of the URLs they fetch.
A crawler, by contrast, requests only a single URL and receives only the response for that URL. The page a browser renders is therefore different from the page a crawler receives, so a crawler must extract its data from the raw response corresponding to the URL it requested.
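To make this concrete, here is a minimal sketch of extracting data from a raw response body using Python's standard-library html.parser. The HTML snippet is made up: note that the div "rendered-by-js" stays empty, because a crawler never executes the page's JavaScript.

```python
from html.parser import HTMLParser

# A raw HTML response body, as a crawler would receive it.
# No js has run, so only the static markup is present.
raw_html = """
<html><body>
  <a href="/page1">Page 1</a>
  <a href="/page2">Page 2</a>
  <div id="rendered-by-js"></div>
</body></html>
"""

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag found in the raw markup."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkExtractor()
parser.feed(raw_html)
print(parser.links)  # ['/page1', '/page2']
```

Real crawlers usually reach for richer extraction tools (regular expressions, XPath, or CSS selectors), but the principle is the same: the data comes from the response text, not from a rendered page.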
URL format: scheme://host[:port#]/path/…/[?query-string][#anchor]
scheme: protocol (eg: http, https, ftp)
host: IP address or domain name of the server
port: the server's port (may be omitted when it is the protocol's default, 80 for HTTP or 443 for HTTPS)
path: the path to access the resource
query-string: parameters, the data sent to the HTTP server
anchor: anchor (jump to the specified anchor position of the web page)
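The parts above can be inspected with Python's standard-library urllib.parse. A minimal sketch, using a made-up URL:

```python
from urllib.parse import urlparse

# Split a sample URL into the components listed above
parts = urlparse("https://www.example.com:8080/path/to/page?key=value#section2")

print(parts.scheme)    # https
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /path/to/page
print(parts.query)     # key=value
print(parts.fragment)  # section2
```

Note that urlparse calls the anchor part the "fragment", which is its official name in the URL standard.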
HTTP: Hypertext Transfer Protocol, default port number: 80
HTTPS: HTTP + SSL/TLS (Secure Sockets Layer / Transport Layer Security), default port number: 443 (HTTPS is more secure than HTTP, but slightly slower because of the encryption overhead)
Common HTTP request headers
Host (host and port number)
Connection (connection type)
Upgrade-Insecure-Requests (upgrade to HTTPS requests)
User-Agent (client identification string, usually the browser name and version)
Accept (content types the client can accept)
Referer (the page the request came from)
Accept-Encoding (compression encodings the client accepts)
Cookie (cookies sent back to the server)
X-Requested-With: XMLHttpRequest (marks an Ajax asynchronous request)
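A crawler typically sets several of these headers itself so the request looks like it came from a browser. A minimal sketch using the standard-library urllib.request; the header values and example.com URLs are placeholders, and in practice you would copy real values from your browser's developer tools:

```python
import urllib.request

# Build a request carrying typical crawler headers.
# All values here are examples, not real browser fingerprints.
req = urllib.request.Request(
    "http://example.com/",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Encoding": "gzip, deflate",
        "Referer": "http://example.com/previous-page",
    },
)

# Nothing has been sent yet; we can inspect the headers that would go out.
# urllib normalizes header names to "Capitalized" form internally.
print(req.get_header("User-agent"))
print(req.get_header("Referer"))
```

The requests library accepts the same idea via requests.get(url, headers={...}).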
Common Request Methods
GET
POST
Some other notes about GET requests:
GET requests can be cached
GET requests remain in the browser history
GET requests can be bookmarked
GET requests should not be used when handling sensitive data
GET requests have a length limit
GET requests should only be used to retrieve data
Some other notes about POST requests:
POST requests are not cached
POST requests are not kept in browser history
POST requests cannot be bookmarked
POST requests have no restriction on data length
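The differences above come down to where the data travels. A minimal offline sketch using the standard-library urllib.parse (the example.com URL and parameters are made up):

```python
from urllib.parse import urlencode

params = {"q": "web crawler", "page": "2"}

# GET: parameters are appended to the URL as a query string.
# This is why they show up in history/bookmarks and hit length limits.
get_url = "http://example.com/search?" + urlencode(params)
print(get_url)  # http://example.com/search?q=web+crawler&page=2

# POST: the same data travels in the request body instead,
# so it is not visible in the URL and has no length limit.
post_body = urlencode(params).encode("utf-8")
print(post_body)  # b'q=web+crawler&page=2'
```

With the requests library, the same distinction appears as requests.get(url, params=...) versus requests.post(url, data=...).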
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests

# The target page to fetch
targetUrl = "http://ip.hahado.cn/ip"

# Proxy server
proxyHost = "ip.hahado.cn"
proxyPort = "39010"

# Proxy tunnel authentication credentials
proxyUser = "username"
proxyPass = "password"

proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
    "host": proxyHost,
    "port": proxyPort,
    "user": proxyUser,
    "pass": proxyPass,
}

# Route both http and https traffic through the same proxy
proxies = {
    "http": proxyMeta,
    "https": proxyMeta,
}

resp = requests.get(targetUrl, proxies=proxies)
print(resp.status_code)
print(resp.text)