urllib in practice ---- using a proxy server to crawl web pages (021)

One: Proxy servers

A proxy server sits between us and the Internet. When we browse through a proxy, our request goes first to the proxy server; the proxy then fetches the information from the target site and returns it to us.

With direct access, the target server returns information straight to the user. That is the easiest way to get data, but if your IP address makes many requests in a short period, the server may decide the traffic is malicious and block that IP, and from then on nothing can be crawled from it. Normally an IP is fixed. With dial-up (APN) access you get a new IP each time you dial, but that has its own problems: dialing costs money, and the dial-up IPs fall in similar ranges, so sites often block the whole batch of addresses at once.

A proxy server changes this picture. You send the request to the proxy, the proxy fetches the page from the Internet, the Internet returns it to the proxy, and the proxy returns it to us. The IP address the target site sees is the proxy's, not ours, so by rotating through multiple different proxy servers we can crawl many pages without our own IP being blocked.
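The mechanism above can be sketched with the standard library's `urllib.request`. `ProxyHandler` takes a dictionary mapping a URL scheme to a proxy address; the address below is the article's example and is likely dead by now, so substitute a live one from a proxy list before actually opening a URL.

```python
from urllib import request

# Example proxy address (from the article); replace with a live proxy.
proxy_addr = "119.28.112.130:3128"

# Map the "http" scheme to the proxy host:port.
proxy = request.ProxyHandler({"http": proxy_addr})

# Build an opener that routes http:// requests through the proxy.
opener = request.build_opener(proxy)
# opener.open("http://www.baidu.com") would now go via the proxy.
```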

Two: In practice

(1) Find a proxy IP; http://www.xicidaili.com lists a large number of proxy IPs.


The code to crawl a Baidu web page:


Note: if the connection fails because the proxy server's address is no longer valid, refresh the proxy list and obtain a new proxy IP.
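Since free proxies die frequently, it helps to try a list of proxies in turn and fall back to the next one on failure. This is a sketch, not part of the original article; the `fetch` parameter is a hypothetical hook added so the fallback logic can be exercised without a live network.

```python
from urllib import request


def fetch_with_fallback(url, proxy_list, fetch=None):
    """Try each proxy in turn; return the first successful page body."""
    if fetch is None:
        def fetch(url, proxy_addr):
            opener = request.build_opener(
                request.ProxyHandler({"http": proxy_addr}))
            return opener.open(url, timeout=10).read().decode("utf-8", "ignore")
    last_err = None
    for proxy_addr in proxy_list:
        try:
            return fetch(url, proxy_addr)
        except OSError as e:  # URLError subclasses OSError in Python 3
            last_err = e      # dead proxy -- move on to the next one
    raise RuntimeError("all proxies failed") from last_err
```

If every proxy in the list fails, the last error is re-raised so the caller knows to refresh the proxy list.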


Three: The full code, cleaned up:

from urllib import request

# Define a reusable function that implements IP proxying, so it can be
# called directly in the future.
# ProxyHandler takes a dictionary mapping scheme to proxy address;
# http://www.xicidaili.com lists a large number of proxy IPs.
# (80 is the default HTTP port.)
proxy_addr = "119.28.112.130:3128"
url = "http://www.baidu.com"

def use_proxy(url, proxy_addr):
    # Register the proxy IP
    proxy = request.ProxyHandler({"http": proxy_addr})
    opener = request.build_opener(proxy, request.HTTPHandler)
    # Install the opener globally so urlopen() routes through the proxy
    request.install_opener(opener)
    data = request.urlopen(url).read().decode("utf-8", "ignore")
    return data

# Call the function
data = use_proxy(url, proxy_addr)
print(len(data))
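A design note: `install_opener` changes the process-wide default, so every later `urlopen()` call in the program also goes through the proxy. If that is not wanted, a variant (my sketch, not the article's code) can use a private opener instead:

```python
from urllib import request


def use_proxy_local(url, proxy_addr):
    # Build a private opener rather than installing it globally, so the
    # rest of the process keeps the default (direct) connection.
    opener = request.build_opener(request.ProxyHandler({"http": proxy_addr}))
    return opener.open(url, timeout=10).read().decode("utf-8", "ignore")
```

This keeps the proxy scoped to one call, which matters when rotating different proxies across different requests in the same program.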
