Python3 crawler series 19: anti-scraping with random User-Agents and an IP proxy pool

We've already covered several crawler topics, including speeding crawlers up with process pools and the like. But that extra speed brings some trouble of its own.

1. Introduction
For instance, as our crawler gets faster and faster, we often find that suddenly the data can't be scraped anymore. Print the response and take a look.

The site no longer returns data; instead it throws back a canned message.

Look familiar?

Think about it: how does a normal visitor access a website? Every request they send carries request headers (request.headers).

So when you hammer someone's site with requests, the site administrator starts to feel something is off. They inspect the request headers and are startled to see something like this:

Host: 127.0.0.1:3369
User-Agent: python-requests/3.21.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Notice this line:

User-Agent: python-requests/3.21.0

It says outright that a Python library made the request. You're completely exposed; if they ban anyone, it's you.

So what do you do? Disguise yourself.

Python can pass itself off as a browser by modifying the request headers it sends.

In short, make your Python crawler pretend to be a browser.
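
To see the difference, here is a minimal before/after sketch (my own illustration; it uses httpbin.org, a public header-echo service, and an example Chrome UA string, neither of which is from the original article):

import requests

# Before: the default User-Agent exposes the library.
print(requests.get('https://httpbin.org/headers').text)  # shows python-requests/x.y.z

# After: a browser-style User-Agent (an example Chrome signature).
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
print(requests.get('https://httpbin.org/headers', headers=headers).text)  # now looks like Chrome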

2. Which header fields should be disguised?
To make a Python crawler pass as a browser, we have to fake the headers. Headers contain many fields, so which ones should we focus on?

Two header fields are highly recommended for crawlers: include a User-Agent with every request, and add a Referer when the target requires it; otherwise it can be omitted. (What's a Referer? Explained below.)


The important fields are explained as follows:

Request headers:
• "I am human!" - User-Agent: holds the OS and browser version; modify it to pass as a real person.

• "Here's where I came from" - Referer: tells the server which URL referred you here, so you don't appear out of thin air; some sites check it.

• "Cookies!" - bring cookies: sometimes the response differs with and without them, so use cookies to "bribe" the server into handing over the complete content.

3. Header disguise: random User-Agent
Anti-crawler mechanism: many sites check the User-Agent in the headers, and some also check the Referer (some sites use the Referer for image hotlink protection).

Random User-Agent generation: produce a random User-Agent for each request, so you look like many different browsers.

(The code is ready; copy it and go.)

#!/usr/bin/python3

# @Readme: header disguise against anti-scraping

# For sites that detect crawler headers
from fake_useragent import UserAgent  # install with: pip install fake-useragent

ua = UserAgent()  # instantiating needs network access and the backing site can be unstable, so it may take a while

# 1. Generate request headers for specific browsers
print(ua.ie)
print(ua.opera)
print(ua.chrome)
print(ua.google)
print(ua.firefox)
print(ua.safari)
# Print a random browser User-Agent
print(ua.random)
print('done')

# 2. In real work, ua.random is the one to use
import requests

ua = UserAgent()
print(ua.random)  # randomly generated

headers = {
    'User-Agent': ua.random  # the disguise
}

# send the request
url = 'https://www.baidu.com/'
response = requests.get(url, headers=headers)
print(response.status_code)

Disguising the Referer:

If you want to scrape images and they are hotlink-protected, you must set the Referer.

headers = {'User-Agent': ua.random, 'Referer': 'the main page the image belongs to'}
For hotlink-protected images, the general approach is: scrape all the .jpg addresses -> store them in a list -> iterate over the image addresses -> open a file in 'wb' mode and write, deriving each file name dynamically from the image address. (See the sketch after the next paragraph.)

That said, if your target site isn't strict about images, you won't need this at all.
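
Here is a minimal sketch of that idea, assuming hypothetical image URLs and a placeholder Referer (the page the images are embedded in):

import requests
from fake_useragent import UserAgent

ua = UserAgent()
# hypothetical list of scraped image addresses
image_urls = ['https://example.com/img/1.jpg', 'https://example.com/img/2.jpg']
# Referer = the page the images are embedded in (placeholder)
headers = {'User-Agent': ua.random, 'Referer': 'https://example.com/gallery'}

for url in image_urls:
    resp = requests.get(url, headers=headers)
    filename = url.split('/')[-1]  # file name derived from the image address
    with open(filename, 'wb') as f:
        f.write(resp.content)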

4. IP banned? Use an IP proxy
Sometimes header disguise alone isn't enough: even with a random User-Agent, all the requests come from the same IP address. Visit too many times and that IP gets banned, and you'll need other IPs to keep going.

4.1 Proxy access with requests
First, an example of proxy access with Python's requests library.

Accessing a website through proxy IPs works as follows.

First define your proxy IPs:

proxies = {
    'http': 'http://xx.xxx.xxx.xxx:xxxx',
    'https': 'http://xxx.xx.xx.xxx:xxx',
    # ...
}
Then request the page with requests plus the proxies:

response = requests.get(url, proxies=proxies)
Now your requests go out through the proxy addresses you defined.
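
To confirm the proxy is really being used, you can hit an IP-echo service (my own sketch; httpbin.org/ip simply reports the IP it saw, and the proxy address below is a placeholder):

import requests

proxies = {
    'http': 'http://xx.xxx.xxx.xxx:xxxx',   # placeholder proxy
    'https': 'http://xx.xxx.xxx.xxx:xxxx',
}
response = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5)
print(response.text)  # should print the proxy's IP, not yours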

But each machine has only one IP address; where do you get a pile of them to use?

There are plenty of free proxies online, but they are very unstable. If you have the budget, just buy proxy IPs.


4.2 No budget? Then build an IP proxy pool
If you don't want to spend money but still want proxy IPs for your crawler, the only option is an IP proxy pool.

4.2.1 Self-built proxy IP pool: a multi-threaded crawler
Collect the free IPs published online and build your own proxy IP pool.

A Python program scrapes lots of free proxy IPs from the web and periodically tests which ones still work; the next time you need a proxy IP, you just grab one from your own pool.

In short: visit a free proxy site -> extract IPs and ports with regex/XPath -> test which IPs work -> save the working ones -> use them for crawling -> discard IPs once they expire.

Do this with multiple threads or asynchronously, because testing proxies is a very slow process.

Here is a multi-threaded crawler that harvests proxy IPs from the Xici proxy site (not written by me):

From: https://www.jianshu.com/p/2daa34a435df

#!/usr/bin/python3

# @Readme: an IP proxy lets you visit a site from another IP address
# (scrape too often and your own IP gets banned)

# Build the proxy IP pool with multiple threads.
import os
import threading

import requests
from bs4 import BeautifulSoup
from urllib import request, error
from fake_useragent import UserAgent

# make sure both files exist before opening them for reading
for fname in ('proxy.txt', 'verified.txt'):
    if not os.path.isfile(fname):
        open(fname, mode='w', encoding='utf-8').close()

inFile = open('proxy.txt')          # stores the scraped IPs
verifiedtxt = open('verified.txt')  # stores the IPs verified as usable

lock = threading.Lock()


def getProxy(url):
    # open the txt file we created
    proxyFile = open('proxy.txt', 'a')

    # disguise
    ua = UserAgent()
    headers = {
        'User-Agent': ua.random
    }

    # page is how many listing pages to fetch; here we take pages 1 to 9
    for page in range(1, 10):
        # by inspection, base URL + page number is the URL we need (page must be converted to str)
        urls = url + str(page)
        # fetch the page source with requests
        rsp = requests.get(urls, headers=headers)
        html = rsp.text
        # parse the html page with BeautifulSoup
        soup = BeautifulSoup(html, 'html.parser')
        # the data sits in the tr tags of the table whose id is ip_list
        trs = soup.find('table', id='ip_list').find_all('tr')  # this is a list
        # loop over the rows, skipping the header row
        for item in trs[1:]:
            # every tr holds several td tags
            tds = item.find_all('td')
            # some img tags are empty, so a check is needed here
            if tds[0].find('img') is None:
                nation = 'unknown'
                locate = 'unknown'
            else:
                nation = tds[0].find('img')['alt'].strip()
                locate = tds[3].text.strip()
            # pull each field out of the td list
            ip = tds[1].text.strip()
            port = tds[2].text.strip()
            anony = tds[4].text.strip()
            protocol = tds[5].text.strip()
            speed = tds[6].find('div')['title'].strip()
            time = tds[8].text.strip()
            # write the record to the txt file in a fixed format for easy parsing later
            proxyFile.write('%s|%s|%s|%s|%s|%s|%s|%s\n' % (nation, ip, port, locate, anony, protocol, speed, time))
    proxyFile.close()  # flush the records so the verifier threads can read them


def verifyProxyList():
    verifiedFile = open('verified.txt', 'a')

    while True:
        lock.acquire()
        ll = inFile.readline().strip()
        lock.release()
        if len(ll) == 0: break
        line = ll.strip().split('|')
        ip = line[1]
        port = line[2]
        realip = ip + ':' + port
        code = verifyProxy(realip)
        if code == 200:
            lock.acquire()
            print("---Success: " + ip + ":" + port)
            verifiedFile.write(ll + "\n")
            lock.release()
        else:
            print("---Failure: " + ip + ":" + port)


def verifyProxy(ip):
    '''
    Check that a proxy actually works.
    '''
    ua = UserAgent()
    requestHeader = {
        'User-Agent': ua.random
    }
    url = "http://www.baidu.com"
    # fill in the proxy address
    proxy = {'http': ip}
    # create a ProxyHandler
    proxy_handler = request.ProxyHandler(proxy)
    # create an opener
    proxy_opener = request.build_opener(proxy_handler)
    # install the opener
    request.install_opener(proxy_opener)

    try:
        req = request.Request(url, headers=requestHeader)
        rsq = request.urlopen(req, timeout=5.0)
        code = rsq.getcode()
        return code
    except error.URLError as e:
        return e


if __name__ == '__main__':
    # start each run with freshly emptied files
    tmp = open('proxy.txt', 'w')
    tmp.write("")
    tmp.close()
    tmp1 = open('verified.txt', 'w')
    tmp1.write("")
    tmp1.close()
    # crawl the Xici proxy listings for candidate IPs
    getProxy("http://www.xicidaili.com/nn/")
    getProxy("http://www.xicidaili.com/nt/")
    getProxy("http://www.xicidaili.com/wn/")
    getProxy("http://www.xicidaili.com/wt/")

    # verify the candidates with 30 threads
    all_thread = []
    for i in range(30):
        t = threading.Thread(target=verifyProxyList)
        all_thread.append(t)
        t.start()

    for t in all_thread:
        t.join()

    inFile.close()
    verifiedtxt.close()

(The code works as-is; it's posted here for reference, and you can adapt it yourself later.)

It runs fine:


But few of the harvested IPs actually work, and the ones that do die quickly:

So this approach is not recommended.

4.2.2 Open-source IP proxy pool: the asynchronous async-proxy-pool
async-proxy-pool is an open-source asynchronous crawler proxy pool built on Python's asyncio, designed to exploit Python's asynchronous performance.

Compared with synchronous processing, asynchronous processing can be hundreds of times more efficient.
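
To give a feel for the async approach, here is a minimal sketch of concurrent proxy checking (my own illustration, not async-proxy-pool's actual code; it assumes the third-party aiohttp package and uses placeholder proxies):

import asyncio
import aiohttp

async def check(session, proxy):
    # try the proxy against a test page with a short timeout
    try:
        async with session.get('http://www.baidu.com',
                               proxy='http://' + proxy,
                               timeout=aiohttp.ClientTimeout(total=5)) as resp:
            return proxy, resp.status == 200
    except Exception:
        return proxy, False

async def main(proxies):
    async with aiohttp.ClientSession() as session:
        # all proxies are checked concurrently instead of one by one
        results = await asyncio.gather(*(check(session, p) for p in proxies))
    for proxy, ok in results:
        print(proxy, 'OK' if ok else 'dead')

asyncio.run(main(['1.2.3.4:8080', '5.6.7.8:3128']))  # placeholder proxies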

Download and tutorial:
async-proxy-pool

Runtime environment

The project uses sanic, an asynchronous web framework. Python 3.5+ is recommended, and sanic does not support Windows; Windows users (like poor me) can consider Ubuntu on Windows.

That makes me sad: it doesn't support Windows. If you're not on Windows, just follow its official docs. (I suspect most readers are on Windows, so I won't cover it here.)

4.2.3 Open-source IP proxy pool: ProxyPool (strongly recommended)
Think thread pool, process pool; a proxy pool is the same idea, right?
ProxyPool is a good open-source IP proxy pool I found. It works on Windows, needs at least Python 3.5, and requires a running Redis service.

A ready-made proxy pool; why not use it?

ProxyPool Download:

https://github.com/Python3WebSpider/ProxyPool.git

(You can download it manually or pull it with git.)

1. Using ProxyPool:

First pull the source code to your machine with git clone:

git clone https://github.com/Python3WebSpider/ProxyPool.git
2. Then open the project's setting.py; related information can be configured there, such as the Redis address and password.

3. Enter the proxypool directory and edit setting.py: set PASSWORD to your Redis password, or None if there is none. (A fresh Redis install usually has no password.)

(If you don't have Redis yet, go download and install it, then come back.)

(The rest assumes Redis is already installed.)

4. Then cd into the directory you cloned (wherever ProxyPool lives on your machine).

5. Install the required dependencies (with pip or pip3):

pip install -r requirements.txt
(If you import ProxyPool into PyCharm, you can do all of this inside PyCharm.)

The packages it needs are the ones listed in requirements.txt.


6. Next, start your Redis service:

Open a cmd window and run: redis-server.exe
This starts the Redis server. Redis's default port is 6379.


7. Then run run.py.

You can run it from a cmd window, or import the project into PyCharm and run it there.



8. With run.py running, open your Redis management tool (or the redis-cli) and look inside: Redis now holds lots of scraped proxy IPs:
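
You can also peek at the stored proxies from Python with the redis package (a hedged sketch: the key name 'proxies' is my assumption based on the project's docs; check setting.py for the actual key and fill in your own password):

import redis

r = redis.StrictRedis(host='localhost', port=6379, password=None,
                      decode_responses=True)
print(r.zcard('proxies'))                      # number of stored proxies
print(r.zrangebyscore('proxies', 0, 100)[:5])  # a few of them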


9. Keep the project running (don't stop it). With the IPs held in Redis, you can now query the proxy pool.

In the console output you can see the line

Running on http://0.0.0.0:5555/ (Press CTRL + C to quit)

which tells us the URL for fetching a random proxy.
10. Fetch a random proxy IP address in your browser:

Enter in the browser:

http://0.0.0.0:5555/random

Each visit returns one random proxy IP.



11. Fetch a random proxy IP from code

Like this:

import requests

# fetch a random proxy IP from the pool
PROXY_POOL_URL = 'http://localhost:5555/random'

def get_proxy():
    try:
        response = requests.get(PROXY_POOL_URL)
        if response.status_code == 200:
            return response.text
    except ConnectionError:
        return None

if __name__ == '__main__':
    print(get_proxy())
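
A follow-on sketch (my own): feed the proxy fetched by get_proxy() above into an actual request, together with a random User-Agent:

import requests
from fake_useragent import UserAgent

proxy = get_proxy()  # e.g. '123.45.67.89:8080' (assumes the pool returned one)
proxies = {
    'http': 'http://' + proxy,
    'https': 'http://' + proxy,
}
headers = {'User-Agent': UserAgent().random}
response = requests.get('https://www.baidu.com/', headers=headers, proxies=proxies)
print(response.status_code)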


All right, that's it.

Using an IP proxy pool is currently the best option: free and efficient~

Fixing an install error
If, during installation, you hit an error like:

AttributeError: 'int' object has no attribute 'items'

update the package in question to a matching version, for example the redis package:

pip install redis==3.33.1
Well, with that we have a proxy pool full of proxy IPs. No more fear of IP bans, because we have plenty of IPs to use.

(^_^)/~~ Bye
----------------
Disclaimer: This is an original article by CSDN blogger "csdnzoutao", released under the CC 4.0 BY-SA license. Please include the original source link and this statement when reposting.
Original link: https://blog.csdn.net/ITBigGod/article/details/103248172
