[For every measure, a countermeasure] How Python crawlers deal with website anti-crawling strategies

Contents

1. The core in one sentence

2. Anti-anti-crawling techniques I often use

2.1 Simulating request headers

2.2 Forging request cookies

2.3 Random wait intervals

2.4 Using proxy IPs

2.5 Cracking verification codes

3. Write your crawler well and eat prison food until you're full?


Regarding how crawlers deal with anti-crawling measures, I recently sorted out some experience; I'm writing it down here as a record for future review.

1. The core in one sentence

There are all kinds of strategies for dealing with anti-crawling, but they all boil down to one core sentence:

"The more a crawler behaves like a human, the less likely it is to be detected by anti-crawling measures."

2. Anti-anti-crawling techniques I often use

2.1 Simulating request headers

In the request headers, the most critical item is User-Agent. You can build an agent_list and pick one User-Agent at random for each request, like this:

agent_list = [
	"Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
	"Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
	"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
	"Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
	"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
	"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
	"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
	"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
	"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
	"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
	"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
	"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
	"Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
	"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10"
	]

When calling, just pick one at random:

'User-Agent': random.choice(agent_list)
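
A minimal sketch of how this fits into a full request (using the requests library; the URL below is just a placeholder):

import random
import requests

url = 'https://example.com'  # placeholder target URL
headers = {'User-Agent': random.choice(agent_list)}  # pick a random UA for this request
r = requests.get(url, headers=headers, timeout=10)
print(r.status_code)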

Of course, you can also use fake_useragent (a random-UA library maintained by others), but sometimes it doesn't work well and usually reports this kind of error:

fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached

Don't panic when you encounter this error. The quick-and-dirty fix is to download fake_useragent's JSON data to your local machine, so that it is loaded locally instead of being fetched from the remote server, which avoids the timeout error.

JSON download address: https://fake-useragent.herokuapp.com/browsers/0.1.11
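
Once the JSON is saved locally, you can even skip fake_useragent and pick a UA from the file yourself. A minimal sketch (it assumes the downloaded file keeps a "browsers" field mapping browser names to lists of UA strings, and that the local filename is whatever you saved it as; verify against the file you actually download):

import json
import random

# load the fake_useragent data that was downloaded to a local file beforehand
with open('fake_useragent_0.1.11.json', 'r', encoding='utf-8') as f:
	ua_data = json.load(f)

# flatten all browsers' UA lists into one pool and pick one at random
all_agents = [ua for agents in ua_data['browsers'].values() for ua in agents]
headers = {'User-Agent': random.choice(all_agents)}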

2.2 Forging request cookies

When sending a request, add a "cookie" item to the request headers to create the illusion that you are logged in.

Where do you get the "cookie" value? Press F12 to open the browser developer tools, find Headers -> Request Headers for the target address, locate the cookie item, and copy its value.

[Screenshot: viewing the cookie value of a webpage]

Put it into the request headers of your crawler code, like this:

[Screenshot: pasting the cookie value into the crawler code]
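
In code it looks roughly like this (a minimal sketch; the cookie string below is only a placeholder, paste the value you copied from the developer tools):

import random
import requests

headers = {
	'User-Agent': random.choice(agent_list),
	'Cookie': 'sessionid=xxxx; token=yyyy',  # placeholder: replace with the cookie value copied via F12
}
r = requests.get('https://example.com', headers=headers, timeout=10)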

2.3 Random wait intervals

After each request, sleep for a random amount of time, like this:

time.sleep(random.uniform(0.5, 1))  # wait a random fraction of a second between 0.5 and 1

Try not to use fixed integer waits like sleep(1) or sleep(3); they look like a machine at first glance.
Again: make the crawler behave more like a human!
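
Putting the pieces so far together, a minimal crawl loop might look like this (url_list is just a placeholder; parsing is omitted):

import time
import random
import requests

url_list = ['https://example.com/page1', 'https://example.com/page2']  # placeholder target pages
for url in url_list:
	headers = {'User-Agent': random.choice(agent_list)}
	r = requests.get(url, headers=headers, timeout=10)
	# ... parse r.text here ...
	time.sleep(random.uniform(0.5, 1))  # random pause so the request rhythm doesn't look mechanical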

2.4 Using proxy IPs

Use proxy IPs to get around anti-crawling. (Free proxies are unreliable; it's best to use a paid service. Some charge by number of requests, some by duration; choose based on your own situation.)
What does this mean? It means every request appears to come from a different region: the first time my IP is in Hebei, the second time in Guangdong, the third time in the United States, and so on. Like this:

import json
import random
import requests

def get_ip_pool():
	"""Fetch a proxy IP from the proxy provider's API."""
	url_api = 'API address of your proxy provider'  # placeholder
	try:
		r = requests.get(url_api)
		res_text = r.text
		res_status = r.status_code
		print('Proxy API status code:', res_status)
		print('Response body:', res_text)
		res_json = json.loads(res_text)
		ip_pool = random.choice(res_json['RESULT'])
		ip = ip_pool['ip']
		port = ip_pool['port']
		ret = str(ip) + ':' + str(port)
		print('Got proxy IP -> ', ret)
		return ret
	except Exception as e:
		print('get_ip_pool except:', str(e))

proxies = get_ip_pool()  # call the function to get a proxy IP
requests.get(url=url, headers=headers, proxies={'https': proxies})  # send the request through the proxy; url and headers are defined elsewhere

This way, the target server thinks the visits come from many different regions, and even frequent access is less likely to get you blocked!

2.5 Cracking verification codes

For cracking verification codes, I suggest reading "Python3 Web Crawler Development Practice" by Cui Qingcai.

Chapter 8, "Verification Code Recognition", covers cracking four types of verification codes:

  • 8.1 Recognizing graphic verification codes
  • 8.2 Recognizing GeeTest sliding verification codes
  • 8.3 Recognizing click verification codes
  • 8.4 Recognizing Weibo nine-grid verification codes

In section 8.3, the author mentions the third-party captcha-solving platform Chaojiying ("Super Eagle"), which I also used in the case below.
With a third-party captcha-solving platform you just call its API, which saves a lot of worry and effort.
To crack Google's reCAPTCHA, I used this approach:

Call Chaojiying's image-recognition method. The general idea is:

  1. Locate the captcha image element that pops up on the page and save a screenshot of it locally.
  2. Use the PIL library to scale, crop and save the image to meet the platform's image-size requirements.
  3. Send the processed image to the platform's server through its API. When recognition succeeds, the platform returns coordinate pairs; use Python's selenium library to click the corresponding coordinates one by one to complete automatic captcha recognition. (Some logic is needed here: if the platform returns a wrong answer, the click operation has to be retried until verification succeeds.) A sketch of this clicking step follows after the recognition code below.


Here is the Python code:

from PIL import Image  # pip install pillow
# Chaojiying_Client, cjy_username, cjy_password and cjy_soft_id come from Chaojiying's official Python demo and your own account

def f_solve_captcha(v_infile, offset_x, offset_y, multiple=0.55):
	"""
	Recognize a captcha with Chaojiying (Super Eagle)
	:param v_infile: captcha image file
	:param offset_x: x-axis offset
	:param offset_y: y-axis offset
	:param multiple: image shrink factor
	:return: list of recognized coordinates
	"""
	outfile = 'new-' + v_infile
	# 1. Shrink the image to Chaojiying's requirements: width <= 460px, height <= 310px
	img = Image.open(v_infile)
	w, h = img.size
	w, h = round(w * multiple), round(h * multiple)  # drop the fractional part to avoid errors
	img = img.resize((w, h), Image.ANTIALIAS)
	img.save(outfile, optimize=True, quality=85)  # quality 85 gives the best result
	print('pic smaller done!')
	# 2. Call Chaojiying to recognize the image
	chaojiying = Chaojiying_Client(cjy_username, cjy_password, cjy_soft_id)
	im = open(outfile, 'rb').read()
	ret = chaojiying.PostPic(im, 9008)
	print(ret)
	loc_list2 = []
	if ret['err_no'] == 0:  # return code 0 means success
		loc_list = ret['pic_str'].split('|')
		for loc in loc_list:
			loc_x = round(int(loc.split(',')[0]) / multiple)
			loc_y = round(int(loc.split(',')[1]) / multiple)
			loc_list2.append((loc_x + offset_x, loc_y + offset_y))
		print('Chaojiying returned success, loc_list2 is:')
		print(loc_list2)
		print('length: {}'.format(len(loc_list2)))
	else:
		print('Chaojiying returned an error! code: {}, message: {}'.format(ret['err_no'], ret['err_str']))
	return loc_list2
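
For the clicking step in point 3 above, here is a minimal sketch with selenium (it assumes driver is an already-initialized WebDriver and captcha_element is the located captcha image element; note that how move_to_element_with_offset interprets the offset differs between selenium 3 and 4, so the coordinates may need adjusting for your page):

import random
import time
from selenium.webdriver.common.action_chains import ActionChains

loc_list = f_solve_captcha('captcha.png', offset_x=0, offset_y=0)
for x, y in loc_list:
	# move to the captcha element plus the returned offset, then click
	ActionChains(driver).move_to_element_with_offset(captcha_element, x, y).click().perform()
	time.sleep(random.uniform(0.5, 1))  # pause between clicks to look less mechanical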


If any crawler experts pass by, please share a better trick; I feel this method is a bit clumsy~


3. Write your crawler well and eat prison food until you're full?


Whether crawler development violates the law has always been controversial. As technical people, we must always remind ourselves of what can and cannot be crawled, and keep a clear line in our minds:

  1. Before crawling, spend 10 precious seconds looking at the target site's robots.txt. If it clearly says User-agent: * Disallow: /, then crawling it anyway is on you, isn't it? If you don't understand robots syntax, see: https://developers.google.com/search/reference/robots_txt?hl=zh-cn (a small sketch of checking robots.txt programmatically follows after this list).
  2. Do not crawl sensitive or private data.
  3. Don't use the data commercially; do the data analysis for yourself, just for practice.
  4. If the target website offers a public API, just use it directly instead of writing your own crawler. The front door is wide open; why climb in through the window? Professional habit??
  5. Take it easy when crawling: don't affect normal access to the target server, let alone bring it down!
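
For point 1, a minimal sketch of checking robots.txt programmatically with Python's standard-library robotparser (the URLs are just placeholders):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')  # placeholder target site
rp.read()
# check whether a generic crawler ('*') may fetch a given page
print(rp.can_fetch('*', 'https://example.com/some/page'))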

As programmers, we write code to make the world a better place, and we should be law-abiding good citizens!

peace~

This article is also published on my WeChat official account:

https://mp.weixin.qq.com/s?__biz=MzU5MjQ2MzI0Nw==&mid=2247484238&idx=1&sn=e68f02ba613b0eea88da40e013d2f756&chksm=fe1e17aec9699eb81c0063dc0f259425cd3316deb1507ac2b0854426dbf9d10c7ee38c12e948&token=928226833&lang=zh_CN#rd


I'm Ma Ge, with tens of thousands of followers across the web. Welcome to exchange Python techniques together.

Search for "Ma Ge python says" (马哥python说) on Zhihu, Bilibili, Xiaohongshu, and Sina Weibo.


Original article: blog.csdn.net/solo_msk/article/details/124237459