Crawlers versus anti-crawling: Python crawlers, selenium and phantomjs

First, crawlers versus anti-crawling: anti-crawling policies
1, anti-crawling policies
(1) The server checks the User-Agent client identifier to decide whether the request comes from a crawler.
Solution: wrap the request with a header that sets User-Agent.
(2) IP banning
Solution: route requests through proxy IPs.
(3) The server checks the access frequency to decide whether the requests come from a human.
Solution: set a crawl interval and a crawling strategy.
(4) Verification codes (CAPTCHAs)
Solution: recognize the verification code.
(5) The page data is no longer rendered directly; it is fetched asynchronously by js from a remote source.
Solution:
a. Fetch the data with selenium + phantomjs.
b. Find the interface (Ajax interface) that the data comes from.
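
Solutions (1) to (3) can be sketched with the Python standard library alone. This is a minimal sketch, not the author's code: the User-Agent string, the function names (`build_request`, `make_opener`, `polite_delay`, `fetch`) and the delay bounds are all assumptions chosen for illustration.

```python
import random
import time
import urllib.request

# Assumed browser-like identifier; replace it with whatever your target expects.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url, user_agent=USER_AGENT):
    """(1) Wrap the URL in a Request that carries a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def make_opener(proxy=None):
    """(2) Return an opener that sends traffic through `proxy` if one is given."""
    if proxy is None:
        return urllib.request.build_opener()
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


def polite_delay(min_seconds=1.0, max_seconds=3.0):
    """(3) Pick a random pause so requests are not evenly spaced."""
    return random.uniform(min_seconds, max_seconds)


def fetch(url, proxy=None):
    """Wait a random interval, then download the page body as bytes."""
    time.sleep(polite_delay())
    opener = make_opener(proxy)
    with opener.open(build_request(url), timeout=10) as response:
        return response.read()
```

A call such as `fetch("http://example.com/", proxy="http://127.0.0.1:8080")` would combine all three measures; the proxy address here is a placeholder, not a working proxy.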
2, page technologies
(1) js: a programming language.
It can get elements of the page and operate on them.
It can request data over the network.
(2) jquery: a js library that makes js programming easier.
(3) ajax
Synchronous and asynchronous requests.
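
When a page loads its data through an Ajax interface, the crawler can often call that interface directly and read the response, which is commonly JSON. The sketch below assumes a hypothetical payload shaped like `{"items": [{"title": ...}]}`; the real field names depend entirely on the site, and `fetch_json` and `extract_titles` are illustrative names.

```python
import json
import urllib.request


def fetch_json(url, user_agent="Mozilla/5.0"):
    """Call an Ajax interface and decode its JSON response."""
    headers = {
        "User-Agent": user_agent,
        # Many Ajax calls send this header; whether a site checks it is an assumption.
        "X-Requested-With": "XMLHttpRequest",
    }
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_titles(payload):
    """Collect the 'title' of every entry in the assumed 'items' list."""
    return [item["title"] for item in payload.get("items", [])]
```

The interface URL itself is usually found in the browser's developer tools, under the network panel, by watching which requests the page sends while it loads.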

Second, what is selenium?

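	selenium is a web automation testing tool. It does not include a browser of its own; it acts like a driver, letting us automatically operate external applications that provide browser functionality.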
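
A minimal sketch of this driver role, assuming selenium is installed and chromedriver can be found on the PATH. `webdriver.Chrome`, `get`, `page_source` and `quit` come from selenium's Python bindings; the function name `fetch_rendered_html` and the `driver_factory` parameter are hypothetical, added so the browser can be swapped out.

```python
def fetch_rendered_html(url, driver_factory=None):
    """Load `url` in a browser controlled by selenium and return the HTML."""
    if driver_factory is None:
        # Imported here so the function can be defined without selenium installed.
        from selenium import webdriver
        driver_factory = webdriver.Chrome

    driver = driver_factory()
    try:
        driver.get(url)            # the browser loads the page and runs its js
        return driver.page_source  # HTML of the page as the browser currently sees it
    finally:
        driver.quit()              # always close the browser, even on errors
```

Pages that fill in their data some time after loading may need an explicit wait before `page_source` is read; that step is left out of this sketch.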

Third, what is phantomjs?

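	phantomjs: a self-contained browser engine with no UI (headless). It can load pages and run the js code in them, just like a browser.
	chromedriver.exe: the driver program for Google Chrome. Through it, selenium can drive the Chrome browser, which is a browser with a UI.
	
	Of the two, the browser with a UI is the more capable: many sites can detect that you are crawling with phantomjs and will ban you.
		A Chrome browser with a UI is far less likely to be banned, because its requests look like those of a real user.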

Fourth, installing selenium and phantomjs

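		(1) Download phantomjs and chromedriver.exe
			Search for a phantomjs mirror
		(2) Install:
			Unzip the archives
			Find the exe file in each of the two archives and copy it into the anaconda/Scripts directory; that is all.
			C:\Anaconda3\Scripts
			C:\Anaconda3
		(3) Test:
			In cmd, enter: phantomjs
						 chromedriver
		(4) Install selenium: pip install selenium==<version number>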
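
The test in step (3) can also be done from Python. This is a small sketch using only the standard library (Python 3.8 or later); the function names `find_tools` and `selenium_version` are illustrative.

```python
import shutil
from importlib import metadata


def find_tools(names=("phantomjs", "chromedriver")):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def selenium_version():
    """Return the installed selenium version string, or None if it is missing."""
    try:
        return metadata.version("selenium")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(name, "->", path or "not found on PATH")
    print("selenium ->", selenium_version() or "not installed")
```

A `None` path means the exe was not copied into a directory listed in PATH, so step (2) should be checked again.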

Origin blog.csdn.net/Sadi_/article/details/104363436