Crawler learning, day 2

Basics summary:

 1. What is a crawler?

  A crawler is a program or script that automatically crawls information from the Web according to certain rules.

2. Types of crawlers

 (1) General crawler: simply put, it downloads all the pages it can reach on the Internet and stores them on local servers as a backup, does some processing on those pages (extracting keywords, removing ads), and finally provides a search interface to users.

 (2) Focused crawler: crawls specified content from specified pages; it is the most common kind of crawler.

 (3) Incremental crawler: scans the URLs already crawled locally and crawls only newly added pages. This cuts the amount of data downloaded and keeps already-crawled pages up to date, reducing time and space costs, but it increases the complexity of the crawling algorithm and the difficulty of implementation (see the sketch after this list).

 (4) Deep Web crawler: crawls pages hidden behind search forms, which are only reachable after a user submits some keywords; the crawler analyzes the form, fills it out, and then crawls the results.
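
To make the incremental idea in (3) concrete, here is a minimal sketch; the seen_urls.json file and the helper names are illustrative assumptions, not from the source:

```python
import json
import os

import requests

SEEN_FILE = "seen_urls.json"  # hypothetical local record of already-crawled URLs

def load_seen():
    """Load the set of URLs crawled on previous runs, if any."""
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE) as f:
            return set(json.load(f))
    return set()

def crawl_incremental(urls):
    """Fetch only URLs not seen before, then persist the updated set."""
    seen = load_seen()
    for url in urls:
        if url in seen:  # already crawled earlier: skip the download
            continue
        resp = requests.get(url, timeout=10)
        print(url, resp.status_code)
        seen.add(url)
    with open(SEEN_FILE, "w") as f:
        json.dump(sorted(seen), f)
```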

3. Anti-crawler measures (excerpted from "raging bear children", https://www.jianshu.com/p/afd873a42b2d)

I. Ban IP

The ops staff, analyzing recent logs, notice that one particular IP has unusually heavy traffic, visiting a huge number of pages within a certain period. They judge that this is not normal access behavior, so they block that IP directly on the server.

Solution: this method very easily hits normal users by accident, because other users in the same area may share that IP, so blocking it cuts off many normal users; hence ops staff generally do not limit crawlers this way. Still, in the face of very heavy traffic the server will occasionally blacklist an IP and release it after a while. Distributed crawling plus purchased proxy IPs handles this well, though the crawler's cost goes up.
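
A minimal sketch of the proxy-IP workaround, assuming a pool of purchased proxies (the addresses below are placeholders):

```python
import random

import requests

# Placeholder proxy pool; in practice these come from a proxy vendor.
PROXY_POOL = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
]

def get_via_proxy(url):
    """Send each request through a randomly chosen proxy IP."""
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
```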

II. Ban User-Agent

Many crawlers send default request headers that are obviously crawler-like, such as python-requests/2.18.4; when ops staff see packets carrying this kind of header, they refuse access outright and return a 403 error.

Solution: simply set the request headers so that the crawler impersonates a browser or another crawler, e.g. r = requests.get(url, headers={'User-Agent': 'XXXspider'}).
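
Spelling that one-liner out with a browser-style User-Agent (the UA string and URL here are just examples):

```python
import requests

headers = {
    # Browser-like UA instead of the default "python-requests/2.18.4"
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/80.0.3987.132 Safari/537.36"),
}
r = requests.get("https://example.com", headers=headers)
print(r.status_code)
```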

Case: Snowball Network

III. Ban cookies

The server sets a cookie for each visitor to the page; once the number of visits carrying a given cookie exceeds a certain threshold, that cookie is banned and released after a while. Crawlers usually visit without cookies, of course, but some pages require login to see more content, as on Sina Weibo (I have run into this).

Solution: control the access speed; or, for sites that require login such as Sina Weibo, buy several accounts on Taobao, generate multiple cookies, and attach one of them on every visit.
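
A sketch of the multiple-account idea, assuming each account was logged in once and its cookies were saved; the cookie names and values are placeholders:

```python
import random

import requests

# Cookies captured from several logged-in accounts (placeholder values).
COOKIE_POOL = [
    {"session_id": "cookie-from-account-1"},
    {"session_id": "cookie-from-account-2"},
]

def get_with_cookie(url):
    """Rotate accounts so no single cookie crosses the ban threshold."""
    return requests.get(url, cookies=random.choice(COOKIE_POOL), timeout=10)
```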

Case: Mafengwo (蚂蜂窝).

IV. CAPTCHA

When a user visits too often, the site automatically redirects to a page that demands a verification code; only after entering the correct code can the user keep accessing the site.

Solution: process and recognize the verification code with third-party Python libraries (pytesser, PIL); for complicated codes, machine learning can teach the crawler to recognize them, so the program identifies and enters the code automatically and keeps crawling.
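
The pytesser library named above is quite dated; here is a minimal sketch with PIL plus pytesseract, its present-day equivalent (requires the Tesseract binary to be installed), under the assumption of a simple, low-noise code image:

```python
from PIL import Image
import pytesseract  # modern stand-in for the pytesser library named above

def solve_captcha(path):
    """Grayscale + crude binarization, then OCR the verification code."""
    img = Image.open(path).convert("L")
    img = img.point(lambda p: 255 if p > 128 else 0)
    return pytesseract.image_to_string(img).strip()

# text = solve_captcha("captcha.png")  # submit `text` with the next request
```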

Case: 51Job, Anquanke (安全客)

V. JavaScript rendering

Web developers put important information on the page but not inside the HTML tags; the browser automatically runs the JS code in the <script> tags and renders that information in the browser, while crawlers are not equipped to execute JS and cannot read the information the JS generates.

Solution: extract the JS code and pull the information out with a script, match the content directly with regular expressions, or render the page with webdriver plus a headless browser such as PhantomJS.
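
PhantomJS is no longer maintained, so this sketch uses webdriver with headless Chrome instead (the URL is a placeholder):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")   # render JS without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")    # placeholder URL
html = driver.page_source            # the HTML after <script> code has run
driver.quit()
print(len(html))
```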

Case: 51Job (前程无忧)

Open one of 51Job's job-listing pages and access it directly with requests.get and you get only about 20 records, clearly incomplete; accessing the same page through webdriver yields all 50 complete job postings.

VI. Ajax asynchronous transfer

When the page is visited, the server returns the page skeleton to the client, and while interacting with the client it pushes data packets to it via asynchronous Ajax, which are then rendered on the page; a crawler that grabs the page directly gets empty information.

Solution: capture and analyze the interface behind the Ajax request with Fiddler or Wireshark, then imitate the pattern yourself to construct a request to the server and obtain the real data packets it returns.

Case: Lagou (拉勾网)

Open one of Lagou's job-listing pages and you can see plenty of job-posting data. Click to the next page and the page skeleton and URL stay unchanged while each posting changes. Capturing packets with the Chrome developer tools reveals a request to a page at http://www.lagou.com/zhaopin/Java/2/?filterOption=3; opening that page shows it is the real data source for page two, and by imitating the request you can crawl every page's data.
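
A sketch of imitating that request page by page, using the URL pattern found above; the header and page range are assumptions:

```python
import requests

headers = {"User-Agent": "Mozilla/5.0"}  # minimal browser-like header

# The page number sits in the path: .../zhaopin/Java/<page>/?filterOption=3
for page in range(1, 6):  # first five pages, as an example
    url = f"http://www.lagou.com/zhaopin/Java/{page}/?filterOption=3"
    resp = requests.get(url, headers=headers, timeout=10)
    print(page, resp.status_code, len(resp.text))
```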

VII. JiaSuLe (加速乐)

Some sites use the JiaSuLe service, which first checks whether the client's cookie is correct before the visit goes through. If it is not, the server returns a 521 status code, sets a cookie, and returns a snippet of JS code that, once executed by a browser, generates a second cookie; only when both cookies are sent to the server together will it return the correct page content.

Solution: put the JS code the server returns in a string, use Node.js to unpack this code, then decrypt the relevant parts to get the key information and put it in the headers of the next request.

Case: JiaSuLe (加速乐)

Python's requests library alone cannot handle this interaction. After consulting some references, the solution is as follows:

After capturing the returned Set-Cookie, execute the returned eval-packed JS code with a script, and send the cookie that code generates together with the earlier Set-Cookie to the server; the correct content then comes back, i.e. the status code changes from 521 to 200.

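A rough sketch of that flow using the PyExecJS library to run the returned JS; the de-obfuscation step is site-specific and left as a stub, and the cookie name is only the one JiaSuLe is commonly reported to use, so treat all details as assumptions:

```python
import execjs    # pip install PyExecJS; runs JS through a local engine such as Node
import requests

def deobfuscate(js_text):
    """Site-specific step: unpack the eval-wrapped JS into an expression
    that yields the second cookie's value. Hypothetical placeholder."""
    raise NotImplementedError

session = requests.Session()
resp = session.get("https://example.com")       # placeholder for a protected site
if resp.status_code == 521:
    # The session already holds the first cookie from the 521 Set-Cookie header.
    second_cookie = execjs.eval(deobfuscate(resp.text))
    session.cookies.set("__jsl_clearance", second_cookie)  # commonly reported name
    resp = session.get("https://example.com")   # retry: should now return 200
print(resp.status_code)
```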

