Learning Java web crawlers: what basics do you need?

When people talk about web crawlers, most think of Python first; crawling has practically become synonymous with Python, and far fewer people associate it with Java. Many don't realize that Java can do web crawling too, and do it very well: the open-source community has a number of excellent Java crawler frameworks, such as webmagic. My first real job was writing data-collection programs with webmagic, as part of a public-opinion analysis system that involved scraping many news sites. Because I didn't understand the framework's design principles at the time, I took quite a few detours. In fact, webmagic's design borrows from Scrapy, so it can be just as powerful; we'll discuss the webmagic framework in detail later in this series.

In the years that followed I worked on several more crawler projects, mostly in Python, but language aside, crawling is really a set of ideas. Writing these crawlers helped my technical growth enormously, because along the way you run into all kinds of problems. Beyond keeping your collection program available, the real test of a crawler developer is the strange problems each target site throws at you: for example, extracting tabular data elegantly from an HTML page that contains not a single class or id attribute really tests your imagination and skill. I feel very fortunate to have worked with web crawlers early in my career; it accelerated my understanding of the Internet and broadened my horizons.

Web crawlers have become quite popular in recent years. If you want to learn Java web crawling, here, summed up from my own experience, are the four basics an entry-level Java crawler developer needs to know.

1. The "ethics" of crawling

Why do I put this first? Because I think it's the most important. What do I mean by the "ethics" of crawling? Crawl by the rules: don't interfere with the normal operation of the server being crawled, and don't bring its service down. That is crawler ethics.

A frequently debated question is: are crawlers legal? On Zhihu, this is what you'll see:


There are thousands upon thousands of answers to this question; personally, I agree with the one below:

As a computer technique, crawling is inherently neutral, so crawlers themselves are not prohibited by law; but using crawler technology to acquire data carries the risk of being illegal or even criminal. Each case must be analyzed on its specifics: a fruit knife is not itself prohibited by law, but stabbing someone with it will not be tolerated.

So are crawlers illegal? It depends on whether what you are doing with them is illegal. What is the essence of a web crawler? It is using a machine to visit pages in place of a human. Viewing public news is certainly not illegal, so gathering public information from the Internet is not against the law either, just like the major search engines do; in fact, most sites can't wait for search-engine spiders to crawl them. The opposite case is collecting other people's private data: looking at someone's private information is itself illegal, so a program that collects it is illegal too. As the answer above said, the fruit knife itself is not illegal, but stabbing someone with it is.

To crawl ethically, the Robots protocol is something you must know. Here is the Baidu Encyclopedia entry on the Robots protocol:


Many sites use the Robots protocol (robots.txt) to state which pages may be crawled and which may not. Of course, the Robots protocol is only a convention; ignoring it is no more illegal than sitting in a bus seat marked as reserved for the elderly, sick, and disabled.
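To make this concrete, here is a minimal sketch of honoring robots.txt in Java. It is deliberately simplified: it only handles `Disallow` prefix rules and ignores `User-agent` groups, `Allow` rules, and wildcards, all of which a real implementation would need.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {

    // Parse the Disallow lines of a robots.txt body and test a path against
    // them. Simplified sketch: no User-agent grouping, Allow, or wildcards.
    static boolean isAllowed(String robotsTxt, String path) {
        List<String> disallowed = new ArrayList<>();
        for (String line : robotsTxt.split("\n")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("disallow:")) {
                String rule = line.substring("disallow:".length()).trim();
                if (!rule.isEmpty()) {
                    disallowed.add(rule);
                }
            }
        }
        // A path is disallowed if any rule is a prefix of it.
        for (String rule : disallowed) {
            if (path.startsWith(rule)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /admin/\nDisallow: /private/";
        System.out.println(isAllowed(robots, "/news/today.html")); // true
        System.out.println(isAllowed(robots, "/admin/login"));     // false
    }
}
```

A polite crawler would fetch `/robots.txt` from the target host once, cache the rules, and run every candidate URL's path through a check like this before requesting it.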

Beyond the protocol, we also need to exercise restraint in our collection behavior. Article 16, Chapter II of the "Data Security Management Measures (Draft for Comment)" states:

Network operators who use automated means to access and collect website data must not impede the normal operation of the website; when such behavior seriously affects the website's operation, for example when automated access traffic exceeds one third of the site's average daily traffic and the site asks for the automated collection to stop, it must be stopped.

This provision makes clear that crawlers must not impede a site's normal operation. If your crawler takes a site down so that real visitors cannot access it, that is deeply unethical, and we should put an end to such behavior.

Beyond collecting data, we also need to take care in how we use it. Even when we collect personal data with authorization, we must never sell it; that is explicitly prohibited by law. See:

Article 5 of the "Interpretation of the Supreme People's Court and the Supreme People's Procuratorate on Several Issues Concerning the Application of Law in Handling Criminal Cases of Infringing on Citizens' Personal Information" defines "serious circumstances" as:

  • (1) illegally obtaining, selling, or providing 50 or more items of location-tracking information, communication content, credit information, or property information;
  • (2) illegally obtaining, selling, or providing 500 or more items of other citizens' personal information that may affect personal or property safety, such as accommodation records, communication records, health and physiological information, or transaction information;
  • (3) illegally obtaining, selling, or providing 5,000 or more items of citizens' personal information other than those specified in items (3) and (4), which constitutes the "serious circumstances" required by the crime of infringing on citizens' personal information.
  • In addition, providing even legally collected citizens' personal information to others without the consent of the person it was collected from also counts as "providing citizens' personal information" under Article 253-1 of the Criminal Law and may constitute a crime.

2. Learn to analyze HTTP requests

Every interaction with the server goes over the HTTP protocol. (There are non-HTTP protocols too; whether those can be crawled I can't say, since I've never collected from one, so we'll only discuss HTTP here.) Analyzing HTTP traffic for a web page is fairly simple; let's take a Baidu search for a news item as an example.

Open the F12 developer tools and click the Network tab to see all the requests. Find the link from the address bar; the main link is usually the topmost entry in the Network panel.


In the Headers pane on the right, we can see the parameters this request needs. Pay particular attention to the Request Headers and Query String Parameters sections.

Request Headers holds the request-header parameters of the HTTP request. Some sites block crawlers based on request headers, so these parameters are worth understanding. Most of them are common to all requests; User-Agent and Cookie are the two used most often. User-Agent identifies the browser making the request, and Cookie stores the user's login credentials.

Query String Parameters shows the request parameters of the HTTP request. This is especially important for POST requests, because it is where you can read off the parameters you need, which is very useful for simulating logins and other POST requests.
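As a sketch, the JDK's built-in java.net.http API (JDK 11+) can reproduce such a POST, carrying the form fields you observed in the developer tools. The URL and field names below are hypothetical placeholders, not a real login endpoint:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class LoginRequestDemo {

    // Build (but do not send) a simulated-login POST whose body carries the
    // form fields observed in the browser's request panel.
    // "username"/"password" and the URL are hypothetical examples.
    static HttpRequest buildLogin(String user, String password) {
        String form = "username=" + user + "&password=" + password;
        return HttpRequest.newBuilder(URI.create("https://www.example.com/login"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildLogin("alice", "secret");
        System.out.println(req.method() + " " + req.uri());
    }
}
```

Sending it is then one call on an `HttpClient` instance; the point here is that everything you see under the request panel maps directly onto builder calls.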

The above covers analyzing HTTP requests for web pages. If you need to collect data from inside an app, you'll have to capture its traffic with an external tool, since apps have no built-in debug console. The two most widely used tools are listed below; dig into them if you're interested.

  • Fiddler
  • Wireshark

3. Learn HTML page parsing

The pages we collect are HTML pages, and we need to pull the information we want out of them. That means HTML page parsing, i.e. DOM node parsing, and it is absolutely essential: without it you're like a magician without props, reduced to staring helplessly. Take the following HTML page as an example.


Suppose we want the title "java user-agent 判断是否电脑访问". First, inspect the element with F12:


I've boxed the span tag containing the title in the figure. How do we parse this node? There are countless ways, but the most commonly used selectors are CSS selectors and XPath. If you don't know these two yet, the links below are a good place to start:

CSS selector reference: https://www.w3school.com.cn/cssref/css_selectors.asp

XPath tutorial: https://www.w3school.com.cn/xpath/xpath_syntax.asp

With a CSS selector, the expression is: #wgt-ask > h1 > span

With XPath, it is: //span[@class="wgt-ask"]

That gives us the span node; all that's left is to take its text. Besides writing CSS selectors and XPath by hand, we can also have the browser generate them for us, for example in Chrome:


Just select the node, right-click, and find Copy: it offers several ways to express the node, as shown above. Copy selector gives the CSS selector and Copy XPath gives the XPath. This feature is extremely handy.
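To show the XPath expression above in action, here is a minimal sketch using the JDK's built-in XPath engine on a hand-written fragment that mimics the page's markup. The fragment itself is an assumption; also note that the JDK parser requires well-formed XML, so real-world HTML is usually better handled by an HTML-aware parser such as Jsoup:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathDemo {

    // Evaluate the article's XPath expression against a page fragment
    // (which must be well-formed XML for the JDK parser).
    static String extractTitle(String html) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        // Same expression as in the text: the span whose class is "wgt-ask".
        return XPathFactory.newInstance().newXPath()
                .evaluate("//span[@class='wgt-ask']", doc);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the markup shown in the screenshot.
        String html = "<div id=\"wgt-ask\"><h1><span class=\"wgt-ask\">"
                + "java user-agent 判断是否电脑访问</span></h1></div>";
        System.out.println(extractTitle(html));
    }
}
```

`evaluate` with no explicit return type yields the node's string value, i.e. exactly the title text we were after.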

4. Understand anti-crawler strategies

Because crawlers are so rampant nowadays, many sites have anti-crawler mechanisms to filter out crawler programs and keep the site available. This is a necessary measure; after all, if the site becomes unusable there's no business left to speak of. There are many anti-crawler techniques; let's look at a few common ones.

Headers-based anti-crawler mechanisms

This is a fairly common mechanism: the site checks parameters such as User-Agent and Referer in the Request Headers to decide whether the caller is a crawler. Bypassing it is simple: look up the User-Agent and Referer values the site expects in a normal browser session, then set those parameters in your crawler's Request Headers.
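A minimal sketch with the JDK's java.net.http API (JDK 11+); the header values are placeholders copied from a typical browser session, and the URL is hypothetical:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class HeaderDemo {

    // Build a GET request that carries browser-like headers, so that
    // header-based filters see it as an ordinary browser visit.
    // Both header values and the URL are hypothetical examples.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent",
                        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                .header("Referer", "https://www.example.com/")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("https://www.example.com/page");
        System.out.println(req.headers().firstValue("User-Agent").orElse("(none)"));
    }
}
```

The same two `.header(...)` calls work identically on Jsoup's and Apache HttpClient's request builders; the principle, not the library, is what matters.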

Behavior-based anti-crawler mechanisms

This is also common. The most frequent form is IP rate limiting: an IP is only allowed a certain number of visits within a time window, and exceeding that frequency gets it flagged as a crawler. Douban Movies, for example, limits by IP.

For this mechanism we can use proxy IPs: fetch a batch of proxy IPs from a proxy provider and set a proxy IP on each request.
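With java.net.http, routing requests through a proxy is a single builder call; the address below is a hypothetical local proxy, not a real service:

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.http.HttpClient;

public class ProxyDemo {

    // Build an HttpClient whose requests are all routed through the given
    // proxy. Host and port here are hypothetical placeholders; a real
    // crawler would rotate through a pool of such proxies.
    static HttpClient withProxy(String host, int port) {
        return HttpClient.newBuilder()
                .proxy(ProxySelector.of(new InetSocketAddress(host, port)))
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = withProxy("127.0.0.1", 8888);
        System.out.println("proxy configured: " + client.proxy().isPresent());
    }
}
```

To rotate IPs, keep a list of proxies and build (or reconfigure) the client with the next entry whenever a request starts failing or gets rate-limited.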

Besides IP limits, some sites look at the interval between your visits; if every interval is identical, you may also be flagged as a crawler. The way around this is to vary the interval between requests, e.g. sleep 1 minute this time and 30 seconds the next.
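One minimal way to randomize the interval, as a sketch:

```java
import java.util.Random;

public class PoliteDelay {

    private static final Random RANDOM = new Random();

    // Return a randomized delay in [minMillis, maxMillis), so that
    // successive requests are never evenly spaced.
    static long nextDelayMillis(long minMillis, long maxMillis) {
        return minMillis + (long) (RANDOM.nextDouble() * (maxMillis - minMillis));
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            long delay = nextDelayMillis(30_000, 60_000);
            System.out.println("would sleep " + delay + " ms before next request");
            // Thread.sleep(delay); // uncomment in a real crawler loop
        }
    }
}
```

In a real crawl loop you'd call `Thread.sleep(nextDelayMillis(...))` between requests; the bounds are a judgment call based on how much load the target site can reasonably absorb.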

Dynamic-page anti-crawler mechanisms

On many sites, the data we need is loaded via Ajax or generated by JavaScript, which is a pain to crawl. There are two ways around this. One is to use a helper tool, such as Selenium, to obtain the fully rendered page. The other is to think in reverse: find the Ajax URL the page requests its data from, and fetch that URL directly.

That covers the basics of crawling: mainly the tools of the trade and anti-crawler strategies, all of which will help with the crawler learning ahead. Over the past few years I've written several crawler projects on and off, using Java early on and Python later, but recently my interest in Java crawling has revived, so I plan to write a series of posts re-examining Java web crawlers, as a kind of summary of my Java crawling. If it helps anyone who wants to do web crawling in Java, all the better. The Java crawler series is expected to run six articles, going step by step from simple to complex, covering every problem I've hit in my years of crawling. Here is a preview of the six articles:

1. Web crawlers are simpler than you think

This article is the introduction to web crawling. We'll fetch pages in two ways, with Jsoup and with HttpClient, then parse out the data with selectors. In the end you'll see that a crawler is just an HTTP request; it's that simple.

2. The page I'm collecting requires login. What do I do?

This installment briefly discusses collecting data that requires login, using Douban profile data as the example, covering two approaches: manually setting cookies and simulating the login.

3. The data loads asynchronously via Ajax. What do I do?

This installment briefly discusses asynchronously loaded data, using NetEase News as the example, via two approaches: using the htmlunit tool to obtain the rendered page, and thinking in reverse to fetch the Ajax request URL directly.

4. My IP got banned while collecting. What do I do?

Having your IP access restricted is about as common as it gets. Using Douban Movies as the example, this article centers on setting proxy IPs, briefly discussing solutions to IP bans, and also touches on how to build your own IP proxy service.

5. My crawler's performance is too poor. What do I do?

Sometimes crawler performance matters, and a single-threaded approach won't cut it; we may need multithreaded or even distributed crawlers. This article discusses multithreaded crawlers and distributed crawler architectures.

6. A case study of the open-source crawler framework webmagic

I used webmagic for a crawler before, but at the time I only half-understood it and never really grasped the framework. With a few more years of experience I now see it in a whole new light, so I want to build a simple demo following webmagic's conventions to experience the framework's power.

That's my crawler series: putting into words, bit by bit, the pitfalls this amateur crawler enthusiast has hit over the years. If you're planning to learn Java web crawling, consider following along; I believe you'll get something out of it, because I'll put my heart into every article.

Finally

A small plug: you're welcome to scan the code and follow my WeChat official account, 「平头哥的技术博文」, so we can improve together.


Origin www.cnblogs.com/jamaler/p/11621633.html