Crawler Primer 1 --- A Brief Talk on Web Crawlers

Posts in this series:

Crawler Primer 1 --- A Brief Talk on Web Crawlers

Crawler Primer 2 --- The Crawler Framework WebMagic

Crawler Primer 3 --- Crawlers in Practice

1 A brief talk on web crawlers

  1.1 What is a web crawler

       A web crawler, also known as a web spider, web ant, or web robot, can automatically browse information on the network. Of course, it browses according to rules we set in advance, and these rules are called web crawler algorithms. Using Python, for example, you can easily write a crawler program that retrieves information from the Internet automatically.

       Search engines cannot do without crawlers. For example, Baidu's crawler is called Baiduspider. Every day, Baiduspider crawls the flood of information on the Internet, fetching and indexing high-quality pages. When a user searches for a keyword on Baidu, Baidu analyzes the keyword, identifies the relevant pages from its collection of indexed pages, ranks them according to certain rules, and presents the results to the user.

       Throughout this process, Baiduspider plays a crucial role. How can it cover more of the Internet's high-quality pages? How can it filter out duplicate pages? Both are determined by Baiduspider's crawling algorithms. Different algorithms give a crawler different efficiency, and the crawling results will vary accordingly. So when we study crawlers, we should not only understand how they are implemented, but also know some common crawling algorithms; if necessary, we may even need to design our own. For now, though, we only need a basic understanding of what a crawler is.

        Besides Baidu, the other search engines cannot do without crawlers either, and each has its own. For example, 360's crawler is called 360Spider, Sogou's is called Sogouspider, and Bing's is called Bingbot.

        If you want to build a small search engine of your own, you can write your own crawler too. It may not match the major search engines in performance or algorithms, but the degree of customization will be very high, and doing so will also deepen your understanding of how search engines work internally.

        The era of big data is also inseparable from crawlers. For big data analysis or data mining, we can download data sources from some of the larger official sites. But these data sources are limited, so how do we get more, higher-quality data? At that point, we can write our own crawler programs to collect data from the Internet.
 

     1.2 What can a web crawler do

       Now that we have an initial understanding of web crawlers, what exactly can they do? They can be used to:

  • Build search engines
  • Obtain more data sources in the era of big data
  • Quickly populate test and operational data
  • Provide training data sets for AI

       Here, the author takes a city tourism public-opinion monitoring system built in 2016 as an example:

       As this tourism public-opinion system shows, the source data are crawled from the relevant websites by web crawlers and, after ETL, are finally output to the application layer, where public-opinion warnings and interventions are carried out for hotels, scenic spots, tour buses, and so on. Without crawlers, the network data would simply be unavailable, or obtaining it would carry enormous manual-entry costs. Crawlers therefore generally serve as the entry point for big data, search engines, artificial intelligence, and similar systems, and they play an irreplaceable role in today's wave of big data and AI.

     1.3 Common web crawler techniques (Java)

         1.3.1 Low-level implementation: HttpClient + Jsoup

       HttpClient is a subproject of Apache Jakarta Commons. It is a client-side programming toolkit that provides efficient, up-to-date, feature-rich support for the HTTP protocol, including the latest versions and recommendations of the protocol. HttpClient has been adopted by many projects; for example, Cactus and HtmlUnit, two other well-known open-source projects under Apache Jakarta, both use it. For more information, visit http://hc.apache.org/. Jsoup is an HTML parser for Java. It can parse HTML directly from a URL address or from an HTML string, and it provides a very convenient API for extracting and manipulating data through the DOM, CSS selectors, and jQuery-like methods.
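
       To make the division of labor concrete, here is a minimal sketch of the HttpClient + Jsoup approach: HttpClient downloads the raw HTML, and Jsoup parses it and extracts data with CSS selectors. It assumes HttpClient 4.x and Jsoup are on the classpath; https://example.com is a placeholder URL, not a site from the original article.

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class HttpClientJsoupDemo {
        public static void main(String[] args) throws Exception {
            String url = "https://example.com"; // placeholder URL
            // Step 1: fetch the raw HTML with HttpClient
            try (CloseableHttpClient client = HttpClients.createDefault();
                 CloseableHttpResponse response = client.execute(new HttpGet(url))) {
                String html = EntityUtils.toString(response.getEntity(), "UTF-8");

                // Step 2: parse the HTML with Jsoup (the base URL resolves relative links)
                Document doc = Jsoup.parse(html, url);
                System.out.println("Title: " + doc.title());

                // Step 3: extract data with CSS selectors, jQuery-style
                for (Element link : doc.select("a[href]")) {
                    System.out.println(link.attr("abs:href") + " -> " + link.text());
                }
            }
        }
    }

       This pairing keeps each library doing what it is best at: HttpClient handles the network side (connections, headers, retries), while Jsoup handles parsing and extraction.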

         1.3.2 The open-source framework WebMagic

       WebMagic is an open-source Java crawler framework whose goal is to simplify crawler development and let developers concentrate on the logic of their own features. Its core is very simple, yet it covers the whole crawling process, which also makes it good material for learning crawler development. A minimal example follows the feature list below.

      WebMagic's main features:

  • Completely modular design with strong extensibility.
  • A simple core that nevertheless covers the whole crawling process, flexible and robust; also good material for learning crawler development.
  • A rich API for page extraction.
  • Crawlers can be implemented as POJOs plus annotations, with no configuration.
  • Multi-threading support.
  • Distributed crawling support.
  • Support for crawling JS-rendered dynamic pages.
  • No framework dependencies, so it can be flexibly embedded into any project.
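
       As promised above, here is a minimal sketch of a WebMagic crawler following the framework's standard PageProcessor pattern. The start URL is a placeholder, and the XPath expression is chosen only for illustration; neither comes from the original article.

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.processor.PageProcessor;

    public class DemoPageProcessor implements PageProcessor {
        // Site holds crawler-level settings: retry count, politeness delay, etc.
        private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

        @Override
        public void process(Page page) {
            // Extract the page title with an XPath expression
            page.putField("title", page.getHtml().xpath("//title/text()").toString());
            // Queue every link found on this page for further crawling
            page.addTargetRequests(page.getHtml().links().all());
        }

        @Override
        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            Spider.create(new DemoPageProcessor())
                  .addUrl("https://example.com") // placeholder start URL
                  .thread(5)                     // the multi-threading support noted above
                  .run();
        }
    }

       Compared with the hand-rolled HttpClient + Jsoup version, the framework takes over downloading, URL scheduling, deduplication, and multi-threading, so the developer writes only the extraction logic in process().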

 
