Web crawler learning 03. Python, part 1: "Basic concepts of Python web crawlers"

Introduction to crawlers

Lead-in

Before teaching this course, many students asked me: why learn web crawling, and what benefits can it bring to our future development? In fact, both the reasons for learning it and the benefits it brings are obvious, whether seen from the employment side or from practical application.

We all know that we live in the era of big data. To analyze data we must first have a data source, and learning to write crawlers lets us obtain more data sources, collected according to our own purposes.

Youku's show Mars Intelligence Agency is produced on the basis of web-crawled data and data analysis. The topics for each episode come from data crawled from popular interactive platforms, which is then analyzed. In addition, from behavioral data such as users fast-forwarding and rewinding while watching a video, Youku can estimate the audience's points of interest, which helps with writing and editing later episodes.

Jinri Toutiao ("Today's Headlines") is a recommendation-based news app. Its news data is crawled from various news sites, then processed and analyzed so that news topics of interest to each user can be pushed to that user's phone.

From an employment point of view, crawler engineers are currently in short supply and salaries are generally high, so mastering this skill in depth is very beneficial. Some people may learn crawling in order to find a job or change jobs, and from this perspective crawler engineering is a good choice. With the arrival of the big data era, crawler technology will be applied more and more widely and will have more room to grow.

Today's overview

  • Crawler introduction
  • Crawler classification
  • robots protocol
  • Anti-crawling mechanisms
  • Counter-anti-crawling mechanisms

Today's details

  • What is a crawler

    A crawler is a program that simulates a browser accessing the Internet and then fetches data from the Internet.
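
As a minimal sketch of "simulating a browser", the snippet below uses only the Python standard library. The `BROWSER_UA` string and the helper names `build_request` and `fetch` are assumptions made for illustration; they are not taken from the course material.

```python
from urllib.request import Request, urlopen

# Assumed browser-like User-Agent string, used here only for illustration.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url):
    """Build a GET request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": BROWSER_UA})


def fetch(url, timeout=10):
    """Send the request and return the response body decoded as text."""
    with urlopen(build_request(url), timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling `fetch("https://example.com/")` would return that page's HTML as a string, provided the machine has network access.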

  • Which languages can be used to implement crawlers

    1. PHP: can implement crawlers. PHP is billed as the most beautiful language in the world (billed by itself, of course, which is just blowing its own trumpet), but its support for multithreading and multiprocessing is poor when it comes to implementing crawlers.

    2. Java: can implement crawlers. Java handles crawling very well; it is the only language that keeps pace with Python and is Python's number one rival. However, crawler code written in Java is rather bloated and expensive to refactor.

    3. C, C++: can implement crawlers. But writing a crawler this way is purely a way for some people (experts) to show off their ability; it is not a wise or reasonable choice.

    4. Python: can implement crawlers. Python has simple syntax for writing and handling crawlers, elegant code, a wide range of supporting modules, a low learning cost, and very powerful frameworks (Scrapy, etc.). It is indescribably good, with no "but"!

  • Classification of crawlers

    1. General-purpose crawler:

    A general-purpose crawler is an important part of the "crawling system" of a search engine (Baidu, Google, Yahoo, etc.). Its main purpose is to download web pages from the Internet to local storage, forming a mirror backup of Internet content. Simply put, it downloads as many pages on the Internet as possible, stores them on local servers as a backup, processes those pages (extracting keywords, removing ads), and finally provides a search interface to users.

    • How do search engines find site data to crawl on the Internet?
      • The site proactively submits its URL to the search engine company
      • The search engine company cooperates with DNS service providers to obtain site URLs
      • The site proactively places links to itself on well-known websites
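
Once a search engine has some starting URLs, a general-purpose crawler can reach further pages by following the links it finds. The sketch below shows one common way to do that, a breadth-first traversal. The `PAGES` dictionary is a made-up in-memory "site" standing in for real HTTP requests, and `LinkExtractor` and `crawl` are hypothetical names.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Visit pages breadth-first from the seeds; return URLs in visit order."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched, skip it
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Made-up pages used instead of the network (an assumption for the demo).
PAGES = {
    "http://site.test/": '<a href="/a">A</a><a href="/b">B</a>',
    "http://site.test/a": '<a href="/b">B</a>',
    "http://site.test/b": '<a href="/">home</a>',
}

print(crawl(["http://site.test/"], PAGES.get))
```

The `fetch` argument is injected so that the same traversal can later be pointed at a real download function.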

    2. Focused crawler: a focused crawler fetches only specified data from the web according to specified requirements. For example, it gets the movie titles and reviews on Douban instead of all the data on the whole page.
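
To make the idea of fetching only specified data concrete, here is a small sketch that pulls just the titles out of a page and ignores everything else. The `SAMPLE_HTML` markup and the `class="title"` convention are invented for this example and do not reflect the real structure of any site.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text inside <span class="title"> elements only."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())


# Invented markup: two movie entries with ratings and an ad paragraph.
SAMPLE_HTML = """
<div class="item"><span class="title">Movie A</span><span class="rating">9.1</span><p>ad text</p></div>
<div class="item"><span class="title">Movie B</span><span class="rating">8.7</span></div>
"""


def extract_titles(html):
    """Return only the movie titles found in the page."""
    parser = TitleExtractor()
    parser.feed(html)
    return parser.titles


print(extract_titles(SAMPLE_HTML))
```

In practice a third-party parsing library is often used for this step; the standard-library parser is used here only to keep the sketch self-contained.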

  • robots.txt protocol

    - If there are pages on your site whose data you do not want crawlers to fetch, you can write a robots.txt file to restrict what crawlers may crawl. For the file format, look at Taobao's robots file (accessible at www.taobao.com/robots.txt). Note, however, that this protocol is only the equivalent of a verbal agreement and is not enforced by any technical means, so it keeps out the honest but not the dishonest. The crawlers we write during the learning stage can ignore the robots protocol for now.
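
For a crawler that does choose to honor robots.txt, Python's standard library ships a parser. The sketch below feeds it a made-up rule set (`ROBOTS_TXT`) and a hypothetical crawler name (`MyCrawler`), then asks whether two URLs may be fetched.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration: every crawler is barred from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_parser(robots_text):
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = make_parser(ROBOTS_TXT)
print(rules.can_fetch("MyCrawler", "http://site.test/index.html"))
print(rules.can_fetch("MyCrawler", "http://site.test/private/data.html"))
```

Against a live site, `set_url()` followed by `read()` would download the file instead of parsing an in-memory string.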

  • Anti-crawling

    - A site uses appropriate strategies and technical means to prevent crawlers from fetching its data.

  • Counter-anti-crawling

    - A crawler uses appropriate strategies and technical means to get around the site's anti-crawling measures and fetch the corresponding data.
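
The text does not name any specific technique, so the following is only one assumed example of each side: a site that limits how many requests a client may send within a time window, and a crawler that spaces out its requests to stay under that limit. `RequestRateGuard` and `simulate` are hypothetical names, and time is passed in as plain numbers so the behavior can be checked without waiting.

```python
from collections import defaultdict, deque


class RequestRateGuard:
    """Site side: allow at most max_requests per client within window seconds."""

    def __init__(self, max_requests, window):
        self.max_requests = max_requests
        self.window = window
        self._hits = defaultdict(deque)

    def allow(self, client, now):
        hits = self._hits[client]
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # too many recent requests: treat as a crawler
        hits.append(now)
        return True


def simulate(guard, client, interval, count):
    """Crawler side: send count requests spaced interval seconds apart.

    Returns how many of them the guard lets through.
    """
    return sum(guard.allow(client, i * interval) for i in range(count))


# A fast crawler is cut off quickly; a slower one gets every request through.
print(simulate(RequestRateGuard(3, 10.0), "fast", 0.1, 10))
print(simulate(RequestRateGuard(3, 10.0), "slow", 5.0, 10))
```

Real sites combine many signals (headers, cookies, request patterns), so a frequency limit like this should be read as a toy model rather than a description of any particular site.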

Origin www.cnblogs.com/bky20061005/p/12172309.html