Many people have seen and even used crawlers, but few know the concepts behind them!

As programmers, we are no strangers to the word "crawler"; people around us bring it up all the time. To those who don't understand it, it sounds like a very high-end, very mysterious technology. Don't worry: this article will lift the veil of mystery from the crawler family and explore its true face.

What is a crawler

A web crawler (also known as a web spider or web robot) is a program or script that automatically grabs information from the World Wide Web according to certain rules. Other, less frequently used names include ant, automatic indexer, simulator, and worm.

Put more simply: imagine the Internet as a big spider web, where each site's resources are nodes on the web. A crawler is like a spider that moves along the threads according to rules we design, finds the target nodes, and retrieves their resources.

Why use crawlers

Why do we need crawlers at all?

Imagine a scenario: you really admire a certain micro-blog celebrity and are fascinated by his posts, so you want to excerpt every word he has posted over the years and compile them into a book of quotations. How would you do it? Manually, with Ctrl+C and Ctrl+V? That approach does work, and with a small amount of data we can manage it, but would you do it for thousands of posts?

Let's imagine another scenario: you run a news aggregation site and need to pull the latest news from several news sites at regular intervals every day, something like an RSS feed. Would you visit each news site on schedule and copy the articles by hand? That would be very hard for one person to keep up with.

In both scenarios, crawler technology solves the problem easily. So we can see that crawler technology mainly helps us do two kinds of things: one is data acquisition, obtaining large amounts of information under specific rules; the other is automation, mainly used in scenarios such as information aggregation and search.

Classification of crawlers

From the point of view of the objects being crawled, crawlers can be divided into two categories: general crawlers and focused crawlers.

A general web crawler, also known as a whole-web crawler (Scalable Web Crawler), expands its crawl from a set of seed URLs to the entire Web, and mainly collects data for large Web search engines and service providers. Such crawlers cover an enormous range and number of pages, so they demand high crawl speed and large storage space, while the order in which pages are crawled matters relatively little. Our familiar Baidu and Google searches are examples: we enter keywords, they find pages related to those keywords across the whole web, and present them to us in a certain order.

A focused crawler (Focused Crawler) selectively crawls pages related to predefined topics. Compared with a general web crawler, a focused crawler only grabs specific pages, so its crawling breadth is much smaller. For example, if we need to crawl fund data from the East Money site (东方财富网), we only need to write crawling rules for that site's pages.

Put more simply: a general crawler is like a spider that needs to find a particular kind of food but does not know which node of the web has it, so it can only start from one node and inspect every node it meets. If a node has food, it takes it; if a node has a sign pointing to food at another node, it follows the sign to the next node. A focused crawler is a spider that already knows which nodes have food; it only needs to plan a route to those nodes to get the food.
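As a rough illustration of the difference in code (the link list and the domain filter below are hypothetical examples, not taken from any real crawl), a general crawler enqueues every link it discovers, while a focused crawler keeps only the links that match its target:

```python
from urllib.parse import urlparse

# Links a crawler might discover on a page (hypothetical examples).
links = [
    "https://fund.eastmoney.com/fund.html",
    "https://www.example.com/news",
    "https://fund.eastmoney.com/data/ranking.html",
]

# A general crawler would enqueue every discovered link.
general_queue = list(links)

# A focused crawler keeps only links matching its target site or topic.
focused_queue = [u for u in links
                 if urlparse(u).netloc.endswith("eastmoney.com")]

print(focused_queue)  # only the two eastmoney.com links remain
```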

Web browsing process

While browsing the web, users see a lot of nice pictures. For example, at http://image.baidu.com/ we see some pictures plus the Baidu search box, something like this:

(Figure: baidu_pic_index — the Baidu image search page)

What actually happens is this: after the user enters a URL, a DNS server is consulted to find the server host; a request is sent to that server; the server parses the request and sends HTML, JS, CSS and other files back to the user's browser; the browser parses them, and the user sees all kinds of pictures.

Therefore, the pages a user sees are essentially composed of HTML code, and that HTML is exactly what a crawler fetches; it obtains pictures, text and other resources by analyzing and filtering the HTML code.
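To make this concrete, here is a minimal sketch (standard library only; the regex is deliberately naive and assumes simple, well-formed HTML) of fetching a page's HTML and filtering out image URLs:

```python
import re
from urllib.request import Request, urlopen

url = "http://image.baidu.com/"
# Some sites reject Python's default User-Agent, so we mimic a browser.
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req, timeout=10).read().decode("utf-8", errors="ignore")

# Filter the HTML code for image resources with a simple regex.
image_urls = re.findall(r'src="(http[^"]+\.(?:png|jpg|gif))"', html)
print(image_urls[:5])
```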

The meaning of the URL

URL, short for Uniform Resource Locator, is what we commonly call a web address: a concise representation of the location of a resource on the Internet and the method for accessing it; in other words, the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which contains information indicating where the file is and how the browser should handle it.

The format of a URL consists of three parts, which the short sketch after this list illustrates:

  • The first part is the protocol (also called the service mode).
  • The second part is the IP address (or domain name) of the host holding the resource, sometimes including a port number.
  • The third part is the specific address of the resource on the host, such as a directory and file name.
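Python's standard library can split a URL into exactly these parts, which makes the format easy to see (a minimal sketch; the URL is an arbitrary example):

```python
from urllib.parse import urlparse

parts = urlparse("https://www.example.com:8080/docs/index.html")
print(parts.scheme)  # protocol (service mode):  https
print(parts.netloc)  # host address and port:    www.example.com:8080
print(parts.path)    # resource path on host:    /docs/index.html
```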

Since a crawler's goal is to acquire resources, and resources are stored on hosts, a crawler must have a target URL before it can fetch any data. The URL is therefore the fundamental basis on which a crawler obtains data, and an accurate understanding of it is a great help in studying crawlers.

The crawler workflow

The following chapters will focus on the focused crawler, whose workflow is shown below:

(Figure: spider_flow — the focused crawler workflow)

  • First, we need a queue of seed URLs. These URLs are the first nodes our spider crawls, the first step onto the large web.
  • For each URL in the queue, we send a request and get back response content, usually HTML. If the response content contains target URLs, we extract them and add them to the queue.
  • We parse the response content and extract the data we need.
  • We store the data, either to files or to a database. (A minimal end-to-end sketch of these four steps follows this list.)
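Putting the four steps together, here is a minimal sketch of the workflow (standard library only; the seed URL, the domain filter, and the title extraction are placeholder choices, not a recommendation for any real site):

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import Request, urlopen

seed_urls = ["https://example.com/"]   # step 1: the seed URL queue
queue = deque(seed_urls)
seen = set(seed_urls)
results = []

while queue and len(seen) < 10:        # small cap so the sketch terminates
    url = queue.popleft()
    try:
        req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        html = urlopen(req, timeout=10).read().decode("utf-8", errors="ignore")
    except OSError:
        continue                       # skip unreachable pages

    # Step 2: extract target URLs from the response and enqueue unseen ones.
    for link in re.findall(r'href="([^"]+)"', html):
        absolute = urljoin(url, link)
        if absolute.startswith("https://example.com") and absolute not in seen:
            seen.add(absolute)
            queue.append(absolute)

    # Step 3: parse the response content for the data we need (here, the title).
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    if match:
        results.append((url, match.group(1).strip()))

# Step 4: store the data (a plain text file here; a database works the same way).
with open("results.txt", "w", encoding="utf-8") as f:
    for url, title in results:
        f.write(f"{url}\t{title}\n")
```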

Looking at this workflow, we can identify the key steps in learning crawlers. First, we need to request a URL just as a web browser does in order to acquire the host's resources, so how to make requests and obtain the content correctly is one focus of our study. After we get the response content (that is, the content returned after requesting a URL), we need to parse it to extract the data that is valuable to us, and this parsing method is another focus of study. Finally, once we have the data, we need to store it, and the method of data storage is also very important.

So learning crawler technology can in fact be boiled down to three basic problems: request, parse, and store. Master the solutions to these three problems, and you have mastered crawler technology. If you keep these three problems in focus while learning crawlers, you won't take detours.


Summary

This section introduced the basic concepts of crawlers to give everyone a general understanding of them, in preparation for the later chapters.

