What Is a Python Crawler

As programmers, I believe none of us is a stranger to the word "crawler"; people mention it all the time, and to those who don't understand it, it looks like a very advanced and mysterious technology. Don't worry, in this article we will lift the veil on the crawler family and explore its true face.

What is a crawler

A web crawler (also known as a web spider or web robot) is a program or script that automatically grabs information from the World Wide Web according to certain rules. Other, less frequently used names include ant, automatic indexer, simulator, and worm.

More simply, imagine the Internet as a big spider web, where every site (resource) is a node on that web. A crawler is like a spider: it travels along the strands according to well-designed rules, finds the target node, and retrieves the resource there.

Why use a crawler

Why do we need crawlers at all?

Imagine a scenario: you greatly admire a celebrity on Weibo and are fascinated by their posts, and you want to excerpt every sentence from their decade of posts and compile them into a book of quotations. How would you do it? Manually, with Ctrl+C and Ctrl+V? That approach does work, and with a small amount of data we can do it that way, but would you do it for thousands of posts?

Let's imagine another scenario: you are building a news aggregation site that needs to pull the latest news from several news sites every day on a schedule, something like an RSS feed. Would you copy the content from each news site by hand at regular intervals? That would be very hard for an individual to keep up with.

In both scenarios above, crawler technology solves the problem easily. From this we can see that crawler technology mainly helps us with two kinds of things: one is data acquisition, obtaining large amounts of information according to specific rules; the other is automation, mainly used for tasks like information aggregation and search.

Classification of crawlers

From the point of view of the objects being crawled, crawlers can be divided into two categories: general crawlers and focused crawlers.

A general web crawler, also known as a whole-web crawler (Scalable Web Crawler), expands its crawling targets from a set of seed URLs to the entire Web, and is mainly used by large web search engines and service providers to collect data. Such crawlers cover an enormous range and number of pages, so they have high requirements for crawl speed and storage space, while the requirements on the order in which pages are crawled are relatively low. Typical examples are Baidu and Google search: we enter keywords, they look for pages related to those keywords across the whole web, and they present the results to us in a certain order.

A focused crawler (Focused Crawler) is a web crawler that selectively crawls pages related to predefined topics. Compared with a general web crawler, a focused crawler only crawls specific pages, so its crawling breadth is much smaller. For example, if we need to crawl fund data from the Eastern Wealth (dongfangcaifuwang) website, we only need to define crawling rules for that site's pages.

More simply, a general crawler is like a spider that needs to find a particular kind of food but does not know which node of the web has it, so it can only start searching from one node and examine every node it meets: if there is food, it takes it, and if a node has signs indicating that some other node has food, it follows the signs to the next node. A focused crawler, on the other hand, already knows which nodes have food, so it only needs to plan its route to those nodes to get the food.

The web browsing process

As users browse the web, we may see many nice pictures. For example, at http://image.baidu.com/ we see the Baidu search box and a few pictures, something like this:

[Figure: baidu_pic_index]

What actually happens is this: after the user enters the URL, a DNS server is consulted to find the server host, and a request is sent to that server; the server processes the request and sends HTML, JS, CSS and other files back to the user's browser; the browser parses them, and the user sees the page with all its pictures.

Therefore, the pages the user sees are essentially built from HTML code, and that is what a crawler fetches; it obtains pictures, text and other resources by analyzing and filtering that HTML code.
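To make that idea concrete, here is a minimal sketch in Python: fetch a page the way a browser would, then filter its HTML for image addresses. The use of the third-party requests library and the simple regular expression are assumptions for illustration, not a complete solution.

```python
# Minimal sketch: fetch a page and filter its HTML for image URLs.
# Assumes the third-party `requests` library is installed (pip install requests).
import re
import requests

url = "http://image.baidu.com/"            # the example page mentioned above
response = requests.get(url, timeout=10)   # send the HTTP request, like a browser
html = response.text                       # the HTML document the server returns

# Pull out anything that looks like an image address from the raw HTML.
image_urls = re.findall(r'https?://[^\s"\']+\.(?:jpg|png|gif)', html)
for img in image_urls[:10]:
    print(img)
```

Real pages often load images through JavaScript or APIs, so a regex over the raw HTML is only a first approximation, but it shows the basic idea of "analyze and filter the HTML code".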

The meaning of a URL

A URL, or Uniform Resource Locator, is what we usually call a web address. It is a concise representation of where a resource is located on the Internet and how to access it; in other words, it is the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which contains information indicating where the file is and how the browser should handle it.

The format of a URL consists of three parts:

  • The first part is the protocol (also called the service mode).
  • The second part is the IP address or host name of the server that holds the resource (sometimes including a port number).
  • The third part is the path to the specific resource on that host, such as directories and file names.
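As a quick illustration, Python's standard library can split a URL into exactly these parts; the address below is a made-up example.

```python
# Split a URL into the three parts described above, using the standard library only.
from urllib.parse import urlparse

url = "http://www.example.com:8080/fund/data/list.html"   # hypothetical example URL
parts = urlparse(url)

print(parts.scheme)   # protocol (service mode), e.g. 'http'
print(parts.netloc)   # host (and optional port), e.g. 'www.example.com:8080'
print(parts.path)     # path to the specific resource, e.g. '/fund/data/list.html'
```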

Since the goal of a crawler is to obtain resources, and resources are stored on some host, a crawler must have a target URL before it can fetch any data. The URL is therefore the basic starting point from which a crawler obtains data, and understanding its meaning precisely is very helpful when learning about crawlers.

The crawler workflow

The following chapters mainly discuss focused crawlers. The workflow of a focused crawler is shown in the figure below:

[Figure: spider_flow]

  • First we need a queue of seed URLs. The URLs in this queue are like the first node our spider crawls, the first step of its journey across the big web.
  • We send a request for each URL in the queue and receive a response, usually HTML. If the response contains our target URLs, we extract them and add them to the URL queue.
  • We parse the response content and extract the data we need.
  • We store the data, for example in a database or in files (see the sketch below).
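Here is a minimal sketch of this workflow in Python. It assumes the third-party requests library; the seed URL, the link pattern, and the output file are placeholders for illustration, not a real target site.

```python
# Minimal focused-crawler sketch: seed queue -> request -> parse -> store.
import re
from collections import deque

import requests

seed_urls = ["http://www.example.com/"]   # hypothetical seed URL
url_queue = deque(seed_urls)              # step 1: the seed URL queue
seen = set(seed_urls)
results = []

while url_queue:
    url = url_queue.popleft()
    response = requests.get(url, timeout=10)   # step 2: request each URL
    html = response.text

    # If the response contains new target URLs, extract them and add them to the queue.
    for link in re.findall(r'href="(http://www\.example\.com/[^"]+)"', html):
        if link not in seen:
            seen.add(link)
            url_queue.append(link)

    # Step 3: parse the response and extract the data we need (here, the page title).
    results.extend(re.findall(r"<title>(.*?)</title>", html))

# Step 4: store the data, here simply in a text file.
with open("titles.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(results))
```

A real crawler would add error handling, politeness delays, and a proper HTML parser, but the four steps above are the skeleton that every focused crawler shares.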

From this workflow, you should be able to see the key steps involved in learning to write crawlers. First, we need to request a URL just like a browser does in order to obtain a resource from a host, so the request methods and how to correctly obtain the content are one focus of our learning. After we obtain the resource (that is, the response content returned after requesting the URL), we need to parse it to extract the data that is valuable to us, so parsing methods are another key topic. Once we have the data, we need to store it, so data storage methods are also important.

So the crawler techniques we are going to learn really boil down to three basic problems: requesting, parsing, and storing. Once you are proficient in the solutions to these three problems, you have essentially mastered crawling. If you keep your learning tightly focused on these three problems, you won't take any detours.

Summary

This section introduced the basic concepts of crawlers to give you a general understanding of them in preparation for the later chapters. That's the appetizer finished.
