How to Create a Web Spider

A web spider is a computer application that downloads a web page and then follows all of the links on that page and downloads them as well. Web spiders are used to store websites for offline reading, or to store web pages in a database for use by a search engine. Creating a web spider is a challenging task, suitable for a college-level programming class. These instructions assume you have solid programming experience but no knowledge of spider architecture. The steps lay out a very specific architecture for writing a web spider in your chosen language.

Initialize your program with the first web page you wish to download. Add the URL of this page to a new database table of URLs.
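As a concrete starting point, here is a minimal sketch in Python. It assumes the URL "database table" is just an in-memory list and the "database pointer" an integer index, with www.example.com standing in for whatever seed page you choose; a real spider might use an actual database such as SQLite instead.

```python
# Hypothetical seed page; the "database table" here is a plain Python list
# and the "database pointer" is an integer index into it.
url_table = ["http://www.example.com/"]  # new URL table containing the initial page
pointer = 0                              # points at the next URL to download
```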

Send a command to the web browser instructing it to fetch this web page, and save it to disk. Move the database pointer forward one step past the URL you just downloaded, so that it now points to the end of the table.
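Continuing the sketch, the snippet below uses the third-party `requests` library in place of commanding an actual web browser, and an arbitrary `page_N.html` naming scheme for the files saved to disk.

```python
import requests  # third-party HTTP library, standing in for "the web browser"

url = url_table[pointer]
html = requests.get(url, timeout=10).text         # fetch this web page
with open("page_%d.html" % pointer, "w", encoding="utf-8") as f:
    f.write(html)                                  # save it to disk
pointer += 1  # move the pointer one step past the URL just downloaded
```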

Read the web page into the program and parse it for links to additional web pages. This is typically done by searching for the text string "http://" and capturing the text between that string and a terminating character (such as " ", "." or ">"). Add these links to the URL database table; the database pointer should remain at the top of this new list.
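A sketch of the parsing step, continuing from above. Note that the article's terminator list includes ".", which would also cut ordinary URLs short at their first dot, so this version stops at whitespace, quotes and angle brackets instead.

```python
import re

# Find "http://" and capture everything up to a terminating character.
links = re.findall(r'http://[^\s<>"\']+', html)

# Append the links; the pointer stays put, at the head of this new list.
url_table.extend(links)
```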

Test the entries in the database table for uniqueness, and remove any URLs that appear more than once.
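One straightforward way to do this, continuing the sketch: keep only the first occurrence of each URL, so the entries before the pointer stay in place.

```python
# Remove URLs that appear more than once, preserving order so the pointer
# still lines up with the entries that were already downloaded.
seen = set()
unique = []
for u in url_table:
    if u not in seen:
        seen.add(u)
        unique.append(u)
url_table = unique
```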

If you wish to apply a URL filter (for example, to prevent downloading pages from sites on other domains), apply it now to the URL database table and remove any URLs you do not wish to download.
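A sketch of one possible filter, assuming you only want pages from the same domain as the seed URL; only the not-yet-downloaded tail of the table is touched, so the pointer stays valid.

```python
from urllib.parse import urlparse

allowed_domain = urlparse(url_table[0]).netloc   # domain of the seed page
# Drop newly found URLs that point at other domains.
url_table[pointer:] = [u for u in url_table[pointer:]
                       if urlparse(u).netloc == allowed_domain]
```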

Set up a programmatic loop so that your spider returns to step 2 above. This recursively downloads every URL the spider encounters. Removing duplicate URLs ensures that the spider terminates properly once it reaches the last unique URL.
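Putting the pieces together, here is a compact end-to-end sketch of that loop under the same assumptions as above (Python, the `requests` library instead of a browser, a list as the URL table, a hypothetical seed page). The MAX_PAGES cap is an addition not mentioned in the article, included so a test run cannot crawl indefinitely.

```python
import re
import requests
from urllib.parse import urlparse

url_table = ["http://www.example.com/"]        # step 1: seed URL (hypothetical)
pointer = 0                                    # database pointer
allowed_domain = urlparse(url_table[0]).netloc
MAX_PAGES = 50                                 # safety cap, not in the article

while pointer < len(url_table) and pointer < MAX_PAGES:
    url = url_table[pointer]
    try:
        html = requests.get(url, timeout=10).text            # step 2: fetch the page
        with open("page_%d.html" % pointer, "w", encoding="utf-8") as f:
            f.write(html)                                     # save it to disk
    except requests.RequestException:
        html = ""                                             # skip pages that fail
    pointer += 1                                              # step past this URL

    # Step 3: parse for further links ("." omitted from the terminators; see above).
    links = re.findall(r'http://[^\s<>"\']+', html)
    url_table.extend(links)

    # Step 4: remove duplicates, keeping the first occurrence of each URL.
    seen, unique = set(), []
    for u in url_table:
        if u not in seen:
            seen.add(u)
            unique.append(u)
    url_table = unique

    # Step 5: optional filter, keep only URLs on the seed's domain.
    url_table[pointer:] = [u for u in url_table[pointer:]
                           if urlparse(u).netloc == allowed_domain]

print("Downloaded %d pages." % pointer)
```

In practice you would also want to handle https:// links and relative URLs, which a literal search for "http://" does not catch.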


Original article

How to Create a Web Spider

by Ellis Davidson


Original article URL: https://itstillworks.com/create-spider-5899924.html



Reprinted from blog.csdn.net/keepfriend/article/details/79597052