Foreign Literature | A Clickstream-based Parallel Web Crawler

Foreign-Language Material

A Clickstream-based Focused Trend Parallel Web Crawler

1. INTRODUCTION

The dimension of the World Wide Web is expanding at an unpredictable speed. As a result, search engines face many challenges, such as yielding accurate and up-to-date results and responding to users in a timely manner. A centralized single-process crawler is the part of a search engine that traverses the Web graph and fetches URLs starting from the initial or seed URLs, keeps them in a priority-based queue and then, in an iterative manner, selects the K most important URLs for further processing according to an importance metric, following a version of the best-first algorithm. A parallel crawler, on the other hand, is a multi-process crawler in which, upon partitioning the Web into different segments, each parallel agent is responsible for crawling one of the Web partitions [9]. At the other end of the spectrum, all-purpose unfocused crawlers attempt to search the entire Web to construct their index, while a focused crawler limits its function to a semantic Web zone by selectively seeking out pages relevant to a pre-defined topic taxonomy, in an effort to keep the index at a reasonable size [8], [18].
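
As a rough illustration of the best-first frontier described above, the following Python sketch keeps discovered URLs in a priority queue and pops the K most important candidates per iteration; the fetch, link-extraction and scoring callables are placeholders, not the paper's actual clickstream-based metric.

```python
import heapq

def crawl_best_first(seed_urls, score, fetch, extract_links, k=10, max_iters=100):
    """Minimal best-first crawl loop; 'score' stands in for any Web page
    importance metric (e.g. a clickstream-based one)."""
    # heapq is a min-heap, so scores are negated to pop the most important URL first
    frontier = [(-score(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)

    for _ in range(max_iters):
        if not frontier:
            break
        # take the K most important URLs currently in the frontier
        batch = [heapq.heappop(frontier)[1] for _ in range(min(k, len(frontier)))]
        for url in batch:
            page = fetch(url)                    # download the page
            for link in extract_links(page):     # discover outgoing links
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score(link), link))
```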

The bottleneck in the performance of any crawler is applying an appropriate Web page importance metric in order to prioritize the crawl frontier. Since we are going to employ a clickstream-based metric as a heuristic, our hypothesis is the existence of a standard under which authorized crawlers have the right to access the server log files.

In what follows, we first review the literature on parallel crawlers, focused crawlers and the existing link-based and text-based Web page importance metrics, outlining the drawbacks of each. Then, we briefly discuss our clickstream-based metric, since it has been thoroughly discussed in a companion paper. Next, we present the application of the clickstream-based metric within the architecture of a focused parallel crawler, which we call the CFP crawler.

2. PARALLEL CRAWLERS

An appropriate architecture for a parallel crawler is one in which the overlap of downloaded pages among parallel agents is low and the coverage of downloaded pages within each parallel agent's zone of responsibility is high. At the same time, the quality of the overall parallel crawler, i.e. its ability to fetch the most important pages, should not be lower than that of a centralized crawler. To achieve these goals, a measure of information exchange is needed among parallel agents [9]. Although this communication yields an inevitable overhead, a satisfactory trade-off among these objectives should be found for an optimized overall performance.
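
The three objectives are commonly quantified as follows in the parallel-crawler literature [9]; the definitions below are given as a reminder and may differ slightly from the exact formulation used in this paper. If N is the total number of pages downloaded by all agents, I the number of distinct pages among them, and U the number of pages the crawler as a whole should have downloaded, then

$$\text{overlap} = \frac{N - I}{I}, \qquad \text{coverage} = \frac{I}{U},$$

and quality is measured by how many of the top-K most important pages under the chosen importance metric appear among the downloaded pages.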

Selecting an appropriate Web partitioning function is another issue of concern in parallel crawlers. The most common partitioning schemes are the URL-hash-based, the site-hash-based and the hierarchical schemes. In the URL-hash-based scheme, pages are assigned to each parallel agent according to the hash value of each URL; under this scheme, different pages of a single Web site are crawled by different parallel agents. In the site-hash-based scheme, all pages of a Web site are assigned to one agent based on the hash value of the site name. In the hierarchical scheme, the Web is partitioned according to criteria such as geographic zone, language or the type of the URL extension [9]. Based on the definitions above, designing a parallel crawler on top of the site-hash-based partitioning function is reasonable with regard to locality retention of the link structure and balanced partition sizes.
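
A minimal sketch of the site-hash-based assignment, assuming agents are numbered 0..n-1 and that a stable hash of the host name is an acceptable stand-in for whatever hash function the paper actually uses:

```python
import hashlib
from urllib.parse import urlparse

def agent_for_url(url: str, n_agents: int) -> int:
    """Site-hash partitioning: every page of a given site maps to the same agent."""
    site = urlparse(url).netloc.lower()                     # e.g. "www.example.com"
    digest = hashlib.md5(site.encode("utf-8")).hexdigest()  # stable across runs and machines
    return int(digest, 16) % n_agents

# All pages of one site land on the same agent, preserving link locality, e.g.:
# agent_for_url("http://www.example.com/a.html", 4) == agent_for_url("http://www.example.com/b/c.html", 4)
```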

Another issue of concern in the literature of parallel crawlers is the mode of job division among parallel agents. The different modes are the firewall, cross-over and exchange modes [9]. Under the firewall mode, each parallel agent only retrieves the pages inside its own partition and ignores links pointing to the outside world. Under the cross-over mode, a parallel agent primarily downloads the pages inside its partition and, once the pages in its own section are finished, it follows inter-partition links. Under the exchange mode, parallel agents do not follow inter-partition links themselves. Instead, each parallel agent communicates with the other agents to inform the corresponding agent of the existence of inter-partition links pointing to pages inside its section. Hence, a parallel crawler based on the exchange mode has no overlap, acceptable coverage and suitable quality, at the cost of a communication overhead for quality optimization.
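
A hedged sketch of the exchange mode: when an agent discovers a link that hashes to another partition, it does not fetch the page itself but forwards the URL to the owning agent. The queue objects and the reuse of the illustrative agent_for_url helper from the previous sketch are assumptions, not the paper's actual protocol.

```python
from queue import Queue

class ExchangeAgent:
    """Parallel agent that crawls only its own partition and forwards
    inter-partition links to their owners instead of following them."""

    def __init__(self, agent_id: int, inboxes: dict, n_agents: int):
        self.agent_id = agent_id
        self.inboxes = inboxes      # agent_id -> that agent's inbound URL Queue
        self.n_agents = n_agents

    def handle_discovered_link(self, url: str) -> None:
        owner = agent_for_url(url, self.n_agents)   # site-hash partitioning (see sketch above)
        if owner == self.agent_id:
            self.inboxes[self.agent_id].put(url)    # keep it in our own frontier
        else:
            self.inboxes[owner].put(url)            # notify the responsible agent

# Example wiring for four agents sharing a set of queues:
# inboxes = {i: Queue() for i in range(4)}
# agents  = [ExchangeAgent(i, inboxes, 4) for i in range(4)]
```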

3. FOCUSED CRAWLERS

There are two different classes of crawlers, known as focused and unfocused. The purpose of unfocused crawlers is to search the entire Web to construct the index; as a result, they confront the laborious job of creating, refreshing and maintaining a database of great dimensions. A focused crawler, in contrast, limits its function to a semantic Web zone by selectively seeking out pages relevant to a pre-defined topic taxonomy and avoiding irrelevant Web regions, in an effort to eliminate irrelevant items from the search results and to keep the index at reasonable dimensions. A focused crawler's notion of limiting the crawl boundary is fascinating because of "a recognition that covering a single galaxy can be more practical and useful than trying to cover the entire universe" [8].

The user's information demand is specified in a focused crawler by importing exemplary Web documents instead of issuing queries. A mapping process is then performed by the system to highlight one or more topics in a pre-existing topic tree, which can be constructed based on human judgment [8]. The core elements of a traditional focused crawler are a classifier and a distiller. While the classifier checks the relevancy of each Web document's content to the topic taxonomy based on the naïve Bayesian algorithm, the distiller finds hub pages inside the relevant Web regions by utilizing a modified version of the HITS algorithm. Together, these two components determine the priority rule for the URLs in the priority-based queue of the crawl frontier [8], [17].

[...] considering them as an entry point to some highly authorized content; PageRank, however, has no solution for another category of dark-net pages, the unlinked pages, which have few or no incoming links [2], [16]. Such pages never achieve a high PageRank score even if they contain authoritative content. Moreover, since pages with a high number of in-links are mostly older pages that have accumulated links over their time of existence on the Web, fresh authoritative Web content is disregarded under the PageRank perspective [15].
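
Returning to the classifier component described at the start of this section, the sketch below shows one way a naïve Bayes relevance check could look; the toy training documents, the topic label and the use of scikit-learn are illustrative assumptions, not the paper's implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy exemplary documents standing in for the taxonomy's training examples
docs   = ["web crawler frontier priority queue seed url",
          "focused crawler topic taxonomy relevance classifier",
          "cooking recipe pasta tomato sauce dinner"]
labels = ["crawling", "crawling", "other"]

vectorizer = CountVectorizer()
classifier = MultinomialNB().fit(vectorizer.fit_transform(docs), labels)

def relevance(page_text: str, topic: str = "crawling") -> float:
    """Estimated probability that a fetched page belongs to the target topic."""
    x = vectorizer.transform([page_text])
    topic_index = list(classifier.classes_).index(topic)
    return float(classifier.predict_proba(x)[0, topic_index])
```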

The TimedPageRank algorithm adds a temporal dimension to PageRank in an attempt to pay heed to newly uploaded high-quality pages in the search results, by considering a function of time f(t), with 0 ≤ f(t) ≤ 1, in lieu of the damping factor d. The notion of TimedPageRank is that a Web surfer at a page i has two options: first, randomly choosing an outgoing link with probability f(ti), and second, jumping to a random page without following a link with probability 1 − f(ti). For a completely new page within a Web site, an average of the TimedPageRank of the other pages in that Web site is used [25].
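
Read literally, the surfer model above corresponds to an update of roughly the following form, assuming the random-jump mass is spread uniformly over all N pages; this is a reconstruction in the spirit of the standard PageRank recurrence and may differ in detail from the formulation in [25]:

$$\mathit{TPR}(i) \;=\; \sum_{j \in \mathrm{in}(i)} f(t_j)\,\frac{\mathit{TPR}(j)}{|\mathrm{out}(j)|} \;+\; \frac{1}{N}\sum_{j=1}^{N} \bigl(1 - f(t_j)\bigr)\,\mathit{TPR}(j)$$

where in(i) is the set of pages linking to i and |out(j)| is the number of outgoing links of page j. With a constant f(t) = d, this reduces to the usual PageRank recurrence.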

In this paper, we proposed an architecture for a focused structured parallel crawler (CFP crawler) which employs a clickstream-based Web page importance metric. In our approach, parallel agents collaborate with each other in the absence of a central coordinator in order to minimize the inevitable communication overhead. Our future work consists of further research to minimize the notification overhead, to speed up the whole process, and to run the crawler on the UTM University's Web site. We intend to determine precise values for the emphasis factor E and the two balancing factors α and β. Moreover, our research on the clickstream-based Web page importance metric is not finished, since we are trying to make the metric more robust. Since the associated examples for each node (Dc*) in the topic taxonomy tree are selected based on higher PageRank scores, instead of selecting them randomly or taking the pages with a high number of outgoing links [24], in order to evaluate our CFP crawler we will combine all second-level seed URLs into one document so as to have a highly relevant Web source, and then calculate the context relevancy of the Web pages in the result set to this document. Besides, the performance of our CFP crawler could also be evaluated using the precision and harvest rate factors.
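
As a hedged sketch of that evaluation idea: the merged seed document, the TF-IDF representation, the similarity threshold and the harvest-rate definition below are the usual choices in the focused-crawling literature and are assumptions here, not necessarily the exact procedure the authors will follow.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def context_relevancy(seed_docs, result_pages):
    """Cosine similarity of each crawled page to the merged seed document."""
    merged_seed = " ".join(seed_docs)          # combine the second-level seeds into one document
    tfidf = TfidfVectorizer()
    matrix = tfidf.fit_transform([merged_seed] + list(result_pages))
    return cosine_similarity(matrix[0:1], matrix[1:])[0].tolist()

def harvest_rate(relevancies, threshold=0.5):
    """Fraction of fetched pages judged relevant (similarity above a threshold)."""
    if not relevancies:
        return 0.0
    return sum(r >= threshold for r in relevancies) / len(relevancies)
```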

Chinese Translation

A Clickstream-based Parallel Web Crawler

1. Introduction

The World Wide Web is expanding at an unpredictable speed. As a result, search engines face challenges such as returning accurate, up-to-date results and responding to users in a timely manner. A centralized single-process crawler is the part of a search engine that traverses the Web graph and fetches URLs starting from the initial or seed URLs, keeps them in a priority-based queue and then, in an iterative manner, selects the K most important URLs for further processing according to an importance metric, following a version of the best-first algorithm. A parallel crawler, in contrast, is a multi-process crawler: after the Web is partitioned into different segments, each parallel agent is responsible for crawling one of the partitions [9]. At the other end of the spectrum, all-purpose unfocused crawlers try to cover the entire Web to build their index, while a focused crawler restricts itself to a semantic Web zone by selectively fetching pages relevant to a pre-defined topic taxonomy, so as to keep the index at a reasonable size [8], [18].

The performance bottleneck of any crawler is applying an appropriate Web page importance metric to prioritize the crawl frontier. Since we employ a clickstream-based metric as a heuristic, our hypothesis is that a standard exists under which authorized crawlers have the right to access the server log files.

We first review the literature on parallel crawlers, focused crawlers and the existing link-based and text-based Web page importance metrics, along with the drawbacks of each. Then we briefly discuss the clickstream-based metric, since it has already been discussed thoroughly in a companion paper. Next, we present the application of the clickstream-based metric within the architecture of a focused parallel crawler, which we call the CFP crawler.

2. Parallel Crawlers

An appropriate architecture for a parallel crawler is one in which the overlap of pages downloaded by different parallel agents is low and the coverage within each agent's zone of responsibility is high. At the same time, the overall quality of the parallel crawler, that is, its ability to fetch the most important pages, should be no lower than that of a centralized crawler. To achieve these goals, a certain amount of information exchange among parallel agents is needed [9]. Although this communication brings unavoidable overhead, a satisfactory trade-off among these objectives must be struck for good overall performance.

Selecting an appropriate Web partitioning function is another important issue for parallel crawlers. The dominant partitioning schemes are the URL-hash-based, the site-hash-based and the hierarchical schemes. In the URL-hash-based scheme, pages are assigned to parallel agents according to the hash value of each URL, so different pages of the same site are crawled by different agents. In the site-hash-based scheme, all pages of a Web site are assigned to one agent based on the hash value of the site name. In the hierarchical scheme, the Web is partitioned by criteria such as geographic zone, language or the type of URL extension [9]. Given these definitions, designing a parallel crawler on top of the site-hash-based partitioning function is reasonable, as it preserves the locality of the link structure and keeps the partitions balanced in size.

Another issue in parallel crawling is the mode of job division among parallel agents. The modes are the firewall, cross-over and exchange modes [9]. Under the firewall mode, each agent only crawls pages inside its own partition and ignores links pointing outside. Under the cross-over mode, an agent primarily downloads pages inside its partition and, once those are exhausted, follows inter-partition links. Under the exchange mode, agents do not follow inter-partition links themselves; instead, each agent notifies the corresponding agent of inter-partition links that point to pages inside that agent's partition. Hence, a parallel crawler based on the exchange mode has no overlap, acceptable coverage and suitable quality, at the cost of the communication overhead needed for quality optimization.

3. Focused Crawlers

There are two classes of crawlers, focused and unfocused. Unfocused crawlers aim to search the entire Web to build the index; as a result, they face the laborious job of creating, refreshing and maintaining a database of great dimensions. A focused crawler, by contrast, limits itself to a semantic Web zone by selectively fetching pages relevant to a pre-defined topic taxonomy and avoiding irrelevant Web regions, in order to eliminate irrelevant items from the search results and keep the index at a reasonable size. The focused crawler's notion of limiting the crawl boundary is fascinating because of "a recognition that covering a single galaxy can be more practical and useful than trying to cover the entire universe" [8].

In a focused crawler, the user's information demand is specified by importing exemplary Web documents rather than by issuing queries. The system then performs a mapping process to highlight one or more topics in a pre-existing topic tree, which can be constructed based on human judgment [8]. The core elements of a traditional focused crawler are a classifier and a distiller. The classifier checks the relevancy of each Web document's content to the topic taxonomy using the naïve Bayesian algorithm, while the distiller finds hub pages inside the relevant Web regions using a modified version of the HITS algorithm. Together, these two components determine the priority rule for the URLs in the priority-based crawl frontier [8], [17].

[...] treating them as entry points to some highly authorized content; PageRank, however, has no solution for another category of dark-net pages, the unlinked pages with few or no incoming links [2], [16]. Such pages never achieve a high PageRank score even if their content is authoritative. Moreover, since pages with many in-links are mostly older pages that have accumulated links over their time on the Web, fresh authoritative content is disregarded from the PageRank perspective [15].

The TimedPageRank algorithm adds a temporal dimension to PageRank in an attempt to bring newly uploaded, high-quality pages into the search results, by using a function of time f(t), with 0 ≤ f(t) ≤ 1, in place of the damping factor d.

The notion of TimedPageRank is that a Web surfer at page i has two options: first, randomly choosing an outgoing link with probability f(ti); second, jumping to a random page without following a link with probability 1 − f(ti). For a completely new page within a Web site, the average TimedPageRank of the other pages of that site is used [25].

In this paper we proposed an architecture for a focused structured parallel crawler (the CFP crawler) that employs a clickstream-based Web page importance metric. In our approach, parallel agents collaborate with each other in the absence of a central coordinator, so as to minimize the inevitable communication overhead. Our future work includes further research on minimizing the notification overhead to speed up the whole process, and running the crawler on the UTM University's Web site. We intend to determine precise values for the emphasis factor E and the two balancing factors α and β. Moreover, our research on the clickstream-based Web page importance metric is not finished, since we are trying to make the metric more robust. Since the associated examples for each node (Dc*) in the topic taxonomy tree are selected based on higher PageRank scores rather than randomly or based on a high number of outgoing links [24], to evaluate our CFP crawler we will combine all second-level seed URLs into one document so as to obtain a highly relevant Web source, and then calculate the context relevancy of the Web pages in the result set to this document. In addition, the performance of our CFP crawler could also be evaluated using the precision and harvest-rate measures.

Reposted from blog.csdn.net/BS009/article/details/130934366