Translation of Foreign Language Documents | Parallel Web Crawler Based on Clickstream

Foreign language materials

A Clickstream-based Focused Trend Parallel Web Crawler

1. INTRODUCTION

The World Wide Web is expanding at an unpredictable speed. As a result, search engines face many challenges, such as yielding accurate and up-to-date results and responding to users in a timely manner. A centralized single-process crawler is the part of a search engine that traverses the Web graph, fetching URLs starting from a set of initial or seed URLs, keeping them in a priority-based queue, and then, in an iterative manner, selecting the K most important URLs for further processing according to an importance metric, following a version of the best-first algorithm. A parallel crawler, on the other hand, is a multi-process crawler in which, after the Web is partitioned into different segments, each parallel agent is responsible for crawling one of the Web partitions [9]. At the other end of the spectrum, all-purpose unfocused crawlers attempt to search the entire Web to construct their index, while a focused crawler limits its operation to a semantic Web zone by selectively seeking out pages relevant to a pre-defined topic taxonomy, in an effort to keep the index at a reasonable size [8], [18].
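
The frontier management just described can be illustrated with a short sketch; the scoring function, the fetching routine, the link extractor, and the batch size K are placeholders supplied by the caller, not details taken from the paper.

```python
import heapq

def best_first_crawl(seed_urls, score_url, fetch, extract_links, k=10, max_pages=1000):
    """Sketch of a best-first crawl frontier driven by a priority queue."""
    # heapq is a min-heap, so scores are negated to pop the highest-scoring URLs first.
    frontier = [(-score_url(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    crawled = []

    while frontier and len(crawled) < max_pages:
        # Take the K most important URLs currently in the frontier.
        batch = [heapq.heappop(frontier)[1] for _ in range(min(k, len(frontier)))]
        for url in batch:
            page = fetch(url)                 # download the page
            crawled.append(url)
            for link in extract_links(page):  # push newly discovered links
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score_url(link), link))
    return crawled
```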

The bottleneck in the performance of any crawler is applying an appropriate Web page importance metric to prioritize the crawl frontier. Since we are going to employ a clickstream-based metric as a heuristic, our hypothesis is that a standard exists under which authorized crawlers have the right to access server log files.
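
The clickstream-based metric itself is defined in the companion paper. Purely as an illustration of the kind of signal that server log files could provide, the sketch below extracts per-page request counts from an access log in the Common Log Format; the helper name and the use of raw successful-request counts are assumptions, not the paper's metric.

```python
import re
from collections import Counter

# Common Log Format line, e.g.
# 127.0.0.1 - - [10/Oct/2023:13:55:36 -0700] "GET /page.html HTTP/1.1" 200 2326
LOG_LINE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+" (\d{3})')

def click_counts(log_lines):
    """Count successful page requests per URL path as a crude clickstream signal."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and match.group(2) == "200":
            counts[match.group(1)] += 1
    return counts
```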

In what follows, we first review the literature on parallel crawlers, focused crawlers, and the existing link-based and text-based Web page importance metrics, noting the drawbacks of each. Then we briefly discuss our clickstream-based metric, since it has been thoroughly discussed in a companion paper. Next, we present the application of the clickstream-based metric within the architecture of a focused parallel crawler, which we call the CFP crawler.

2. PARALLEL CRAWLERS

An appropriate architecture for a parallel crawler is one in which the overlap of downloaded pages among parallel agents is low, while the coverage of downloaded pages within each parallel agent's zone of responsibility is high. At the same time, the quality of the overall parallel crawler, that is, its ability to fetch the most important pages, should not be lower than that of a centralized crawler. To achieve these goals, a measure of information exchange is needed among parallel agents [9]. Although this communication yields an inevitable overhead, a satisfactory trade-off among these objectives must be found for optimized overall performance.
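
These notions of overlap and coverage can be quantified, for example, as in the sketch below; treat it as an illustration rather than the exact definitions used in [9].

```python
def overlap(downloads_per_agent):
    """Overlap: how many downloads were redundant across agents.
    downloads_per_agent: list of URL lists, one list per parallel agent."""
    total = sum(len(d) for d in downloads_per_agent)
    unique = len(set(u for d in downloads_per_agent for u in d))
    return (total - unique) / unique if unique else 0.0

def coverage(downloads_per_agent, reachable_urls):
    """Coverage: fraction of the reachable pages actually downloaded."""
    unique = set(u for d in downloads_per_agent for u in d)
    return len(unique & set(reachable_urls)) / len(reachable_urls)
```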

Selecting an appropriate Web partitioning function is another issue of concern in parallel crawlers. The most common partitioning functions are the URL-hash-based, the site-hash-based, and the hierarchical schemes. With the URL-hash-based function, pages are assigned to parallel agents according to the hash value of each URL; under this scheme, different pages of the same Web site are crawled by different parallel agents. With the site-hash-based function, all pages of a Web site are assigned to one agent based on the hash value of the site name. In the hierarchical scheme, the Web is partitioned according to criteria such as geographic zone, language, or the type of URL extension [9]. Based on these definitions, designing a parallel crawler around the site-hash-based partitioning function is reasonable with regard to preserving the locality of the link structure and balancing the size of the partitions.
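
A minimal sketch of the two hash-based partitioning functions, assuming a fixed number of parallel agents; the function names and the use of MD5 are illustrative choices, not taken from [9].

```python
import hashlib
from urllib.parse import urlparse

NUM_AGENTS = 4  # assumed number of parallel agents

def url_hash_partition(url: str) -> int:
    """URL-hash-based: pages of one site may be assigned to different agents."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_AGENTS

def site_hash_partition(url: str) -> int:
    """Site-hash-based: every page of a site goes to the same agent."""
    site = urlparse(url).netloc
    digest = hashlib.md5(site.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_AGENTS

# Both pages of example.com map to the same agent under site hashing.
print(site_hash_partition("http://example.com/a.html"),
      site_hash_partition("http://example.com/b.html"))
```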

Another issue of concern in the literature of parallel crawlers is the mode of job division among parallel agents. Three modes are distinguished: the firewall, cross-over, and exchange modes [9]. Under the firewall mode, each parallel agent retrieves only the pages inside its own partition and ignores links that point to the outside world. Under the cross-over mode, a parallel agent primarily downloads pages inside its partition, and once those are finished it follows inter-partition links. Under the exchange mode, parallel agents do not follow inter-partition links themselves; instead, each agent communicates with the others to inform the corresponding agent of inter-partition links pointing to pages inside its partition. Hence, a parallel crawler based on the exchange mode has no overlap, acceptable coverage, and suitable quality, at the cost of a communication overhead for this quality optimization.
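
As a rough sketch of the exchange-mode behaviour described above; the agent interface and the `send_to_agent` callback are assumptions introduced for illustration, not the paper's design.

```python
from collections import defaultdict

class ExchangeModeAgent:
    """Sketch of an exchange-mode agent: inter-partition links are forwarded
    to the responsible agent instead of being followed locally."""

    def __init__(self, agent_id, partition_of, send_to_agent):
        self.agent_id = agent_id
        self.partition_of = partition_of    # e.g. a site-hash partitioning function
        self.send_to_agent = send_to_agent  # callback: (owner_agent_id, urls) -> None
        self.local_frontier = []

    def handle_discovered_links(self, links):
        outbox = defaultdict(list)
        for url in links:
            owner = self.partition_of(url)
            if owner == self.agent_id:
                self.local_frontier.append(url)  # intra-partition: crawl it ourselves
            else:
                outbox[owner].append(url)        # inter-partition: notify the owner
        for owner, urls in outbox.items():
            self.send_to_agent(owner, urls)
```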

3. FOCUSED CRAWLERS

There are two different classes of crawlers, known as focused and unfocused. The purpose of unfocused crawlers is to search the entire Web to construct the index; as a result, they confront the laborious job of creating, refreshing, and maintaining a database of great dimensions. A focused crawler, in contrast, limits its operation to a semantic Web zone by selectively seeking out pages relevant to a predefined topic taxonomy and avoiding irrelevant Web regions, in an effort to eliminate irrelevant items from the search results and keep the index at a reasonable size. A focused crawler's notion of limiting the crawl boundary is fascinating because of "a recognition that covering a single galaxy can be more practical and useful than trying to cover the entire universe" [8].

In a focused crawler, the user specifies the information demand by importing exemplary Web documents instead of issuing queries. A mapping process is then performed by the system to highlight the corresponding topic(s) in a pre-existing topic tree, which can be constructed based on human judgment [8]. The core elements of a traditional focused crawler are a classifier and a distiller. While the classifier checks the relevancy of each Web document's content to the topic taxonomy based on the naïve Bayesian algorithm, the distiller finds hub pages inside the relevant Web regions by utilizing a modified version of the HITS algorithm. Together, these two components determine the priority rule for the URLs in the priority-based queue of the crawl frontier [8], [17].

PageRank copes with one category of dark-net pages by considering them as an entry point to some highly authoritative content, but it has no solution for another category, the unlinked pages with few or no incoming links [2], [16]. Such pages never achieve a high PageRank score even if they contain authoritative content. Besides, since pages with a high number of in-links are mostly older pages that have accumulated links over their time on the Web, fresh authoritative Web content is disregarded under the PageRank perspective [15].
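
As an illustration of the classifier component described above, the following is a minimal topical-relevance check using scikit-learn's multinomial naïve Bayes. The toy training documents and labels are invented for the example; the real system would train on the exemplary documents mapped to the topic taxonomy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative taxonomy: on-topic vs. off-topic example documents.
train_docs = ["web crawler frontier url fetch", "focused crawling topic taxonomy",
              "football match score league", "cooking recipe pasta sauce"]
train_labels = ["on-topic", "on-topic", "off-topic", "off-topic"]

classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(train_docs, train_labels)

# Probability that a newly fetched page belongs to the target topic.
page_text = ["distributed crawler assigns urls to parallel agents"]
on_topic_idx = list(classifier.classes_).index("on-topic")
print(classifier.predict_proba(page_text)[0][on_topic_idx])
```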

The TimedPageRank algorithm adds the temporal dimension to PageRank in an attempt to pay heed to newly uploaded high-quality pages in the search results, by considering a function of time f(t), with 0 ≤ f(t) ≤ 1, in lieu of the damping factor d. The notion of TimedPageRank is that a Web surfer at page i has two options: first, randomly choosing an outgoing link with probability f(ti); and second, jumping to a random page without following a link with probability 1 − f(ti). For a completely new page within a Web site, the average of the TimedPageRank of the other pages in the Web site is used [25].
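
The exact TimedPageRank formulation is given in [25]; the following is only a simplified sketch of the idea, replacing the constant damping factor of ordinary PageRank with a per-page, time-dependent weight f(ti).

```python
def timed_pagerank(out_links, f, iterations=50):
    """Simplified sketch: like PageRank, but the constant damping factor d is
    replaced by a per-page, time-dependent weight f[i] in [0, 1].

    out_links: dict mapping each page to the list of pages it links to
    f:         dict mapping each page i to its weight f(t_i)
    """
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for p in pages:
            targets = out_links[p] or pages  # dangling pages spread rank evenly
            # With probability f(t_p) the surfer follows an outgoing link...
            for q in targets:
                new_rank[q] += f[p] * rank[p] / len(targets)
            # ...and with probability 1 - f(t_p) jumps to a random page.
            for q in pages:
                new_rank[q] += (1 - f[p]) * rank[p] / n
        rank = new_rank
    return rank

# Toy example with hypothetical per-page freshness weights f(t_i).
graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
freshness = {"a": 0.9, "b": 0.5, "c": 0.7}
print(timed_pagerank(graph, freshness))
```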

In this paper we proposed an architecture for a focused structured parallel crawler (CFP crawler) which employs a clickstream-based Web page importance metric. Moreover, in our approach, parallel agents collaborate with each other in the absence of a central coordinator in order to minimize the inevitable communication overhead. Our future work consists of further research to minimize the notification overhead, to speed up the whole process, and to run the crawler on the UTM University's Web site. We intend to determine precise values for the emphasis factor E and the two balancing factors α and β. Furthermore, our research on the clickstream-based Web page importance metric is not finished, since we are trying to make the metric more robust. Since the associated examples for each node (Dc*) in the topic taxonomy tree are selected based on higher PageRank scores, instead of being selected randomly or chosen as pages with a high number of outgoing links [24], we will evaluate our CFP crawler by combining all second-level seed URLs into one document, so as to have a highly relevant Web source, and then calculating the contextual relevancy of the Web pages in the result set to this document. Besides, the performance of our CFP crawler could also be evaluated using the precision and harvest-rate factors.
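
As a sketch of the evaluation step described above, assuming that relevance to the merged seed document is judged by TF-IDF cosine similarity with an arbitrary threshold; both the similarity measure and the threshold are assumptions, not the paper's procedure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def harvest_rate(crawled_pages, seed_document, threshold=0.3):
    """Fraction of crawled pages judged relevant to the merged seed document,
    using TF-IDF cosine similarity as a stand-in relevance test."""
    if not crawled_pages:
        return 0.0
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([seed_document] + crawled_pages)
    similarities = cosine_similarity(matrix[0:1], matrix[1:])[0]
    relevant = sum(1 for s in similarities if s >= threshold)
    return relevant / len(crawled_pages)
```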

Chinese translation

Parallel web crawler based on clickstream

1 Introduction

With the rapid growth of the World Wide Web, search engines encounter many challenges, such as how to return accurate and up-to-date results and respond to users in a timely manner. One part of a search engine is a centralized single-process crawler. This crawler traverses the Web graph starting from the initial or seed URLs, keeps the fetched URLs in a priority-based queue, and then iteratively selects the K most important URLs for further processing according to an importance metric; in short, the strategy is a version of the best-first algorithm. A parallel crawler, by contrast, is a multi-process crawler in which the Web is divided into different partitions and each parallel agent is responsible for crawling one partition [9]. On the other hand, general-purpose unfocused crawlers build an index for the entire Web, while a focused crawler limits its crawling to a specific topic area, selectively fetching pages relevant to a predefined topic taxonomy, ultimately maintaining an index of reasonable size [8], [18].

The performance bottleneck of any crawler is applying an appropriate Web page importance metric to prioritize the crawled data. Since we are going to use a clickstream-based metric as a heuristic, our hypothesis is that a standard exists under which authorized crawlers have the right to access server log files.

We first review the literature on parallel crawlers, focused crawlers, and the existing link-based and text-based Web page importance metrics, pointing out the shortcomings of each. Then we briefly discuss the clickstream-based metric, as it has been discussed in detail in a companion paper. Next, we present the application of the clickstream-based metric within the architecture of a focused parallel crawler, which we call the CFP crawler.

2. Parallel crawlers

An appropriate architecture for a parallel crawler is one in which the rate of repeated (overlapping) crawling of the same pages by different parallel agents is low, while the coverage within each agent's partition is high. At the same time, the overall quality of the parallel crawler, that is, its ability to fetch the most important pages, should not be lower than that of a centralized crawler. To achieve these goals, parallel agents need a certain degree of information sharing during the crawling process [9]. Although such information sharing generates unavoidable overhead, a balance must be struck for optimized overall system performance.

How to choose an appropriate Web partitioning function is an important issue for parallel crawlers. The most common partitioning schemes are URL-hash-based, site-hash-based, and hierarchical. Under the URL-hash-based function, pages are assigned to parallel agents according to the hash value of each URL, so different pages of the same site may be crawled by different agents. Under the site-hash-based function, all pages of a site are assigned to one agent according to the hash value of the site name. In the hierarchical scheme, the Web is partitioned according to criteria such as geographic area, language, or the type of URL extension [9]. On the basis of the above definitions, in order to preserve the locality of the link structure and keep the partitions balanced in size, it is reasonable to design the parallel crawler around site-hash-based partitioning.

Another issue is the mode of job division that parallel agents follow when crawling. There are different modes, namely the firewall, cross-over, and exchange modes [9]. In the firewall mode, each parallel agent crawls only the pages inside its own partition and ignores links pointing outside it. In the cross-over mode, a parallel agent is mainly responsible for crawling the pages of its own partition, and only after those have been crawled does it follow inter-partition links. In the exchange mode, parallel agents do not follow inter-partition links themselves; instead, each agent informs the agent responsible for the target partition of the inter-partition links it discovers. Therefore, a parallel crawler based on the exchange mode produces no repeated page crawling, achieves acceptable coverage and suitable quality, at the cost of the communication overhead required for this quality optimization.

3. Focused crawlers

There are generally two types of crawlers: focused crawlers and unfocused crawlers. The purpose of an unfocused crawler is to collect and index information from the entire Web, so it faces the hard work of creating, refreshing, and maintaining a database of enormous size. A focused crawler, in contrast, only crawls pages relevant to its predefined topics and avoids irrelevant Web regions, striving to eliminate irrelevant items so that the index of search results can be kept at a reasonable size. The concept of a focused crawler limiting its crawl boundary is fascinating because of "a recognition that covering a single galaxy can be more practical and useful than trying to cover the entire universe" [8].

In a focused crawler, the user specifies the information demand by importing exemplary Web documents rather than issuing queries. The system then performs a mapping process to highlight the corresponding topic(s) in a pre-existing topic tree, which can be constructed according to human judgment [8]. The core elements of a traditional focused crawler are a classifier and a distiller. The classifier checks the relevance of each Web document's content to the topic taxonomy based on the naïve Bayesian algorithm, while the distiller uses an improved HITS algorithm to find hub pages inside the relevant Web regions. Together these two components determine the priority rules for the URLs in the crawl frontier [8], [17]. PageRank copes with one category of dark-net pages by treating them as entry points to some highly authoritative content, but another problem it does not address is the unlinked pages with few or no incoming links [2], [16]. Such pages never achieve high PageRank scores even though they contain authoritative content. Apart from that, pages with a large number of in-links are mainly old pages that accumulated links during their time on the Web, so fresh authoritative Web content is ignored under the PageRank perspective [15].

The TimedPageRank algorithm adds the time dimension to PageRank in an attempt to bring newly uploaded high-quality pages into the search results, by replacing the damping factor d with a function of time f(t), where 0 ≤ f(t) ≤ 1.

The concept of TimedPageRank is that a Web surfer at page i has two options: first, randomly choosing an outgoing link with probability f(ti); and second, jumping to a random page without following a link with probability 1 − f(ti). For a completely new page within a Web site, the average TimedPageRank of the other pages of that site is used [25].

In this paper, we propose an architecture for a focused structured parallel crawler (CFP crawler) that employs a clickstream-based Web page importance metric. In our approach, parallel agents cooperate with each other in the absence of a central coordinator, so as to minimize the inevitable overhead of communication among agents. Our future work includes further research to minimize this notification overhead, to speed up the whole crawling process, and to run the crawler on the UTM University's Web site. We also intend to determine precise values for the emphasis factor E and the two balancing factors α and β. In addition, our research on the clickstream-based Web page importance metric is not finished, as we are working to make the metric more robust. Since the associated examples for each node (Dc*) in the topic taxonomy tree are selected based on higher PageRank scores, rather than selected randomly or chosen as pages with a large number of outgoing links [24], we will evaluate our CFP crawler by merging all second-level seed URLs into one document, so as to have a highly relevant Web source, and then calculating the contextual relevance of the Web pages in the result set to this document. Besides, the performance of our CFP crawler can also be evaluated using factors such as precision and harvest rate.

