High-performance focused crawling on the Web is especially challenging because the level of interest in a topic varies significantly over time. These variations have a strong impact on crawling performance: a poorly tuned engine will spend a great deal of time downloading pages unrelated to its target topic, while an effective engine wastes far fewer downloads. Many important technical problems, such as content-based queries, trust-based queries, de-duplication, and relevance judgments, have been studied in the context of focused Web crawling.
Today, there are two main types of topic-driven crawlers: topical and focused. The main difference between them is when they commit to a choice: a topical crawler can choose the next page from the Web immediately, whereas a focused crawler must first download and process the pages and examine their links before it can choose the next page to visit. In this section, we discuss three approaches that can be used to design such an engine: one based on a hidden Markov model of the corpus, one on the content of a page, and one on the anchor text of its links.
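The distinction can be sketched with a small priority-queue frontier. The code below is a toy illustration, not any published system: the page data is synthetic, and `fetch_page` stands in for a real HTTP download. It shows the focused style, where a page must be downloaded and scored before its outlinks inherit a priority.

```python
import heapq

# Synthetic "Web": url -> (page text, outlinks). Invented for illustration.
PAGES = {
    "a.html": ("sports news today", ["b.html", "c.html"]),
    "b.html": ("sports scores and results", []),
    "c.html": ("cooking recipes", []),
}

def fetch_page(url):
    """Stand-in for an HTTP fetch: returns (text, outlinks)."""
    return PAGES[url]

def score(text, topic_terms):
    """Toy relevance score: fraction of topic terms present in the text."""
    words = set(text.split())
    return sum(t in words for t in topic_terms) / len(topic_terms)

def focused_crawl(seed, topic_terms, budget=10):
    """Focused style: a page is downloaded and scored first, and its
    outlinks are enqueued with a priority derived from that score."""
    frontier = [(-1.0, seed)]          # max-heap via negated priorities
    visited, order = set(), []
    while frontier and len(order) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text, outlinks = fetch_page(url)   # must wait for the download
        order.append(url)
        relevance = score(text, topic_terms)
        for link in outlinks:
            if link not in visited:
                heapq.heappush(frontier, (-relevance, link))
    return order

print(focused_crawl("a.html", ["sports", "scores"]))
```

A topical crawler, by contrast, would compute a priority for each link before downloading it, for example from the URL string or surrounding text, so the scoring step would move above the `fetch_page` call.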
Content-based focused crawling was pioneered by Soumen Chakrabarti et al., but the method we describe here is most closely related to work from the Stanford University CS Dept. led by Florian Rabe and Andreas Stolte. The main idea of the latter is that the similarity of a web page retrieved by the crawler to every topic in a collection can be estimated by comparing the text of the page with the descriptive topic text of the collection.
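A minimal sketch of this idea, using plain bag-of-words cosine similarity rather than whatever representation the original work used; the topic names and texts below are invented for illustration:

```python
import math
from collections import Counter

# Invented topic collection: each topic has a short descriptive text.
TOPICS = {
    "sports": "football basketball scores league match team",
    "cooking": "recipe ingredients oven bake flavor dish",
}

def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def topic_similarities(page_text):
    """Estimate the page's similarity to every topic in the collection."""
    page_vec = vectorize(page_text)
    return {name: cosine(page_vec, vectorize(desc))
            for name, desc in TOPICS.items()}

sims = topic_similarities("The match ended with record scores for the team")
best = max(sims, key=sims.get)   # topic the page most resembles
print(best)
```

In a real system the topic texts would be much longer (e.g. concatenated example documents per topic) and the vectors would typically be TF-IDF weighted, but the estimation step is the same comparison.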
Chakrabarti et al.'s approach is based on selecting the top-most terms of the web pages that were visited by the crawler. FDNet (the friends database) is the main component of the algorithm. The crawler first computes the set of top-most terms of the pages already visited and then uses this set to rank the pages that are about to be visited.
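The ranking step described above might look like the following sketch. This is a hedged illustration under simple assumptions, not Chakrabarti et al.'s exact formulation: the page texts are synthetic, the stopword list is minimal, and candidates are ranked by raw overlap with the top terms.

```python
from collections import Counter

# Tiny stopword list; a real crawler would use a much larger one.
STOPWORDS = {"the", "a", "and", "of", "to", "in", "on"}

def top_terms(visited_texts, k=5):
    """Most frequent non-stopword terms across already-visited pages."""
    counts = Counter()
    for text in visited_texts:
        counts.update(w for w in text.lower().split() if w not in STOPWORDS)
    return {term for term, _ in counts.most_common(k)}

def rank_candidates(candidates, terms):
    """Rank candidate pages by how many top terms they contain."""
    def overlap(text):
        return len(terms & set(text.lower().split()))
    return sorted(candidates, key=overlap, reverse=True)

# Synthetic visited pages and candidates, invented for illustration.
visited = [
    "python crawler downloads pages about machine learning",
    "machine learning models rank crawler pages",
]
terms = top_terms(visited)
ranked = rank_candidates(
    ["a page on machine learning", "a page on gardening tips"], terms)
print(ranked[0])
```

The candidate most aligned with what the crawler has already seen rises to the front of the queue, which is the intuition behind using visited-page terms to steer future visits.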