Finding Web pages and downloading their contents.

The bulk of this task is handled by two components: the crawler and the scheduler. The crawler's job is to interact with Web servers to download Web pages and other content. The scheduler determines which URLs will be crawled, in what order, and by which crawler.
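
The division of labor between scheduler and crawler can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular search engine works: the `Scheduler` class, the first-in-first-out ordering, the host-based crawler assignment, and the in-memory `FAKE_WEB` standing in for real Web servers are all hypothetical.

```python
import zlib
from collections import deque
from urllib.parse import urlsplit


class Scheduler:
    """Decides which URLs are crawled, in what order, and by which crawler."""

    def __init__(self, num_crawlers):
        self.num_crawlers = num_crawlers
        self.frontier = deque()  # URLs waiting to be crawled (first in, first out)
        self.seen = set()        # URLs already scheduled, so none is queued twice

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.frontier.append(url)

    def next_assignment(self):
        """Return (crawler_id, url), or None when the frontier is empty."""
        if not self.frontier:
            return None
        url = self.frontier.popleft()
        host = urlsplit(url).netloc
        # Assumption: every URL on a given host goes to the same crawler.
        crawler_id = zlib.crc32(host.encode("utf-8")) % self.num_crawlers
        return crawler_id, url


def crawl(scheduler, fetch):
    """Drain the frontier; fetch(url) must return (content, outgoing_links)."""
    pages = {}
    while (job := scheduler.next_assignment()) is not None:
        _crawler_id, url = job
        content, links = fetch(url)
        pages[url] = content
        for link in links:
            scheduler.add(link)  # newly discovered URLs go back to the scheduler
    return pages


# Hypothetical in-memory "Web" so the sketch runs without network access.
FAKE_WEB = {
    "http://a.example/": ("page A", ["http://b.example/", "http://a.example/x"]),
    "http://b.example/": ("page B", ["http://a.example/"]),
    "http://a.example/x": ("page X", []),
}


def fake_fetch(url):
    return FAKE_WEB.get(url, ("", []))


scheduler = Scheduler(num_crawlers=4)
scheduler.add("http://a.example/")
pages = crawl(scheduler, fake_fetch)
print(sorted(pages))
```

A production crawler would replace `fake_fetch` with real HTTP requests and add politeness delays, robots.txt handling, and error recovery, none of which are shown here.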

Storing the contents of Web documents and extracting the textual content.

The primary components at this stage are the database/repository and the parser modules. The database/repository receives the content of each URL from the crawlers and stores it. The parser modules analyze the stored documents to extract information about the text content and the hyperlinks within them. Depending on the search engine, there may be multiple parser modules to handle different types of files, including HTML, PDF, Flash, Microsoft Word, and so on.
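
As a rough sketch of this stage, the example below stores raw content in a repository and dispatches to one parser per file type. The `repository` dictionary, the `PARSERS` table, and the function names are illustrative assumptions; only HTML and plain text are covered, using Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class HtmlTextAndLinks(HTMLParser):
    """Collects visible text and href values from an HTML document."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_html(raw):
    parser = HtmlTextAndLinks()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.text_parts), parser.links


def parse_plain(raw):
    return raw.strip(), []  # plain text carries no hyperlinks


# One parser module per content type; PDF, Word, etc. would be added here.
PARSERS = {"text/html": parse_html, "text/plain": parse_plain}

repository = {}  # url -> (content_type, raw_content)


def store(url, content_type, raw):
    repository[url] = (content_type, raw)


def parse_stored(url):
    content_type, raw = repository[url]
    return PARSERS[content_type](raw)


store(
    "http://a.example/",
    "text/html",
    '<html><head><style>p{}</style></head><body><p>Hello '
    '<a href="/about">about us</a></p><script>var x;</script></body></html>',
)
text, links = parse_stored("http://a.example/")
print(text, links)
```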

Analyzing and indexing the content of documents.

The document indexer handles this stage. It analyzes the text content of each document and stores the results in a set of databases called indexes. The main focus of this phase is the on-page content of Web documents.
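
One common form such an index can take is an inverted index, which maps each term to the documents and word positions where it occurs. The sketch below assumes that structure and a deliberately naive tokenizer; the names `tokenize` and `build_index` are hypothetical, and real indexers typically add steps such as stemming and stop-word handling.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to {doc_id: [positions]} for every document it appears in."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


documents = {
    "d1": "Web crawlers download Web pages",
    "d2": "The indexer stores pages",
}
index = build_index(documents)
print(index["pages"])
```

Storing positions as well as document identifiers is a design choice: it costs space, but it lets a later stage check whether query terms appear next to each other.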

Link analysis, to uncover the relationships between Web pages.

This is the work of the link analyzer component. All of the major crawling search engines analyze the linking relationships between documents to help them determine the most relevant results for a given search query. Each search engine handles this differently, but they all have the same basic goals in mind. There may be more than one type of link analyzer in use, depending on the search engine.
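
To make the idea concrete, here is a simplified PageRank-style calculation, one well-known example of link analysis. It is offered only as an illustration of scoring pages by their inbound links, not as the method any given engine uses; the damping factor of 0.85, the iteration count, and the tiny four-page graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped score evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
best = max(ranks, key=ranks.get)
print(best)
```

In this graph, page `c` receives links from three other pages and ends up with the highest score, while `b` and `d`, which nothing links to, keep only the baseline.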

Query processing and the ranking of Web pages to deliver search results.

The query processor and the ranking/retrieval module are responsible for this important task. The query processor must determine what type of search the user is conducting, including any specialized operations that the user has invoked. The ranking/retrieval module determines the ranking order of the matching documents, retrieves information about those documents, and returns the results for presentation to the user.
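
The two responsibilities described above can be sketched as a pair of functions. Everything here is a simplifying assumption: the query syntax (quoted phrases and a `site:` filter), the term-frequency score, and the names `process_query` and `rank` are illustrative and do not describe any specific engine's ranking.

```python
import re
from urllib.parse import urlsplit


def process_query(raw_query):
    """Split a raw query into plain terms, quoted phrases, and a site: filter."""
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', raw_query)]
    remainder = re.sub(r'"[^"]*"', " ", raw_query)
    site = None
    terms = []
    for token in remainder.split():
        if token.lower().startswith("site:") and len(token) > 5:
            site = token[5:].lower()
        else:
            terms.extend(re.findall(r"[a-z0-9]+", token.lower()))
    return {"terms": terms, "phrases": phrases, "site": site}


def rank(query, documents):
    """documents maps url -> text. Returns matching URLs, best score first."""
    results = []
    for url, text in documents.items():
        lowered = text.lower()
        if query["site"] and urlsplit(url).netloc != query["site"]:
            continue  # outside the requested site
        if any(phrase not in lowered for phrase in query["phrases"]):
            continue  # a required phrase is missing
        words = re.findall(r"[a-z0-9]+", lowered)
        if any(term not in words for term in query["terms"]):
            continue  # a required term is missing
        # Assumed score: term occurrences plus one point per matched phrase.
        score = sum(words.count(t) for t in query["terms"]) + len(query["phrases"])
        results.append((score, url))
    results.sort(key=lambda item: (-item[0], item[1]))
    return [url for _score, url in results]


documents = {
    "http://a.example/1": "Search engine crawler crawler basics",
    "http://b.example/1": "A crawler downloads pages",
    "http://a.example/2": "Cooking recipes",
}
print(rank(process_query("crawler site:a.example"), documents))
```

A fuller ranking module would also combine on-page scores with link analysis and return titles and snippets for each result, which this sketch leaves out.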