Creation and types of indexes, Coordinate index, Direct...

Creating an Index and Types of Indexes

A search engine is better the more "correct" pages it shows the user in response to a query. Such pages are called relevant, that is, pertinent to the query.

To understand how a search engine manages to find the most relevant pages, you need to understand how the search engine index works.

To create an index from the downloaded web pages, the search engine performs the following steps:

1. Conversion to plain text.

First, the text of the indexed page is cleared of all non-text elements: graphics, HTML markup (tags), and so on. The result is plain text, which the indexing robot works with from then on.
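As an illustration of this step, here is a minimal sketch that strips markup with Python's standard html.parser module. The names TextExtractor and html_to_text are hypothetical, and the choice to skip the contents of script and style elements is an assumption; a real indexing robot is likely to handle many more cases.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the page text with markup removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

Tags such as img carry no text nodes, so graphics disappear on their own. Joining the pieces with a space is a simplification: a word split by inline markup would come out as two words.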

2. Word selection.

All words must be extracted from the text and then arranged alphabetically. To do this, the search engine must know what counts as a word: a sequence of letters (and of which alphabet), numbers, alphanumeric sequences, hyphenated words, and so on, as well as what does not count as a word and is skipped (spaces, punctuation marks, etc.). Each search engine has its own definition of what counts as a word in a text; there is no standard here.
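Since there is no standard, any definition of a word is a design decision. The sketch below assumes one possible rule: a word is a run of letters or digits, optionally joined by single hyphens, and everything else is skipped. The pattern WORD_RE and the function names are hypothetical.

```python
import re

# Assumed rule: letters/digits of any alphabet, with optional inner hyphens.
# [^\W_] matches a "word" character other than the underscore.
WORD_RE = re.compile(r"[^\W_]+(?:-[^\W_]+)*")


def select_words(text):
    """Return (position, word) pairs in text order, lowercased."""
    return [(pos, m.group().lower()) for pos, m in enumerate(WORD_RE.finditer(text))]


def vocabulary(text):
    """Return the distinct words of the text in alphabetical order."""
    return sorted({word for _, word in select_words(text)})
```

Keeping the position of each word next to the word itself costs little at this stage and is what a coordinate index needs later.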

3. Linguistic processing.

In most search engines, words are not indexed as they appear in the text.

Typically, when words are extracted from the texts of web pages, the search engine applies some algorithm of linguistic processing, namely the reduction of words to their initial grammatical forms, or stems (base forms). This algorithm is called machine morphology.
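To make the idea concrete, here is a deliberately crude sketch that reduces a few English word forms to a common base by stripping suffixes. The suffix list and the minimum length of three letters are assumptions chosen for illustration; this is not the morphology of any particular search engine, and production systems rely on dictionaries or more careful algorithms such as the Porter stemmer.

```python
# Toy suffix list, checked in this order; an assumption for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With this rule "indexes", "indexed" and "indexing" all map to "index", so a query for any of these forms can match the others. The rule also makes mistakes (it cuts "pages" down to "pag", for example), which is why real linguistic processing is far more elaborate.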

4. Index compilation.

The stems (base forms) of all words from all texts are collected into an index, a kind of dictionary in which the stems are ordered alphabetically and each stem is recorded together with the page it was taken from (the page number) and the place on that page where it occurred (the position of the occurrence). The stems are sorted alphabetically so that they can be looked up quickly.
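The dictionary described here can be sketched in a few lines. The example assumes that each page has already been reduced to a list of base forms in text order; the function name build_index and the use of word positions (0, 1, 2, ...) as the "place on the page" are illustrative choices.

```python
from collections import defaultdict


def build_index(pages):
    """Map each base form to a list of (page_number, position) entries.

    `pages` maps a page number to the list of base forms in text order.
    """
    index = defaultdict(list)
    for page_number in sorted(pages):
        for position, base in enumerate(pages[page_number]):
            index[base].append((page_number, position))
    # Order the dictionary alphabetically, as the text describes.
    return dict(sorted(index.items()))
```

For example, with pages = {1: ["search", "engine", "index"], 2: ["index", "page", "search"]}, the entry for "index" records that it occurs on page 1 at position 2 and on page 2 at position 0.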

In practice, to save space and speed up use of the index, its structure is optimized and made more complex in many ways. For example, the index stores the numbers of the stems (base forms) rather than the stems themselves, and the stems are stored separately; a page number is written not for every occurrence but only once for all occurrences on that page; and so on. The index is then compressed to save space, indexed again to speed up access, etc.
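Two of the optimizations mentioned above can be shown on a small scale. The sketch assumes the index is a dictionary from base form to a list of (page, position) pairs; the function name compact and the exact layout are hypothetical, and the further compression and secondary indexing are left out.

```python
def compact(index):
    """Turn {base: [(page, position), ...]} into a lexicon plus grouped postings.

    lexicon  : base form -> number (the base forms are stored separately)
    postings : number -> {page: [positions]} (each page number is written once)
    """
    lexicon = {base: number for number, base in enumerate(sorted(index))}
    postings = {}
    for base, entries in index.items():
        by_page = {}
        for page, position in entries:
            by_page.setdefault(page, []).append(position)
        postings[lexicon[base]] = by_page
    return lexicon, postings
```

The page number 1 now appears once for "index" instead of once per occurrence, and the postings refer to base forms only by number.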

But the general idea of building the index is exactly as described above.

Coordinate index

The first Internet search engines (mid-1990s) did not record the location of a word on the page. The index recorded only the list of pages on which the word appeared. This was done to save space and to get a simpler index structure, in other words, for faster access to the index.

However, this limitation made it impossible to determine the relevance of a page accurately enough when searching for word combinations. The search engine could not distinguish a compact occurrence of the query words, where they stand side by side in a single phrase, from a scattered occurrence, where, say, one query word is in the upper right corner of the page and the other is in the lower left corner.
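The limitation is easy to reproduce. The sketch below builds an index that records only page lists, as the early engines did, and answers a two-word query; the function names and the sample pages are made up for illustration.

```python
def build_page_index(pages):
    """Map each word to the sorted list of pages containing it (no positions)."""
    index = {}
    for page_number, words in pages.items():
        for word in set(words):
            index.setdefault(word, set()).add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def pages_with_all(index, query):
    """Return pages that contain every query word, anywhere on the page."""
    sets = [set(index.get(word, [])) for word in query]
    return sorted(set.intersection(*sets)) if sets else []


pages = {
    1: ["cheap", "flights", "paris"],
    2: ["flights", "delayed", "news", "weather", "sports", "paris"],
}
index = build_page_index(pages)
matches = pages_with_all(index, ["flights", "paris"])  # both pages match
```

Both pages are returned, and nothing in the index tells the engine that the words are adjacent on page 1 and far apart on page 2.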

As a result, for multi-word queries the relevance of the results was almost zero. The Rambler search engine, for example, was built this way until 1999.

With the growth in the number of multi-word queries (their share keeps growing as the number of experienced users grows) and the development of search technologies, most popular search engines switched to an index that takes into account the coordinate of the word on the page. Such an index is called a coordinate index.

Taking compact occurrences of the query words into account in a coordinate index makes it possible not only to "weigh" the relevance of a page more accurately, but also to show the most appropriate quote from the text of the page.
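One simple way to use word coordinates, shown here only as a sketch, is to look for the smallest window of text that contains every query word: the smaller the window, the more compact the occurrence, and the window itself can serve as the quote. The function names are hypothetical and the brute-force search over all combinations is an assumption made for brevity; it would be too slow for real pages.

```python
from itertools import product


def min_span(positions_per_word):
    """Smallest window (in words) covering one occurrence of every query word.

    Returns (span, start, end). Brute force over all position combinations.
    """
    best = None
    for combo in product(*positions_per_word):
        span = max(combo) - min(combo) + 1
        if best is None or span < best[0]:
            best = (span, min(combo), max(combo))
    return best


def best_quote(words, query):
    """Return (span, quote) for the most compact occurrence, or None."""
    positions = [[i for i, w in enumerate(words) if w == q] for q in query]
    if not query or not all(positions):
        return None  # some query word is missing from the page
    span, start, end = min_span(positions)
    return span, " ".join(words[start : end + 1])
```

A page where the query words sit three words apart would get a span of 3, while a page where they are at opposite ends would get a span close to the page length, so the span can feed directly into the relevance weight.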

The index is a copy of all Internet pages turned inside out. In ordinary text we go from a page to its words; in the index the search engine goes from words to pages. For this reason the search engine index is called an inverted index.

Direct index

To show quotes with the query words highlighted when pages are found, search engines store the full texts of all indexed pages. They are stored, of course, in a compressed, packed form, without HTML markup, graphics, and other "garbage", as plain text. In any case, the search engine keeps on its servers a copy of the entire part of the Internet that its search robot has retrieved.

The inverted index is not suitable for storing a text copy of the pages: restoring the word order of the text every time a quote is displayed would take too long. It is much easier to keep a second index, which in developers' jargon is called the direct index. It consists of the texts of web pages cleared of all non-text elements, compressed and packed, and amounts to a text copy of the entire Internet.
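A direct index can be pictured as a store of compressed page texts keyed by page number. The sketch below uses Python's standard zlib module for the packing; the class name DirectIndex, its methods, and the asterisk highlighting are hypothetical and only illustrate the idea.

```python
import zlib


class DirectIndex:
    """Stores the cleaned text of each page in compressed form."""

    def __init__(self):
        self._store = {}

    def add(self, page_number, text):
        self._store[page_number] = zlib.compress(text.encode("utf-8"))

    def text(self, page_number):
        """Return the stored text of the page, unpacked."""
        return zlib.decompress(self._store[page_number]).decode("utf-8")

    def quote(self, page_number, start, end, query):
        """Return words[start:end] with query words wrapped in asterisks."""
        words = self.text(page_number).split()[start:end]
        wanted = {q.lower() for q in query}
        return " ".join(f"*{w}*" if w.lower() in wanted else w for w in words)
```

Because the text is kept in its original word order, producing a quote is a matter of unpacking one page and slicing it, with no need to reassemble anything from the word-to-page index. Splitting on whitespace is a simplification: a word with punctuation attached would not be highlighted.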

For example, Google has a text copy of the entire world Internet (to the extent that its "spider" could reach it), and Yandex has a copy of the whole Runet.

It is this text copy that allows search engines not only to display relevant quotes in search results, but also to offer a "recover text of the page" function, which is convenient when the page itself is currently unavailable or has even been removed from the site.
