Organization of Search. Search Engines
The main element of the IPS structure on the Internet is the search engine. There are many different search engines, but a few stand out as the main ones, the best known and most visited. The global Internet is now dominated by Google. In Russia, or rather, on the Russian-language Internet (Runet), the search engine Yandex remains highly popular.
Why did search engines take the most important place on the Internet? Because they bring order to chaos. Sites and their pages are scattered across the Internet without any order: there is no first or last page, and no way to go to the "next" page.
When reading an ordinary book, the usual ways to find the page you need are the table of contents, references in the text, and the index. The same methods are used on the Internet; they are simply automated and carried out by special programs.
The first, most natural way to find the right page is the book's table of contents. The reader looks through the table of contents, finds the chapter they need, sees its page number, and flips the book open to that page.
On the Internet, this way of searching corresponds to directories (catalogs). In them, pages (websites) are arranged in sections, so that the user can look through the directory's table of contents step by step, select the desired category, view the sites listed under it, and then go to the desired site or page.
At first (in the mid-1990s), directories were the main way of organizing the Internet, but they gradually gave way to search engines, and there were many reasons for that.
The second familiar way of searching is a reference in the text to the relevant pages of the book, for example "for more details, see p. 254". To find the desired text, the reader opens page 254 and looks for the fragment of interest there.
On the Internet, the idea of a reader following references from page to page turned into automatic links, which the user simply clicks with the mouse. Links on the Internet are called hypertext links, or hyperlinks ("hyper" because the link takes you beyond the text, to another page).
Links are the primary, fundamental principle of the Internet, yet they are essentially an old idea: simply an automated version of a reference in a text.
Links in directories and on ordinary sites are most often placed manually: the webmaster marks a fragment of text with special markup tags and attaches the address of the corresponding page to it. Of course, links are often also placed automatically when a web page is generated.
The third way to find the page you need is an alphabetical list of important terms at the end of the book, the so-called subject index, or simply the index. The index lists the terms (keywords) that are important for the book, along with the numbers of the pages on which they occur. If the reader cannot find the desired page through the table of contents, they can guess which words are likely to occur on it and look them up in the index.
It is this idea of finding the right page by keywords in an index that became the main idea behind Internet search engines. On the Internet, both compiling the index and searching it are automated.
In fact, when a user enters a query into a search engine, they are consulting an index of the Internet: a list of all the keywords found on the Internet together with the pages on which they occur.
A search engine compiles and stores the Internet index, and finds the specified keywords in it.
Consider the main steps in compiling the index and searching it.
1. Collecting page addresses on the Internet.
To compile an index of pages, you first need to decide which pages are needed. So the first step is to make a list of pages: a set of addresses of the pages that will be indexed.
Because sites and their pages are scattered randomly across the Internet, a search engine needs to start somewhere. Typically, the developers of the search engine load into it an initial list of page addresses (taken, for example, from some directory). Then the search engine (more precisely, its component known as the search spider, or crawler) collects all hypertext links from each of these pages to other pages and adds all the addresses found in those links to the initial set of addresses.
Thus, the initial set of page addresses grows quickly thanks to links to other sites and pages, and gradually becomes very large. Today search engines crawl and index billions of web pages.
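The way the address list grows can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine is built: the three-page TOY_WEB, the example addresses, and the names LinkCollector and collect_addresses are all invented here, and the pages are read from a dictionary instead of the real network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags found on one page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def collect_addresses(seed_addresses, get_html):
    """Grow the seed list by following links; returns addresses in discovery order."""
    known = list(seed_addresses)
    seen = set(known)
    queue = deque(known)
    while queue:
        address = queue.popleft()
        html = get_html(address)
        if html is None:  # page text not available, nothing to collect
            continue
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.hrefs:
            target = urljoin(address, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                known.append(target)
                queue.append(target)
    return known


# Hypothetical three-page "web" used instead of real network access.
TOY_WEB = {
    "http://a.example/": '<a href="/about.html">About</a> <a href="http://b.example/">B</a>',
    "http://a.example/about.html": '<a href="/">Home</a>',
    "http://b.example/": "<p>No links here.</p>",
}

addresses = collect_addresses(["http://a.example/"], TOY_WEB.get)
print(addresses)
```

Starting from a single seed address, the sketch discovers the other two pages through links alone. The set of already seen addresses is what keeps the spider from visiting the same page twice when pages link back to each other.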
2. Downloading pages.
To work with the text of a page and build an index from it, the search engine must first obtain this text.
To do this, the search engine has to download the text, i.e. request the specified page from the site. The search robot goes through the list of pages compiled at the previous stage, downloads a huge amount of raw text, stores it, and passes it on to the indexing robot.
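A minimal sketch of this stage in Python might look as follows. The function names, the 10-second timeout, and the decision to simply skip unreachable pages are assumptions made for illustration; fake_fetch is a hypothetical stand-in so that the example runs without network access.

```python
import urllib.request


def fetch_over_http(address, timeout=10):
    """Request one page from its site and return its raw text."""
    with urllib.request.urlopen(address, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def download_pages(addresses, fetch=fetch_over_http):
    """Go through the address list and store the raw text of each page."""
    store = {}
    for address in addresses:
        try:
            store[address] = fetch(address)
        except OSError:
            # Unreachable pages are skipped in this sketch
            # (urllib.error.URLError is a subclass of OSError).
            continue
    return store


def fake_fetch(address):
    """Hypothetical replacement for the network, used only in this example."""
    pages = {"http://a.example/": "<h1>Hello</h1>"}
    if address not in pages:
        raise OSError("page not reachable")
    return pages[address]


raw_pages = download_pages(
    ["http://a.example/", "http://missing.example/"], fetch=fake_fetch
)
print(raw_pages)
```

Passing the fetching function as a parameter keeps the loop over the address list separate from the details of how a page is actually requested.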
3. Indexing.
To compile the index, the search engine's indexing robot must pick out all the words from all the downloaded texts and arrange them in alphabetical order, together with page numbers and various service information about each page.
To do this, the indexing robot goes through all the downloaded pages, numbers them, and removes unnecessary non-textual "garbage" (for example, HTML markup) from the page text. It then extracts the words from the text and places them in the index. Each word is stored together with information about the pages from which it was taken.
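The indexing step can be illustrated with a small Python sketch. It assumes the downloaded pages are held in a dictionary of address to raw HTML; the two sample pages and the names TextExtractor, strip_markup, and build_index are invented for this example. Real indexes also store the service information mentioned above, which is left out here.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps only the text between tags, dropping the HTML markup itself."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Note: the contents of <script> and <style> would also arrive here;
        # this sketch does not filter them out.
        self.chunks.append(data)


def strip_markup(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_index(raw_pages):
    """Return (index, page_table): word -> page numbers, number -> address."""
    page_table = {}
    postings = {}
    for number, (address, html) in enumerate(raw_pages.items(), start=1):
        page_table[number] = address  # number the pages
        words = re.findall(r"\w+", strip_markup(html).lower())
        for word in words:
            postings.setdefault(word, set()).add(number)
    # Arrange the words alphabetically, each with its sorted page numbers.
    index = {word: sorted(postings[word]) for word in sorted(postings)}
    return index, page_table


# Hypothetical downloaded pages.
raw_pages = {
    "http://a.example/": "<h1>Search engines</h1><p>Engines build an index.</p>",
    "http://b.example/": "<p>A book has an <b>index</b> too.</p>",
}

index, page_table = build_index(raw_pages)
for word, numbers in index.items():
    print(word, numbers)
```

The result is the same kind of structure as the subject index of a book: an alphabetical list of words, each followed by the numbers of the pages on which it occurs.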
All the steps described above are invisible to the user; they are carried out inside the search engine. The search itself, by contrast, is what the user sees. The user enters a query (a word or a phrase) in the search box, and the search engine returns a list of links to pages on the Internet.
When a user enters a word in the search box, the search engine consults the index, finds the entry for that word, retrieves all the page numbers associated with it, and shows the user the search results, i.e. a list of pages.
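For a single word, this lookup amounts to one dictionary access. The tiny INDEX and PAGES tables below are hypothetical data made up for this sketch, as is the function name lookup.

```python
# Hypothetical index (word -> page numbers) and page table (number -> address).
INDEX = {"book": [2], "engines": [1], "index": [1, 2]}
PAGES = {1: "http://a.example/", 2: "http://b.example/"}


def lookup(word, index, pages):
    """Find the entry for one word and return the addresses of its pages."""
    numbers = index.get(word.lower(), [])  # a word not in the index gives no pages
    return [pages[number] for number in numbers]


results = lookup("Index", INDEX, PAGES)
print(results)
```

The query word is lowercased before the lookup on the assumption that the index itself stores words in lowercase.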
The list of results usually shows the page's title, the date the page was created, its address, and a quotation from the page text with the search word highlighted. If the query contains several words, the search engine compares the lists of pages for each word and selects only those pages whose numbers are repeated, i.e. occur in the list for every word. Thus, only pages that contain all the words of the query are selected.
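Selecting the pages that occur in every list is a set intersection. The sketch below assumes a small hypothetical index and splits the query on whitespace; the name search_all_words is invented for this example.

```python
# Hypothetical index: word -> numbers of the pages that contain it.
INDEX = {
    "search": [1, 3],
    "engine": [1, 2, 3],
    "index": [2, 3],
}


def search_all_words(query, index):
    """Return the numbers of the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    page_lists = [set(index.get(word, [])) for word in words]
    # Keep only page numbers that are present in the list of each word.
    return sorted(set.intersection(*page_lists))


print(search_all_words("search engine index", INDEX))
print(search_all_words("search engine", INDEX))
print(search_all_words("search spider", INDEX))
```

If even one word of the query is missing from the index, its list of pages is empty, and so is the intersection.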
This is the very essence of index-based search, its main principle; in reality, however, search engine developers use a great many additional tricks.