Indexing problems - Informatics for economists

Indexing Problems

Initially, search engines aimed to index web pages, that is, HTML texts published on a site and served by a web server over HTTP. Later it became clear that a lot of useful information was published on the Internet as articles, price lists, documentation, manuals, and so on in various office formats. Therefore, in 2004-2007 most search engines began to index documents published on sites in MS Word, PDF, and MS Excel formats. The widespread adoption of dynamic pages in Flash format led search engines to index the text hidden inside those files as well.

However, do not count on your information being indexed in these formats, which are exotic for the Internet, because there is no guarantee that a search engine indexes them well. Whenever possible, duplicate any important text on the site in HTML format.

For example, you should always publish your price list as a regular web page, because the search engine may not reach a price list in Excel format. Even if it does, indexing, searching, and displaying it in search results will inevitably be poor, because search engines cannot parse the structure of Excel files as well as that of HTML pages.

Note that search engines generally do not index text that is generated dynamically on the user's screen by software tools such as scripts written in JavaScript.
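To illustrate the point, here is a minimal sketch of what a crawler that does not execute scripts would extract from a page. The class `StaticTextExtractor`, the function `visible_static_text`, and the sample page are hypothetical names invented for this example; real search engines use their own, far more elaborate parsers.

```python
from html.parser import HTMLParser


class StaticTextExtractor(HTMLParser):
    """Collects the text present in the raw HTML, ignoring script and style bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not part of a script or stylesheet.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_static_text(html):
    """Return the text a crawler sees when it does not run any scripts."""
    parser = StaticTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the price appears only after the script runs in a browser.
PAGE = """<html><body>
<h1>Price list</h1>
<div id="prices"></div>
<script>document.getElementById("prices").textContent = "Widget: 10 USD";</script>
</body></html>"""

print(visible_static_text(PAGE))  # prints: Price list
```

The heading is found because it is present in the HTML itself, while the price written by the script never reaches the extracted text.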

In theory, the depth and volume of indexing are unlimited, but in practice a search engine will not download millions of pages from your site in one go (even if they are there). After all, besides your site, millions of other sites are waiting in the search engine's indexing queue, so in a single pass it tries to take a reasonable number of pages from each site. On the next indexing cycle, the search engine may take some more of your pages, and so on. To avoid taking too much each time, the search engine tries not to go too deep into the links inside your site.

This means that even a site with a large number of pages should be reasonably organized. For example, there should be no pages that can be reached only through a chain of ten links.
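The effect of a limited depth and a limited number of pages per pass can be sketched with a breadth-first traversal of a site's link graph. This is only an illustration under assumed limits: the function `crawl_pass`, its parameters `max_depth` and `page_budget`, and the sample site are hypothetical, and actual search engines do not publish their limits.

```python
from collections import deque


def crawl_pass(links, start, max_depth, page_budget):
    """One breadth-first pass over a site.

    links maps each page to the pages it links to.
    Returns {page: click depth} for the pages fetched in this pass.
    """
    depth = {start: 0}
    queue = deque([start])
    fetched = {}
    while queue and len(fetched) < page_budget:
        page = queue.popleft()
        fetched[page] = depth[page]
        if depth[page] == max_depth:
            continue  # do not follow links from pages at the depth limit
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return fetched


# Hypothetical site: the item page is four clicks away from the home page.
SITE = {
    "/": ["/catalog", "/about"],
    "/catalog": ["/catalog/page2"],
    "/catalog/page2": ["/catalog/page3"],
    "/catalog/page3": ["/catalog/item42"],
}

fetched = crawl_pass(SITE, "/", max_depth=2, page_budget=10)
print(fetched)
print("/catalog/item42" in fetched)  # prints: False
```

With a depth limit of 2, the deeply nested item page is never fetched, however many passes are made. Linking to it from a page closer to the home page would bring it within reach.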

Very often, sites with a large amount of data store their pages in a database (for example, MySQL or Microsoft SQL Server). This is much more convenient for maintaining and updating the site, because a database makes it easy to add, modify, and delete information.

How do search engines handle such sites? Can they index them?

The answer is simple: if the site's pages are served from the database when a visitor follows links within the site, the search engine does not care where the pages come from. Whether a page is stored on the site as a file or generated dynamically when the link is followed makes no difference for indexing. But if a page can be obtained only by entering a query to the database, the search engine simply does not see such pages.

Thus, when creating a site, remember that a search engine indexes only what can be reached through hypertext links. Large databases whose content is accessible only through a search box are invisible to search engines. There are many such databases on the Internet, which is why people speak of the "deep Internet", which is invisible to search engines and is tens or even hundreds of times larger than the visible one.
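The difference between pages reachable by links and pages hidden behind a search box can be shown with a small sketch. It assumes a simple crawler that only follows `href` attributes of `<a>` tags and never fills in forms; the class `LinkCollector`, the function `discover`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Records hyperlinks (which a crawler can follow) and forms (which it cannot fill in)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "form":
            self.forms.append(attrs.get("action") or "")


def discover(html):
    """Return (links, forms) found in the page."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links, collector.forms


# Hypothetical page: two database-generated product pages are linked directly,
# the rest of the catalog is reachable only through the search form.
PAGE = """<html><body>
<a href="/product?id=1">Product 1</a>
<a href="/product?id=2">Product 2</a>
<form action="/search" method="get"><input name="q"><input type="submit"></form>
</body></html>"""

links, forms = discover(PAGE)
print(links)  # prints: ['/product?id=1', '/product?id=2']
print(forms)  # prints: ['/search']
```

The two product pages enter the crawl queue even though they are generated from a database, because ordinary links point to them. The form is noticed, but the crawler has no query to type into it, so everything behind `/search` stays undiscovered.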

How often does a search engine crawl the Internet?

More precisely, the question can be put as follows: how quickly do new pages appear in the search engine's index, and how often does the search engine update them afterwards?

Of course, an ideal search engine would have every page in its index as soon as it appears, and existing search engines strive for this. However, the sheer size of the Internet imposes obstacles and limitations.

Having started with a crawl once a month at the beginning of this century, Yandex and Rambler have by now reached weekly indexing. However, for some types of information (news, prices, exchange rates) updating once a week is far too slow, so the search engines have a special "fast robot" that can crawl rapidly changing sites several times a day.


How sites get onto this fast robot's list is a separate topic. The search engine has self-learning mechanisms for the "fast robot". If your site is already authoritative enough (has a high link-based rank) and also has many pages that change often, it has a fairly good chance of being noticed by the "fast robot".
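How the "fast robot" actually learns is not described here, so the following is only a hypothetical illustration of the general idea of adapting the revisit interval to how often a page changes. The function `next_interval`, the halving and doubling rule, and the limits of 6 and 168 hours are all assumptions made for this sketch, not the algorithm of Yandex, Rambler, or any other search engine.

```python
def next_interval(current_hours, changed, min_hours=6.0, max_hours=168.0):
    """Assumed rule: halve the revisit interval if the page changed since the
    last visit, double it otherwise, and keep the result within the limits."""
    proposed = current_hours / 2 if changed else current_hours * 2
    return max(min_hours, min(max_hours, proposed))


# Start from a weekly crawl (168 hours) and observe four visits:
# the page had changed on the first three visits and not on the fourth.
interval = 168.0
for changed in [True, True, True, False]:
    interval = next_interval(interval, changed)

print(interval)  # prints: 42.0
```

Under such a rule, a page that keeps changing is visited more and more often, down to several times a day, while a static page drifts back toward the weekly schedule.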
