INFORMATION RETRIEVAL SYSTEMS
After studying Chapter 9, the student must:
know
• the basic principles of information retrieval;
• universal search sites, metasearch sites, and directories on the Internet;
• the software components of a search site and how their functions differ;
• methods for searching for documents and images and for narrowing and expanding the search results;
be able to
• use search sites, their sections, and simple and advanced search;
possess
• skills in composing a search phrase and in selecting a section and a search area on a search site and in a directory.
The bodies of information that modern society needs for its development are huge and fundamentally different from the information available several decades ago. Today there are no clearly defined centers where knowledge is concentrated. Traditional sources of information (libraries, databases, archives) are perceived not as separate information nodes but as one collection of many sources. The trend toward the dispersal of information is most clearly seen in new information environments, such as global computer networks.
The dispersal of information sources not only creates opportunities to obtain the necessary information, but also causes serious problems in finding and classifying the required information resources. The global information environment, the Internet, comprises millions of public information sources on practically every possible topic. The difficulty of finding one's way in this body of information lies not so much in its huge size and variety of data formats as in the dynamic nature of the information, which requires constant updating of data about the availability and location of resources.
It is impossible to use new information environments effectively, in particular the Internet, without advanced search tools: information retrieval systems (IPS).
General principles of construction of information retrieval systems
Basic principles of information retrieval. The problem of finding a document arises in any data store. Storage systems are built on one of two models: hierarchical or hypertext. The hierarchical storage model implies a multilevel system of resources. The path to the required resource is determined from the descriptions made when the document was submitted for storage. The hypertext model links documents by means of references placed directly in the document text.
With large volumes of information, a high rate of updating, and heterogeneous requests, the shortcomings of these models become obvious. Multilevel categorization and the placement of links are performed by highly qualified specialists, so the volume of documents they can process is limited. Linked documents are confined to a specific subject area, which the compiler and the user may interpret differently. When searching for a document, the user has to look through many documents that contain only links to other resources.
Information retrieval systems are free of these shortcomings; once created, they work autonomously. The principle of interaction between an IPS and the user is that the user enters a request, the system processes it, and the user receives a list of pointers to documents that satisfy the request. The list can be sorted by relevance, that is, the degree to which a document matches the query.
The basic principle of information retrieval is that an array of pointers to information resources is created. A pointer (index) records a certain property of documents together with references to the documents that have this property. For example, an author index gives references to the works of a certain author, and a subject index selects documents that deal with certain concepts (objects). The process of creating pointers is called indexing, and the terms used for indexing are called indexing terms. In an author index, the names of the authors whose works are held in the collection serve as the indexing terms. The set of indexing terms in use is called the dictionary. The array of pointers compiled by indexing the information resources is called the index base.
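The construction of such an array of pointers can be sketched in Python. This is only a minimal illustration under stated assumptions, not a description of any particular system: the sample documents, the field names author and subjects, and the function build_index are all invented for the example.

```python
from collections import defaultdict

# Hypothetical collection of documents; identifiers and fields are invented.
DOCUMENTS = {
    1: {"author": "Ivanov", "subjects": ["indexing", "dictionaries"]},
    2: {"author": "Petrov", "subjects": ["indexing"]},
    3: {"author": "Ivanov", "subjects": ["hypertext"]},
}


def build_index(documents, terms_of):
    """Map each indexing term to the sorted list of documents that have it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for term in terms_of(doc):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Author index: the indexing terms are author names.
author_index = build_index(DOCUMENTS, lambda doc: [doc["author"]])
# Subject index: the indexing terms are the concepts a document deals with.
subject_index = build_index(DOCUMENTS, lambda doc: doc["subjects"])

# The dictionary is the set of indexing terms in use.
dictionary = sorted(set(author_index) | set(subject_index))

print(author_index["Ivanov"])     # documents by the author Ivanov
print(subject_index["indexing"])  # documents dealing with indexing
```

Here the two dictionaries returned by build_index together play the role of the index base: each entry pairs one property with references to the documents that have it.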
The index base is accessed through queries, so the user's request must be translated into the indexing language. During a search, the query is compared with the available data and the user is given a list of references to suitable resources. To make the system more efficient, the dictionary and the index should be organized in the way best suited to the search tasks of a particular subject area.
The first information retrieval systems were created in the 1970s and 1980s, and they continue to develop today.
Any information retrieval system uses an index that makes it possible to find documents relating to a certain "subject". To compile the index, the content of each document is analyzed and the subject or "subjects" that the document discusses are identified. The names of these subjects are translated into the information retrieval language (IPY), which yields the search image of the document. Indexing all information resources, that is, creating their search images, produces the index base, the main data array of the IPS.
The search process consists of matching the user's query against the available data; the query, too, is translated into the information retrieval language. After the translated query has been compared with the search images of the documents, the user receives a list of references to the documents that, in the system's judgment, correspond to the request. The search is based not on the text of the documents but on their search images, compiled in the information retrieval language. The quality of a search system therefore depends primarily on its information retrieval language. The structure of an information retrieval language includes:
1) the dictionary of index terms: the set of indexing terms;
2) the code dictionary: the set of code terms;
3) the dictionary of inputs: the set of input terms;
4) auxiliary tools of the indexing language, used together with indexing terms to broaden or narrow certain concepts;
5) rules for using the indexing language.
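One possible way these components could work together when text is translated into a search image is sketched below in Python. The sketch rests on assumptions that the text does not state: that input terms are the words a user or indexer may write, that each maps to one controlled index term, and that each index term has a short code. All names, entries, and codes are invented.

```python
# Dictionary of inputs: words that may be written, mapped to the controlled
# index term (assumed mapping, for illustration only).
INPUT_TERMS = {
    "car": "automobile",
    "cars": "automobile",
    "automobile": "automobile",
    "lorry": "truck",
    "truck": "truck",
}

# Code dictionary: each index term has a compact code (invented codes).
CODES = {"automobile": "T001", "truck": "T002"}


def to_search_image(words):
    """Translate free-text words into a search image: a set of term codes.

    Words absent from the dictionary of inputs are skipped, which stands in
    here for one simple rule of using the indexing language.
    """
    image = set()
    for word in words:
        index_term = INPUT_TERMS.get(word.lower())
        if index_term is not None:
            image.add(CODES[index_term])
    return image


document_image = to_search_image(["Cars", "and", "a", "lorry"])
query_image = to_search_image(["automobile"])

# The search compares images, not the original texts: the document matches
# when its image contains every code of the query image.
matches = query_image <= document_image
```

The query word "automobile" never occurs in the document text, yet the document is found, because both are reduced to the same code before comparison.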
To make the search more efficient, the dictionary must be controlled, i.e., organized so that the completeness and accuracy of the search are optimal. Obviously, the organization of the dictionary depends on many factors: the subject area in which the IPS will operate, the nature of the users' interests, their level of training, and so on.
To improve search results, the degree of specificity of the terms used in indexing must be determined. As a rule, two principles are applied: use of the most specific term that matches the scope and content of the concept being represented, and redundant indexing. In redundant indexing, the search image is supplemented with terms related to the main one. These may be terms linked to it by a basic relation of generalization or specification, or by an associative relation. Adding associatively linked terms to the search image increases the completeness of the search but inevitably reduces its accuracy. Another drawback of redundant indexing is the growth in the volume of search images. To address this problem, many IPSs apply redundant indexing not to documents but to queries.
Subject indexing does not rule out the use of document attributes when a search image is created. These may be attributes such as author data, publication date, publication language, and so on.
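The difference between indexing by the most specific term and redundant indexing, together with the use of document attributes, can be sketched in Python as follows. The relations between terms, the attribute names, and the function search_image are assumptions made for the example.

```python
# Hypothetical relations between terms (invented for the example).
BROADER = {"sedan": "automobile", "automobile": "vehicle"}
ASSOCIATED = {"automobile": ["road", "engine"]}


def search_image(main_terms, attributes, redundant=False):
    """Build a search image from subject terms and document attributes."""
    terms = set(main_terms)
    if redundant:
        for term in main_terms:
            # Generalization relation: add the broader term, if any.
            if term in BROADER:
                terms.add(BROADER[term])
            # Associative relation: add loosely related terms.
            terms.update(ASSOCIATED.get(term, []))
    return {"terms": terms, "attributes": dict(attributes)}


attrs = {"author": "Ivanov", "year": 2005, "language": "en"}
specific = search_image(["automobile"], attrs)
expanded = search_image(["automobile"], attrs, redundant=True)

print(len(specific["terms"]), len(expanded["terms"]))  # image size before and after
```

With redundant indexing the document would also be found by a query for "road", which raises completeness, but a user interested in roads may not want a document about automobiles, which lowers accuracy. The search image also grows from one term to four.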
The accuracy and completeness of a search depend not only on the characteristics of the IPS itself but also on how the query is composed. The ideal query can be composed by a user who is thoroughly familiar both with the subject area of interest and with the IPS being used. Such a user, however, obviously has no need of an IPS. All other users have to settle for either low search accuracy or low completeness.
There are various methods of improving search quality. The most widely used is the logical operators AND, OR, NOT. This is a fairly simple way to increase the relevance of the documents returned. Its disadvantage is poor scalability: the AND operator can narrow the search sharply, and the OR operator can expand it sharply. The accuracy and completeness of the search depend on how general the terms used in the query are. Both the most general terms (the level of information noise rises) and overly specific terms (the completeness of the search falls) can be a poor choice. Overly specific terms also carry the risk that the term may be absent from the IPS dictionary. In general, search is an iterative procedure: after the search results have been returned, the query is corrected, the search is run again with the corrected query, and so on. The procedure is shown schematically in Fig. 9.1. The query is corrected according to the number of documents received and their relevance, and the correction can be made either by the user or by the information retrieval system itself.
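The effect of the logical operators can be illustrated with set operations in Python. The index base below is invented, and representing each term's documents as a set is an assumption of the sketch rather than a statement about how any particular IPS stores its data.

```python
# Hypothetical index base: indexing term -> set of document identifiers.
INDEX = {
    "search": {1, 2, 3, 4},
    "image": {2, 4, 5},
    "archive": {4, 6},
}
ALL_DOCS = {1, 2, 3, 4, 5, 6}


def docs(term):
    """Documents indexed under a term; empty if the term is not in the dictionary."""
    return INDEX.get(term, set())


both = docs("search") & docs("image")         # AND narrows the result
either = docs("search") | docs("image")       # OR expands the result
not_archive = ALL_DOCS - docs("archive")      # NOT takes the complement
without = docs("search") & not_archive        # search AND NOT archive
missing = docs("search") & docs("thesaurus")  # term absent from the dictionary

print(sorted(both), sorted(either), sorted(without), sorted(missing))
```

The last line shows the risk of an overly specific term: because "thesaurus" is not in the dictionary, combining it with AND empties the whole result.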
Depending on the completeness and accuracy of the documents found, the user can narrow or expand the scope of the search by moving to more general or, conversely, more specific terms, and also by using related concepts. When searching by several terms, the search area can be corrected one term at a time, which allows the area to be changed fairly smoothly. It can help if the user knows of specific relevant documents that exist: if they are missing from the list of documents found, the search area should be expanded. The information retrieval system corrects the query by analyzing the documents that the user has marked as best suited to his or her needs. In this case, on the next search the system looks for documents that contain, in addition to the terms specified in the original query, terms that occur in the documents marked by the user. Search results can be improved in various ways, provided the interface of the information retrieval system offers the corresponding functions.
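Correction of the query by the system on the basis of documents marked by the user can be sketched in Python as follows. The search images, the rule that a document matches when it shares at least one term with the query, and the choice of the two most frequent new terms are all assumptions of the sketch.

```python
from collections import Counter

# Hypothetical search images: document identifier -> set of indexing terms.
IMAGES = {
    1: {"search", "index", "dictionary"},
    2: {"search", "image"},
    3: {"index", "dictionary", "thesaurus"},
    4: {"network", "archive"},
}


def search(query_terms):
    """Return documents whose search image shares at least one term with the query."""
    return sorted(d for d, terms in IMAGES.items() if terms & query_terms)


def refine(query_terms, marked_docs, extra=2):
    """Add to the query the terms most frequent in the documents the user marked."""
    counts = Counter()
    for doc_id in marked_docs:
        counts.update(IMAGES[doc_id] - query_terms)
    # Sort by descending frequency, then alphabetically, so the result is stable.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return query_terms | set(ranked[:extra])


query = {"search"}
first = search(query)                   # first iteration of the search
query = refine(query, marked_docs=[1])  # the user marks document 1 as suitable
second = search(query)                  # second iteration with the corrected query

print(first, sorted(query), second)
```

After one round of correction the query also contains "dictionary" and "index", taken from the marked document, and the second search reaches document 3, which the original query could not find.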
Fig. 9.1. Search procedure
Recently, many IPSs have acquired a hint function that, as the text of a search query is entered, takes into account queries on similar subjects entered earlier over a certain period of time.
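A very simplified version of such a hint function can be sketched in Python. The query log, the fixed current date, the 30-day period, and the use of a common prefix as the test for a "similar" query are all assumptions of the sketch; real systems may decide similarity quite differently.

```python
from datetime import datetime, timedelta

# Hypothetical log of earlier queries: (time entered, query text).
NOW = datetime(2024, 1, 31)
QUERY_LOG = [
    (datetime(2024, 1, 30), "information retrieval"),
    (datetime(2024, 1, 29), "information retrieval"),
    (datetime(2024, 1, 28), "information security"),
    (datetime(2023, 6, 1), "information theory"),  # outside the period
    (datetime(2024, 1, 30), "image search"),
]


def hints(prefix, window=timedelta(days=30), limit=3):
    """Suggest recent logged queries that start with the typed prefix."""
    counts = {}
    for entered, text in QUERY_LOG:
        if NOW - entered <= window and text.startswith(prefix.lower()):
            counts[text] = counts.get(text, 0) + 1
    # Most frequent first; ties are broken alphabetically.
    return sorted(counts, key=lambda t: (-counts[t], t))[:limit]


print(hints("info"))
print(hints("im"))
```

The query "information theory" is never suggested, because it was entered before the start of the period being considered.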
The interface of the system. An important factor that largely determines the effectiveness of a search can be the way information is presented in the program, i.e., its interface. By the form of the dialogue, the way the selection condition is specified, and the search mechanism, the software can be divided into rubric-type systems and structural-logical systems.
Systems of the first type implement the interface as hierarchical, successively expanding lists of headings that give access to thematically related groups of documents. By opening one heading after another and thus moving through the thematic hierarchy, the user specifies the subject area and increases (with each step) how accurately the documents returned match the information need. The fact that documents are assigned to particular headings in advance is compensated for by the logic of the natural scientific classification scheme, which takes the place of a user guide.
Structural-logical methods of query formulation are used with structured information bases, in which each document consists of many information fields, possibly of different types. The selection criterion is built as a logical combination of simple conditions, each of which reduces to the presence or absence in the document of particular words (proper names or the names of the concepts that define the subject of the search).
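A selection criterion of this kind can be sketched in Python as a combination of small predicate functions. The records, the field names, and the helper functions has_word, all_of, any_of, and negate are invented for the example.

```python
# Hypothetical structured documents with several fields of different types.
RECORDS = [
    {"id": 1, "title": "Search engines", "author": "Ivanov", "year": 2003},
    {"id": 2, "title": "Image archives", "author": "Petrov", "year": 2008},
    {"id": 3, "title": "Search in archives", "author": "Ivanov", "year": 2010},
]


def has_word(field, word):
    """Simple condition: the given field contains the word."""
    return lambda rec: word.lower() in str(rec[field]).lower().split()


def all_of(*conds):
    """Logical AND of several conditions."""
    return lambda rec: all(c(rec) for c in conds)


def any_of(*conds):
    """Logical OR of several conditions."""
    return lambda rec: any(c(rec) for c in conds)


def negate(cond):
    """Logical NOT of a condition."""
    return lambda rec: not cond(rec)


# title contains "search" AND author is Ivanov AND NOT title contains "engines"
narrow = all_of(
    has_word("title", "search"),
    has_word("author", "Ivanov"),
    negate(has_word("title", "engines")),
)
# title contains "image" OR author is Ivanov
wide = any_of(has_word("title", "image"), has_word("author", "Ivanov"))

selected_narrow = [rec["id"] for rec in RECORDS if narrow(rec)]
selected_wide = [rec["id"] for rec in RECORDS if wide(rec)]
print(selected_narrow, selected_wide)
```

Each simple condition tests only the presence of a word in one field; everything else in the criterion is built from AND, OR, and NOT.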
A request to the system is composed using either a menu-oriented approach or a command line. The first lets the user enter a list of terms, usually separated by spaces, and choose the type of logical connection between them; the logical connection applies to all the terms. Many IPSs allow user requests to be stored. In most systems a stored request is simply a phrase in the information retrieval language, which can be extended by adding new terms and logical operators. But this is only one way of using saved queries, known as query extension or refinement. To perform this operation, a traditional IPS stores not the request as such but the result of the search: a list of document identifiers, which is united or intersected with the list obtained by searching for documents on the new terms.
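Refinement of a saved query by means of stored result lists can be sketched in Python as follows. The index base, the dictionary saved_results, and the functions run_and_save and refine_saved are assumptions made for the example.

```python
# Hypothetical index base: indexing term -> set of document identifiers.
INDEX = {
    "search": {1, 2, 3, 4},
    "image": {2, 4, 5},
    "archive": {4, 6},
}

saved_results = {}  # name of a saved query -> identifiers of the documents found


def run_and_save(name, term):
    """Search by one term and store the result, not the query text."""
    saved_results[name] = set(INDEX.get(term, set()))
    return saved_results[name]


def refine_saved(name, term, operator):
    """Combine a saved result with the documents found for a new term."""
    found = INDEX.get(term, set())
    if operator == "AND":
        saved_results[name] = saved_results[name] & found  # intersection narrows
    elif operator == "OR":
        saved_results[name] = saved_results[name] | found  # union expands
    else:
        raise ValueError("operator must be 'AND' or 'OR'")
    return saved_results[name]


initial = run_and_save("q1", "search")
narrowed = refine_saved("q1", "image", "AND")
widened = refine_saved("q1", "archive", "OR")
print(sorted(initial), sorted(narrowed), sorted(widened))
```

Because only identifiers are stored, each refinement requires a search on the new term alone; the earlier part of the query does not have to be run again.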