Logic of the IPS. Criteria of semantic conformity, Estimation...

The logic of the IPS. The criteria of semantic conformity

As shown in Fig. 6.3, the logic of the information retrieval system (IPS) comprises the retrieval criterion, also called the criterion of semantic correspondence (a mandatory element), together with the basic (paradigmatic) and textual (syntagmatic) relations between the words of the information retrieval language (IPY). The basic and/or textual relations may be absent.

The criterion of semantic correspondence (CSC), also called the retrieval criterion, decides whether a particular document is returned in response to a query, i.e. it is the basis of the search algorithm.

The following types of CSC are distinguished [14, 24]:

• Criterion of full occurrence (full inclusion)

A document is returned if the query search image (POZ) is fully contained in the document search image (POD). In other words, the document is returned if the set of descriptors forming the POZ (Mpoz) is entirely included (Fig. 6.8) in the set of descriptors contained in the POD (Mpod), or coincides with it, i.e. Mpoz ⊆ Mpod.


Fig. 6.8. Criterion of full occurrence

• Criterion of partial occurrence

The query search image (POZ) is only partially contained in the document search image (POD), i.e. the two intersect. The document is returned if the POD and the POZ coincide partially, that is, if some of the descriptors in the document set Mpod match descriptors in the query set Mpoz (Fig. 6.9): Mpoz ∩ Mpod ≠ ∅.


Fig. 6.9. Criterion of partial occurrence

• Criterion taking into account textual and basic relations

It differs from the previous criterion in that the descriptors of the query and document search images must match not only in themselves but also in the textual relations into which their preimages (the source words) enter in the query and in the document, respectively.

• Criterion taking into account the weighting coefficients of informative words or descriptors

Each informative word in the query is assigned a weighting coefficient (Wi). The weights in the query search image are set by the user and normalized: the sum of all weights in the query must be a constant (ΣWi = const). The output is divided into levels according to the sum of the weights of those query words that also occur in the document. The number of output levels, and the weight sums (thresholds) corresponding to each of them, are determined by the system developer while tuning the system.

• Criterion taking into account syntactic relations

Grammar rules are introduced, and the syntagmas formed from descriptors (or keywords) are checked against these rules.
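
The set-based and weighted criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: search images are modelled as plain sets of descriptor strings, and the function names, the example weights, and the threshold values are all hypothetical rather than taken from the source.

```python
from typing import Dict, Sequence, Set


def full_occurrence(query: Set[str], document: Set[str]) -> bool:
    """Full occurrence: every query descriptor is present in the document."""
    return query <= document


def partial_occurrence(query: Set[str], document: Set[str]) -> bool:
    """Partial occurrence: query and document share at least one descriptor."""
    return bool(query & document)


def weighted_level(weights: Dict[str, float], document: Set[str],
                   thresholds: Sequence[float]) -> int:
    """Return the output level of a document (0 means it is not returned).

    `weights` maps each query word to its weight. `thresholds` holds one
    weight sum per output level (hypothetical values that a developer
    would choose while tuning the system).
    """
    # Sum the weights of the query words that also occur in the document.
    matched = sum(w for word, w in weights.items() if word in document)
    # The level is the number of thresholds the matched weight reaches.
    return sum(1 for t in thresholds if matched >= t)


if __name__ == "__main__":
    doc = {"search", "index", "thesaurus", "descriptor"}
    print(full_occurrence({"search", "index"}, doc))
    print(partial_occurrence({"search", "grammar"}, doc))
    query_weights = {"search": 0.5, "grammar": 0.3, "index": 0.2}
    print(weighted_level(query_weights, doc, thresholds=[0.3, 0.6, 0.9]))
```

Note that under this set model an empty query trivially satisfies the full-occurrence test; a real system would presumably reject such a query before matching.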

Evaluating the quality of information retrieval and of information retrieval systems

In the theory of information retrieval, various criteria for assessing the quality of the information retrieval system are proposed and used.

Developing a set of criteria for assessing the quality of information retrieval is a rather complex problem: the composition and quantitative characteristics of the criteria depend on the specific purpose and principles of the IPS implementation.

There are two types of estimates:

• descriptive estimates, whose values characterize a system directly, without reference to other systems;

• scale estimates, whose values establish the comparative merits of different search systems.

A descriptive estimate must take values that allow one to judge adequately the essential properties of the evaluated objects, for example to predict their behavior under specific conditions. Such a descriptive estimate is called effective.

A scale estimate must take values that order the set of evaluated objects (for example, different IPS) without contradicting our existing substantive notions of the comparative merits of these objects. Such a scale estimate is called sound.

It should be borne in mind that the same formal estimate can be regarded both as a scale estimate and as a descriptive estimate.

A substantive evaluation implies assessing the usefulness of the information to the consumer, that is, to the results of the consumer's main activity. Evaluating the effectiveness of the information obtained, in turn, implies assessing both its utility and the cost of obtaining it. Moreover, a rigorous evaluation requires isolating the share of the result that is attributable specifically to the information received, which is extremely difficult to do.

For this reason, evaluation of search effectiveness is in practice replaced by evaluation of functional efficiency.

Estimates of search systems fall into two classes: external (or functional) estimates and internal estimates.

Internal estimates are based on structural qualities of the system, such as complexity, closeness to human logic or natural language, and degree of algorithmization, and on evaluation of the IPS components, in particular the information retrieval language (IPY).

For example, C. Meadow [13] proposes to evaluate the quality of an information retrieval language by the following criteria: semantic power (expressiveness), ambiguity, compactness of the language, and the cost of choosing a term.

Semantic power is the ability of a language to identify an object, to distinguish fine features of objects, and to describe an object at varying levels of detail.

What is meant here is the potential of the IPY, not the skill with which it is used. Natural language has the greatest semantic power.

Ambiguity means that a word or syntactic unit of the thesaurus has more than one meaning (homographs) or, conversely, that a meaning has more than one symbolic representation in the vocabulary of the IPY (synonymy). In addition, words that sound the same may have different meanings (polysemy or homonymy).

Synonymy and homography can also occur in syntactic units consisting of several words.
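
As a rough illustration of how the two kinds of ambiguity could be detected, the sketch below treats a vocabulary as a list of (term, meaning) pairs. This representation, the function name `find_ambiguity`, and the sample entries are assumptions made for the example, not something described in the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_ambiguity(entries: Iterable[Tuple[str, str]]):
    """Split (term, meaning) pairs into homographs and synonym groups.

    Returns two dicts: terms that carry more than one meaning, and
    meanings that are expressed by more than one term.
    """
    meanings_by_term: Dict[str, Set[str]] = defaultdict(set)
    terms_by_meaning: Dict[str, Set[str]] = defaultdict(set)
    for term, meaning in entries:
        meanings_by_term[term].add(meaning)
        terms_by_meaning[meaning].add(term)
    # A term with several meanings is a homograph.
    homographs = {t: m for t, m in meanings_by_term.items() if len(m) > 1}
    # A meaning with several terms forms a synonym group.
    synonyms = {m: t for m, t in terms_by_meaning.items() if len(t) > 1}
    return homographs, synonyms


if __name__ == "__main__":
    vocabulary = [
        ("bank", "financial institution"),
        ("bank", "river edge"),
        ("car", "motor vehicle"),
        ("automobile", "motor vehicle"),
    ]
    homographs, synonyms = find_ambiguity(vocabulary)
    print(homographs)
    print(synonyms)
```

The same approach works for multi-word units, since a term here is just a string.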

Compactness characterizes the physical size or length of the vocabulary terms, or of the search images composed of the terms needed to represent the meaning of documents and queries.

Cost characterizes the price of the decision process of choosing terms (keywords, descriptors, or other syntactic units) to represent the meaning of a document or query.

The total cost includes the cost of teaching the use of the language, the cost of compiling and improving the vocabulary, the costs of correcting mistakes made in the choice of terms, and the time spent on indexing documents and compiling the query search image.

The estimates proposed by C. Meadow are neither independent nor mutually exclusive.

An IPY can be semantically powerful and yet ambiguous. The compactness of the words in the vocabulary does not determine the cost, i.e. the time and labor needed to choose terms.

An IPY is also characterized by the composition of its vocabulary and by the presence of a grammar. If a thesaurus is present, the IPY can be characterized by its depth, i.e. the number of levels and the types of meaningful elements or syntactic units of the thesaurus. The characteristics of the IPY are internal estimates of the information retrieval system that influence the quality of retrieval as measured by the criterion of pertinence.

External, or functional, estimates are based on comparing the system's output with the results of an ideal content-based search performed by an expert. For this purpose, information retrieval theory introduces the notions of relevance and pertinence.

Relevance is the correspondence of the output to the query, i.e. relevance characterizes the quality of the search algorithm. Pertinence is the correspondence of the output to the needs of the person (or persons) for whom the information is sought, i.e. pertinence characterizes how meaningful the query formulation is and how accurately it expresses the information need.

Nowadays the term relevance is sometimes used in a broader sense, and a distinction is drawn between relevance of the first kind (formal relevance), which corresponds to the term originally introduced in information retrieval theory [14], and relevance of the second kind, which corresponds to the notion of pertinence.

Relevance is assessed with criteria such as completeness (recall), search accuracy (precision), loss, and noise, which can be expressed in various relationships to one another.

As a criterion for assessing the quality of information retrieval, the notion of a search correlation coefficient is introduced:

where a, b, c, d are the relevance criteria (see Table 6.5).
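
The completeness, accuracy, loss, and noise criteria mentioned above are commonly computed from counts of retrieved and relevant documents. The sketch below assumes the conventional meaning of the counts, since Table 6.5 is not reproduced here: a is the number of relevant documents returned, b the number of non-relevant documents returned, and c the number of relevant documents not returned. These definitions, the function name, and the sample numbers are assumptions, and the search correlation coefficient itself is deliberately not implemented.

```python
from typing import Dict


def relevance_measures(a: int, b: int, c: int) -> Dict[str, float]:
    """Compute ratio-based relevance measures from document counts.

    Assumed meaning of the counts (not confirmed by the source):
      a: relevant documents that were returned
      b: non-relevant documents that were returned
      c: relevant documents that were not returned
    """
    if a + c == 0:
        raise ValueError("the collection contains no relevant documents")
    if a + b == 0:
        raise ValueError("the output is empty")
    return {
        "completeness": a / (a + c),  # share of relevant documents found
        "accuracy": a / (a + b),      # share of the output that is relevant
        "loss": c / (a + c),          # share of relevant documents missed
        "noise": b / (a + b),         # share of the output that is irrelevant
    }


if __name__ == "__main__":
    for name, value in relevance_measures(a=30, b=10, c=20).items():
        print(f"{name}: {value:.2f}")
```

With these definitions completeness and loss sum to one, as do accuracy and noise, which is one example of the relationships between the criteria.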

