
Information retrieval language

The concept of an information retrieval language

The use of natural language for constructing document search images (DSI) and search query images (SQI) involves significant difficulties due to the presence in the language of synonyms, homonyms and similar ambiguities in the use of natural-language terms. Therefore, at a certain stage in the development of the theory and practice of creating information retrieval systems (IRS), artificial information retrieval languages (IRL) came to be used instead of natural language.

There are various names and definitions for the specialized language that helps reflect the main content of the documents entered into an IRS.

The retrieval language is a specialized artificial language designed to express the main content of documents or information requests in order to find documents in a certain collection [14, p. 259].

The information retrieval language (IRL) is used to represent the content of the documents in an information retrieval system as a document search image (DSI) and the content of a query as a search query image (SQI), or search prescription.

Such a language was at first called an information language (IL), required to record the content of a document unambiguously; an index language, defined as a collection or system of characters (index terms) and rules for their use to express the subject matter of documents; a documentary language (langage documentaire); etc. (see a more detailed review of these terms in [14]).

In the final version of the conceptual apparatus of information retrieval theory, the accepted term is information retrieval language.

Summarizing different ideas about the information retrieval language, we can give the following definition: an information retrieval language is a formalized semantic system that ensures the transfer (recording) of the content of a document to the extent necessary for search purposes.

A document written in such a language may, in principle, not be understood by a person even if it uses natural-language words, because the use of words and expressions, and the relations between them, is standardized in a certain way in the IRL.

The purpose of the IRL is to translate the content of a document into a search prescription, or document search image (when the document is entered into the IRS), and to translate the content of a user's request into a search query image (search instruction).

The first researchers identified the following components of an IRL: an alphabet (a set of alphabetic and numeric characters); words formed from the alphabet using morphological rules (morphology); a translation dictionary, in which each word or meaningful construction of natural language is associated with a word or phrase of the IRL; and rules reflecting the relations between the words of a document as realized in the particular IRL, for example by means of textual or contextual relations or by special grammar rules (syntax).

The dictionary can consist of keywords (word combinations) or descriptors. At first some authors (for example, Meadow [13]) identified these two concepts and understood by a descriptor any word chosen for inclusion in the dictionary.

Later, however, the term descriptor was given a more complex meaning, in contrast to keywords selected in advance from the documents of the array to be searched: a descriptor is a general term, chosen by the developer of the IRL, used to represent a group of synonyms, or of words that can be considered synonymous for search purposes in a particular IRS.

Such words are combined into a conditional equivalence class generalized by the corresponding descriptor; if a word from a given class is found in the text of a document or query, it is replaced in the document search image or the search query image by that descriptor.

Thus, the descriptor is a special concept introduced and used in the theory of information retrieval. In modern information retrieval languages, the descriptor is the name of a conditional equivalence class [14, 24].
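As a minimal sketch of this replacement (the synonym classes, words and function names below are invented for illustration, not taken from any particular IRS):

```python
# Hypothetical conditional equivalence classes: each descriptor (the
# class name) generalizes a set of words treated as synonymous for search.
SYNONYM_CLASSES = {
    "AUTOMOBILE": {"car", "auto", "automobile", "motorcar"},
    "PHYSICIAN": {"doctor", "physician", "medic"},
}

def to_descriptors(words):
    """Replace each word of a document or query by its descriptor, if any."""
    index = {w: d for d, members in SYNONYM_CLASSES.items() for w in members}
    return [index.get(w.lower(), w) for w in words]

print(to_descriptors(["Car", "medic", "plan"]))
# ['AUTOMOBILE', 'PHYSICIAN', 'plan']
```

Words not covered by any class pass through unchanged, as a keyword would.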

The conditional equivalence class is formed from keywords connected by paradigmatic relations.

Paradigmatic (basic) relations are one of the types of semantic relations proposed in the theory of information retrieval and used in the development of information retrieval languages.

Paradigmatic relations are extra-textual semantic relations between the lexical units of the IRL, established on the basis of information-search needs.

The role of paradigmatic relations comes down to the following. A fundamental feature of natural language is that the same events can be described in it in different terms. As a result, different words may be used in the document search image and in the search query image while the meaning of the document and of the query is preserved.

In addition, in practice it may be necessary to find documents that deal with concepts more specific than those in the search query image. Introducing paradigmatic (basic) relations between the descriptors of the IRL helps avoid losing such documents.

In a broad sense, paradigmatic relations include relations of synonymy (identity of the signified with different signifiers), homonymy (identity of the signifier with different signifieds), and relations (paradigms of declension and conjugation) based on the same stem with different endings.

However, in a narrower sense, when developing an IRL it is sometimes suggested that paradigmatic relations be understood as only those relations between words (meanings) that are based on the existence of certain connections between the signified concepts [14, p. 433].

Different specialists suggest different ways of determining paradigmatic links: by the similarity of objects, by belonging to the same class, by associative relations of contiguity in space and time, by similarity, by contrast, and by relations of subordination ("species - genus", "cause - effect", "part - whole", etc.).

Relations in a specific IRL may be established arbitrarily, with an emphasis on improving the effectiveness of information retrieval.

In particular, E. S. Bernstein, D. G. Lahuti and V. S. Chernyavsky, in developing their IRS, used "empty - non-empty" paradigmatic relations, defined as relations existing between the words of the retrieval language regardless of context; they called these basic relations and set out their list (including a thesaurus). These relations increase the semantic power of the system and allow queries to be formulated in terms different from those used in the relevant documents.

Fixed basic relations can be specified in various ways: using the structure of the word, as in UDC; using a reference system; using descriptor trees; and the like.
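As a sketch of how a descriptor tree can serve search (the genus-species tree below is invented for illustration): expanding a query descriptor with all of its narrower descriptors lets the search also match documents indexed under more specific concepts.

```python
# Hypothetical descriptor tree: each genus descriptor maps to its
# species (narrower) descriptors.
NARROWER = {
    "VEHICLE": ["CAR", "TRUCK"],
    "CAR": ["SEDAN"],
}

def expand(descriptor):
    """Expand a query descriptor to itself plus all narrower descriptors."""
    result = [descriptor]
    for child in NARROWER.get(descriptor, []):
        result.extend(expand(child))
    return result

print(expand("VEHICLE"))
# ['VEHICLE', 'CAR', 'SEDAN', 'TRUCK']
```

A query about VEHICLE would then also retrieve documents indexed under SEDAN, which is the effect described above.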

It should be borne in mind that, in striving to improve the search results, one can increase the "noise", i.e., excess output.

In different languages the components of the IRL are used differently. The dictionary can have a rather complex structure, i.e., be a thesaurus, which can include the alphabet, words, word combinations, and more complex constructions.

The term thesaurus (from the Greek θησαυρός - treasury, wealth, treasure, stock, etc.) in the general case characterizes "the totality of scientific knowledge about the phenomena and laws of the external world and the spiritual activity of people, accumulated by the whole of human society" [14, p. 85]. This term was introduced into the modern literature on linguistics and informatics in 1956 by the Cambridge Language Study Group. The term itself existed earlier: in the Renaissance, encyclopedias were called thesauri. An overview of thesaurus definitions and of the first thesauri can be found in [14, pp. 415-432, 469-505].

A special role in the formation of the thesaurus is played by paradigmatic relations, which historically have been an element of the logic of information retrieval systems.

In mathematical linguistics and semiotics, the term thesaurus is used in a narrower sense: to characterize a particular language and its multi-level structure.

For these purposes, it is convenient to use one of the definitions of the thesaurus adopted in linguistics: "a set of sense-expressing elements of a language with given semantic relations".

This definition is illustrated in Fig. 6.4 on an elementary example of the formation of words from letters and sentences from words.


Fig. 6.4. Principles of the formation of the thesaurus structure

Of course, in real thesauri the levels have different names: keywords, descriptors, paragraphs and other linguistic and logical elements.

In this case, between the levels of the thesaurus there can be different relationships - from tree-like hierarchical to causal.

Thus, the thesaurus allows us to represent the structure of the language in the form of levels (strata) of sets of words, sentences, paragraphs, etc., the meaningful elements of each of which are formed from the semantic expressions of the preceding structural levels.
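A toy sketch of this level structure (the content is invented): an element of level k is well formed if all of its parts belong to level k - 1.

```python
# Three toy thesaurus strata: alphabet -> words -> sentences.
# Each level's elements are built from elements of the preceding level.
LEVELS = {
    1: set("abcdefghijklmnopqrstuvwxyz"),  # level 1: the alphabet
    2: {"all", "ages", "love"},            # level 2: words
    3: {"all ages love"},                  # level 3: sentences
}

def well_formed(level, element):
    """True if every part of the element belongs to the preceding level."""
    parts = element.split() if level == 3 else list(element)
    return all(p in LEVELS[level - 1] for p in parts)

print(well_formed(2, "love"), well_formed(3, "all ages love"))
# True True
```

The rules that actually generate each level (the grammars G1, G2, etc. mentioned below) are deliberately left outside this structure, just as they are left outside the thesaurus.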

The rules for forming the sense-expressing elements of the second, third and subsequent levels are not included in the thesaurus; they form the grammar of the information retrieval language (G1, G2, etc.). The thesaurus determines only the type and name of each level and the character and type of its semantic expressions.

Sometimes, instead of the term sense-expressing elements, the term syntactic units of the thesaurus is used. However, this is a less apt term: when the elements of each successive level are formed (words from letters, phrases and sentences from words, etc.), the newly formed elements acquire a new meaning, i.e., the regularity of integrity manifests itself, and this is well reflected by the term sense-expressing element.

The notion of the thesaurus was first used in the development of information retrieval languages, but later it was also applied in creating other artificial languages for modeling and for design automation.

The thesaurus makes it possible to characterize a language in terms of levels of generalization and to introduce rules for their use when indexing information. In the theory of scientific and technical information, various properties of the thesaurus are explored.

One can speak of the depth of a language's thesaurus, characterized by the number of levels, and of the types of generalization levels; using these concepts, one can compare languages, choose the one more suitable for the problem under consideration, or, having characterized the structure of the language, organize the process of its development.

In the practice of creating information retrieval systems, the best-known thesaurus dictionary is the ASTIA Thesaurus.

There are two kinds of thesauri in the SMART system:

• a thesaurus with a hierarchical structure of concepts. It makes it possible, for any concept, to find its "parents", "sons", "brothers" and many possible cross-references;

• a thesaurus in the form of a dictionary of synonyms.

The latter is used to replace meaningful words with concept numbers, each of which represents a class of word stems that are close in meaning.
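A sketch of such a synonym dictionary (the stem classes and concept numbers are invented; this is not SMART's actual implementation):

```python
# Hypothetical concept classes: each concept number stands for a class
# of word stems close in meaning.
CONCEPTS = {
    1: {"retriev", "search"},
    2: {"document", "text"},
}

def to_concept_numbers(stems):
    """Replace each known stem by its concept number; drop unknown stems."""
    index = {s: n for n, stems_of_n in CONCEPTS.items() for s in stems_of_n}
    return [index[s] for s in stems if s in index]

print(to_concept_numbers(["search", "text", "plan"]))
# [1, 2]
```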

The simplest thesauri are descriptor dictionaries, in which the descriptor is interpreted as the name of a conditional equivalence class formed on the basis of paradigmatic relations.

Thesauri have been developed in domestic branch systems of scientific and technical information (for example, in ASNTI-Geology). The term thesaurus is sometimes used in a broader sense.

For example, Yu. I. Shemakin uses thesaurus to denote a complex system for organizing information in automated control systems and for processing information in its various forms (scientific, technical, managerial, represented in documentary and factographic form).

Morphology and syntax are conveniently combined under a single term, grammar. One then says that the IRL consists of a thesaurus and a grammar, and considers the sense-expressing elements (syntactic units) of the thesaurus and the rules of the grammar.

By grammar (sometimes called syntax, which narrows the concept of grammar by excluding morphology) are understood the rules by which the semantic expressions of the language are formed. Using these rules, one can generate (form) grammatically (syntactically) correct constructions or recognize their grammatical correctness.

The simplest rules of grammar are syntagmatic (textual) relations. A syntagma is a rule of the form {a_i, r_k, b_j}, where a_i ∈ A and b_j ∈ B, A and B being interacting sets (subclasses) of the original concepts of the language, and r_k ∈ R, R being a set of relations that can be of arbitrary form.

When creating and using artificial languages for information-logical systems, the concepts of mathematical linguistics and formal languages are used, in particular the concepts of generative and recognition grammars.

By a generative grammar is understood the set of rules that make it possible to generate syntactically correct constructions from the primary elements (the vocabulary).

By a recognition grammar are understood rules that make it possible to recognize the syntactic correctness of sentences, phrases, or other fragments of the language.

On the basis of mathematical linguistics, N. Chomsky's theory of formal grammars developed. Chomsky's classes of formal grammars are considered to be the basis of the theory of formal languages.

A formal language is defined as a set (finite or infinite) of sentences (or "chains"), each of finite length and constructed by means of certain operations (rules) from a finite set of elements (symbols) that make up the alphabet of the language.

The formal grammar is defined as four sets:

G = (V_T, V_N, R, A), (6.9)

where V_T is the set of basic, or terminal, symbols; V_N is the set of auxiliary, or nonterminal, symbols; and R is the set of production rules, which have the form:

α → β, (6.10)

where β ∈ (V_T ∪ V_N)*, i.e., β is a chain of finite length of terminal and nonterminal symbols from the sets V_T and V_N; α ∈ (V_T ∪ V_N)* V_N (V_T ∪ V_N)*, i.e., α is a chain of terminal and nonterminal symbols containing at least one nonterminal symbol from V_N; and A is the set of axioms (in grammars of the combinatorial type, to which Chomsky grammars belong, A consists of a single initial symbol S, i.e., A = {S}).
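As an illustration of how such productions generate chains, here is a toy grammar, invented for this sketch (it is not the grammar (6.11) of the example below), with V_T = {a, b}, V_N = {S, A, B} and axiom S:

```python
# Toy generative grammar: rules are (left-hand side, right-hand side)
# pairs, applied by leftmost rewriting until only terminals remain.
RULES = [("S", "AB"), ("A", "a"), ("B", "b")]

def derive(chain="S", rules=RULES):
    """Repeatedly replace the first matching left-hand side in the chain."""
    while True:
        for left, right in rules:
            if left in chain:
                chain = chain.replace(left, right, 1)
                break
        else:
            return chain  # no rule applies: the chain is terminal

print(derive())  # 'ab'
```

Starting from S, the derivation is S → AB → aB → ab, a chain of the formal language generated by this grammar.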

Considering that the literature on formal grammars, as a rule, does not aspire to a meaningful interpretation of the results, but considers only the formal side of generating chains and recognizing their membership in the corresponding class of grammars, let us give a meaningful example of a generative grammar.

Suppose given:

Generative grammar Recognizing grammar

(6.11)

Applying the rules R of the left-hand side of (6.11) in the sequence indicated, we obtain

This is the formal side of the generation process. To obtain an interpreted expression, it is necessary to decipher the terminal symbols included in V_T: v1 - ALL; v2 - AGES; p - SUBMISSIVE; l - LOVE.

Then the received sentence:

v1 v2 p l - "ALL AGES ARE SUBMISSIVE TO LOVE".

If the sequence in which the rules are applied is changed, other sentences are obtained. For example, applying the rules in the sequence (1), (3), (2), (4), (5) yields "AGES ALL ARE SUBMISSIVE TO LOVE". If not all the rules are applied, for example only (1), (2), (4), (5), we get "ALL ARE SUBMISSIVE TO LOVE".

If we try to obtain a sentence like Pushkin's "To love all ages are submissive", then no matter how we change the sequence of rules, we cannot obtain this phrase. It is necessary to change the first rule: instead of S → SP, the rule S → PS must be included in R.

From the example it can be seen that the kind of chains (sentences) generated depends both on the kind of rules (the calculus) and on the sequence of their application (the algorithm).

Using the same example, it is also easy to demonstrate the close connection between grammatical correctness and the particular language (grammar) considered.

The recognition grammar for the example in question contains, as it were, inverted rules (the right-hand side of (6.11)), which should be applied in the reverse order. An example of analyzing the correctness of a sentence using the rules of the recognition grammar is shown in Fig. 6.5.


Fig. 6.5. Example of sentence analysis using the rules of a recognition grammar
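The same toy setting as above illustrates recognition as inverted generation: right-hand sides are replaced by left-hand sides, and a chain is accepted if it reduces to the axiom S (the grammar is invented for this sketch, not the grammar (6.11)):

```python
# Toy recognition grammar: the generative rules are applied "in reverse",
# replacing right-hand sides by left-hand sides until no rule applies.
RULES = [("S", "AB"), ("A", "a"), ("B", "b")]

def recognize(chain, rules=RULES):
    """True if the chain reduces back to the axiom S."""
    while True:
        for left, right in rules:
            if right in chain:
                chain = chain.replace(right, left, 1)
                break
        else:
            return chain == "S"

print(recognize("ab"), recognize("ba"))  # True False
```

The chain "ab" reduces as ab → Ab → AB → S and is accepted; "ba" gets stuck at BA and is rejected as grammatically incorrect.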

If we do not stipulate that a sentence (chain) is grammatically correct only from the point of view of the rules of the given formal language, then, using the formal grammar in its original form, we can conclude that the quoted phrase of Pushkin is grammatically incorrect in terms of the grammar rules (6.11).

Indeed, from the point of view of grammar rules for constructing business text, to which the rules (6.11) correspond, other poetic lines would also often receive the formal evaluation "grammatically incorrect". Conversely, if a grammar were built on an analysis of Pushkin's style, then in business text one would obtain sentences with inverted word order, such as "A decision I have made correct" (on the model of the line "A monument I have raised to myself not made by hands").

The concept of a formal grammar is used when creating languages for modeling literary or musical works: parodies, imitations or, as it is sometimes said, works of the appropriate style or class.

For example, the works of R. Kh. Zaripov on modeling music in the style (class) of popular Soviet songs and on modeling the process of composing poems are well known.

Similarly, it is possible to model the generation of business letters or other documents that, as a rule, have not only a formalized style but also a formal structure. In the same way, one can create modeling languages for structures and languages for automating the design of complex devices and systems of a certain type (class).

The basis of such work is formed by ideas that can be explained with the help of the classes of grammars first proposed by N. Chomsky.

The separation of grammars into classes is determined by the type of the production rules R. Depending on them, four basic, most often considered classes of grammars can be distinguished (Table 6.4). In the complete theory of formal grammars with substitution-type rules there are also intermediate classes.

In the theory of formal grammars it is shown that the following relation holds:

{A} ⊆ {CF} ⊆ {CS} ⊆ {NC}, (6.12)

where {A}, {CF}, {CS} and {NC} denote the classes of languages generated by automaton, context-free, context-sensitive and non-contracting grammars, respectively (see Table 6.4).

Sometimes it is proved that the inclusions are strict:

{A} ⊂ {CF} ⊂ {CS} ⊂ {NC}. (6.12, a)

Table 6.4

The main classes of grammars of N. Chomsky

1st class. Non-contracting (NC-grammars). Only one requirement is imposed on the production rules: the left-hand side must never contain more symbols than the right-hand side, i.e., the rules are non-contracting and do not reduce the number of symbols in the derived chains. Sometimes these grammars are called grammars of type zero, or algorithmic grammars.

2nd class. Contextual (context-sensitive) grammars. In addition to the non-contraction requirement, a restriction is imposed that at each step only one symbol may be rewritten, in context: Z1BZ2 → Z1WZ2, where B is a single nonterminal symbol and W is a nonempty chain of symbols. Sometimes the term immediate-constituent grammar (IC-grammar) is used.

3rd class. Context-free (CF-grammars). In addition to non-contraction, the rules are required to have the form B → β, i.e., the left-hand side always consists of a single auxiliary (nonterminal) symbol.

4th class. Automaton (A-grammars). A further restriction is imposed on the production rules in comparison with the third class: the nonterminal symbol must always stand at the right or at the left end of the right-hand side. If the nonterminal symbol is on the right, i.e., the rules have the form A → aB or A → a, where A, B ∈ V_N and a ∈ V_T, the automaton grammar is called right-linear; if the nonterminal symbol stands on the left (rules of the form A → Ba or A → a), it is called left-linear.
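The connection between automaton grammars and finite automata can be sketched with a toy right-linear grammar S → aA, A → aA, A → b (invented for illustration), whose derivations correspond to runs of a finite automaton:

```python
# Finite automaton for the toy right-linear grammar S -> aA, A -> aA, A -> b.
# Each rule A -> aB becomes a transition (A, a) -> B; the terminating rule
# A -> b leads to the accepting state F. The language is a...ab (a+ b).
TRANSITIONS = {("S", "a"): "A", ("A", "a"): "A", ("A", "b"): "F"}

def accepts(chain, start="S", final="F"):
    state = start
    for symbol in chain:
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return False  # no transition: chain rejected
    return state == final

print(accepts("aab"), accepts("ba"))  # True False
```

This one-to-one reading of rules as transitions is what justifies the name "automaton grammars" for the fourth class.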

Studies of the different classes of formal grammars yield results leading to the following conclusion: as the number of restrictions imposed on the production rules decreases, i.e., as we move from left to right in (6.12), the language's ability to convey sense increases, that is, the possibility of expressing by formal rules the semantic features of the text or problem situation being represented. The formal system is said to become richer. However, at the same time the number of algorithmically unsolvable problems in the language grows, i.e., the number of propositions increases whose truth or falsity cannot be proved within the formal system of the language.

Here we encounter, in essence, Gödel's problem, which in the theory of formal languages is usually discussed in the terms of that theory. Namely, the concept "the operation is defined (or is not defined) on the set of languages of a given class" is introduced, and an operation is considered defined on the set of languages of a given class if, after applying it to languages belonging to this set, a language belonging to the same set is obtained.

For example, if L1 ∈ {CF} and L2 ∈ {CF}, and if (L1 ∪ L2) ∈ {CF}, then the union operation is defined on the class of CF-languages.

Characterizing the language classes with the help of this concept, we note that in the sequence (6.12), as we move from left to right, the number of operations that can be proved to be defined on the set of languages of the class decreases.

Here, it is true, it should be stipulated that this is not entirely straightforward. It would be more accurate to say that for a large number of operations there is no proof that they are defined on the classes of context-sensitive and non-contracting languages; i.e., these proofs become more complicated or even (by virtue of Gödel's theorem) impossible to carry out by means of the theory of formal grammars.

This simplified presentation of the problem helps draw the attention of those who will develop programming languages, software systems, modeling languages, or design-automation languages to the need to take the following regularity into account: the greater the semantic expressiveness of a sign system, the greater the number of algorithmically unsolvable problems in it (that is, the fewer formal procedures in it are provable).

When we pass to the class of arbitrary grammars, in which even the non-contraction condition is not fulfilled, it becomes practically impossible to prove the admissibility of particular formal transformations by means of mathematical linguistics; therefore, in search of new tools, researchers turned to semiotic concepts. Here a formal boundary can be drawn between linguistics and semiotics.

When creating an IRL with its thesaurus and grammar, an important role is played by the concepts of semantics and pragmatics.

By semantics is meant the content (the meaning, the sense) of the formed or recognized language constructions; by pragmatics, their usefulness for a given purpose or task.

In natural language it is difficult to distinguish between the concepts denoted by the terms semantics and pragmatics; the difference can usually be explained only by comparing the terms pairwise [27]:

<semantics> ::= <content> | <meaning> | <value>;

<pragmatics> ::= <meaning> | <value> | <utility>.

Therefore, it is customary to consider these concepts through examples. Let us explain the difference between semantically and pragmatically correct language constructions using the following easily remembered examples.

Traditionally, to illustrate syntactic correctness combined with semantic nonsense, the example proposed by L. V. Shcherba is used: "Glokaya kuzdra shteko budlanula bokra i kudryachit bokryonka" (in which there is not a single meaningful word of natural language). But examples can also be found in natural speech.

The sentence "The fly slyly bared its teeth" is syntactically correct, but meaningless in natural Russian in its everyday, widespread use, i.e., from the point of view of users of the language it is semantically incorrect (for the time being we exclude the hypothetical situation of a fairy tale, in which a fly could be endowed with the indicated properties).

Another sentence, "A little girl is picking flowers in a meadow", is both syntactically and semantically correct. However, for the director of a plant (if it is indeed a meadow and not the factory lawn, and, taking the personal factor into account, if the girl is not his daughter) this sentence carries no information, i.e., it is pragmatically (from the point of view of the manager's goals) incorrect. It is quite another matter if "Ivanov (who should currently be at his workplace) is picking flowers in a meadow". Then the sentence would be pragmatically correct as well.

Let us now return to the example with the fly. The sentence cited above, though semantically incorrect, may be pragmatically correct in the hypothetical situation of a fairy tale, which is important to bear in mind when applying linguistic representations.

A more detailed explanation of this problem is aided by the types and measures of information of A. A. Denisov (see Chapter 1). For example, the received message "The plant fulfilled the plan. The plant fulfilled the plan. The plant fulfilled the plan." can be evaluated by a technical device down to sentences, if it is set up to read the text up to a period and take that as a sentence (then ΔA = 1 sentence and J = A/ΔA = 3 sentences / 1 sentence = 3); down to words, if it perceives the text up to a space as a word (then ΔA = 1 word and J = A/ΔA = 6 words / 1 word = 6); and possibly down to letters (then J = 51). A person who receives this text (or a device capable of eliminating duplication) can say that he has received one unit of information, i.e., H = J/n = 3 sentences / 3 = 1, if the scope of the concept is determined down to the number of sentences, i.e., n = 3, and if in his view the sentence is semantically correct, i.e., for him it makes sense:

C = J • H = 1 • 1 = 1.

But a person can also say that he has received 0 information, because for him the message makes no sense. In this case he applies the probabilistic pragmatic measure (1.17, c) for estimating H: he will estimate the degree of influence on his goals as p' = 0, and then H = log(1 - p') = 0.
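A minimal numeric sketch of these measures, using the values from the example (sentence-level resolution, n = 3, and p' = 0 for the pragmatic case):

```python
from math import log2

# Measures from the example: J = A / dA (syntactic quantity of information)
# and H = J / n (semantic estimate); duplicates carry no new information.
message = ["The plant fulfilled the plan."] * 3
A = len(message)   # 3 sentences received
dA = 1             # resolution: one sentence
J = A / dA         # syntactic information: 3.0
n = 3              # scope of the concept: number of sentences
H = J / n          # semantic information: 1.0 (one distinct sentence)

# Pragmatic estimate: if the message has no influence on the goals,
# p' = 0 and H = log(1 - p') = 0.
p_prime = 0
H_pragmatic = log2(1 - p_prime)

print(J, H, H_pragmatic)  # 3.0 1.0 0.0
```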

Using these representations, one can better reflect the meaning of a document or query in the document search image and the search query image, increasing the relevance of the search.
