Indexing systems - Theory of information processes and systems

Indexing Systems

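The procedure for translating a text from natural language into an information retrieval language (IPN) is called indexing. The result of this translation is the search image of a document (AML), produced when documents are entered into the information retrieval system (IPS), or the search image of a query (POS), produced when a user's request is indexed.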

The indexing problem is tied to the semantic analysis of document texts. It is complicated by the fact that the documents entered into the search arrays and the user requests are indexed at different times.

To algorithmize and automate indexing, one must solve the problem of selecting the most significant keywords, descriptors, or phrases (depending on the lexical units of the IPN) for inclusion in the AML or POS.

The significance of a term can be determined in several ways:

• statistically, i.e. from the frequency with which the term is used in the document;

• from the author's own emphasis (the views reflected in the title of the document or in the subheadings the author has set off in it);

• by using a grammar that captures the relationships between the lexical units in their context;

• by importance criteria formulated by the user, for which weighting coefficients can be assigned to descriptors when documents are indexed.

The indexing system of a particular IPS is determined mainly by the capabilities of its IPN, that is, by the lexical and syntactic means available in it. There are, however, also specific rules and recommendations, and studying them makes it possible to distinguish several types of indexing systems.

1. The first type is the system of free indexing.

With this method, words or phrases that reflect the content of the indexed document are written out of it into the AML. The AML may also include words that do not occur in the document but convey the meaning of its text more accurately with respect to the purposes for which the IPS was created. The elements written out are arranged in alphabetical order. Under this type of indexing, the ordered set of words (phrases) is the AML. A POS is formed from the text of the user's request in the same way.

Such an indexing process is essentially non-algorithmic, i.e. manual.

2. The second method, conventionally called semi-free indexing, begins just like free indexing: words and phrases are written out of the document.

The elements written out are then compared with a fixed dictionary. Those not found in it are discarded, and the remaining ones, arranged in alphabetical order, form the AML (or POS).
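The filtering step of semi-free indexing can be sketched as follows. This is a minimal illustration, not a description of any particular IPS: the function name `semi_free_index`, the case normalisation, the removal of duplicates, and the sample terms are all assumptions made for the example.

```python
def semi_free_index(candidate_terms, fixed_dictionary):
    """Keep only candidates present in the fixed dictionary, sorted alphabetically."""
    allowed = {term.lower() for term in fixed_dictionary}
    # Normalise case and drop duplicates, then filter against the dictionary.
    kept = {term.lower() for term in candidate_terms} & allowed
    return sorted(kept)


# Terms an indexer might write out of a document (hypothetical sample data).
candidates = ["Thesaurus", "retrieval", "descriptor", "weather", "retrieval"]
dictionary = ["descriptor", "indexing", "retrieval", "thesaurus"]

aml = semi_free_index(candidates, dictionary)
print(aml)  # ['descriptor', 'retrieval', 'thesaurus']
```

The term "weather" is dropped because it is absent from the fixed dictionary; the same function could be applied to the terms of a user's request to obtain a POS.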

3. The third method of indexing is based on the statistical approach.

The words (expressions) of the source text to be included in the AML are chosen by a statistical analysis of the text, in which its words are treated as symbols that carry no semantic meaning. Various statistical criteria have been proposed for this purpose, based on comparing the relative frequency of a word in the document with its relative frequency in a representative array of documents (i.e. in a representative statistical sample).

For example, the following quantitative criteria are proposed in [14]:

where F is the relative frequency of word usage in the document; R is the relative frequency of word usage in a representative array of documents.

It is easy to see that these relations rest on the idea that the informational significance of a word is determined by the discrepancy between the frequency of its use in the given document and its frequency across the whole flow of documents under consideration.
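The idea of scoring a word by the discrepancy between F and R can be sketched as follows. The two measures used here, the difference F - R and the ratio F / R, are simple illustrative choices and are not claimed to be the criteria of [14]; the function names, the `floor` parameter, and the sample texts are likewise assumptions.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def significance_scores(doc_tokens, reference_tokens, floor=1e-6):
    """Score each document word by two illustrative discrepancy measures.

    F is the relative frequency in the document, R in the reference array.
    `floor` (an assumption) stands in for R when a word is absent from the
    reference array, so that the ratio stays defined.
    """
    f = relative_frequencies(doc_tokens)
    r = relative_frequencies(reference_tokens)
    scores = {}
    for word, f_w in f.items():
        r_w = r.get(word, 0.0)
        scores[word] = {"difference": f_w - r_w, "ratio": f_w / max(r_w, floor)}
    return scores


# Hypothetical toy data: one short document and a small reference array.
document = "the index of the thesaurus index".split()
reference = "the cat and the dog and the index of the day".split()

scores = significance_scores(document, reference)
ranked = sorted(scores, key=lambda w: scores[w]["difference"], reverse=True)
print(ranked)  # ['index', 'thesaurus', 'of', 'the']
```

In this toy data the function word "the" is more frequent in the reference array than in the document, so its difference is negative and it ranks last, while "index" ranks first.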

The discrepancy can be determined in different ways:

• in the first approach, one calculates the discrepancy between the frequency of a word in the flow of documents on a given subject (a monothematic flow) and the frequency of the same word in a multi-topic document flow (a polythematic flow);

• in the second approach, one calculates the discrepancy between the frequency of a word in the flow of texts on a given subject and its frequency in a flow of texts on a distant ("opposite") subject.
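The two choices of reference flow can be compared on a toy example. This is only a sketch: the flows are hypothetical, the discrepancy is taken as a plain difference of relative frequencies, and building the polythematic flow by merging the topic flow with another one is an assumption made for brevity.

```python
def relative_frequency(word, flow):
    """Relative frequency of `word` across a flow (a list of tokenised texts)."""
    total = sum(len(text) for text in flow)
    hits = sum(text.count(word) for text in flow)
    return hits / total if total else 0.0


def discrepancy(word, topic_flow, reference_flow):
    """Frequency in the topic flow minus frequency in the reference flow."""
    return relative_frequency(word, topic_flow) - relative_frequency(word, reference_flow)


# Hypothetical toy flows.
retrieval_flow = [["index", "query", "index"], ["query", "descriptor", "index"]]
cooking_flow = [["salt", "oven", "recipe"], ["recipe", "index", "oven"]]
polythematic_flow = retrieval_flow + cooking_flow  # mixed-topic flow

# First approach: monothematic flow against a polythematic flow.
vs_polythematic = discrepancy("index", retrieval_flow, polythematic_flow)
# Second approach: monothematic flow against an "opposite" subject.
vs_opposite = discrepancy("index", retrieval_flow, cooking_flow)

print(round(vs_polythematic, 3), round(vs_opposite, 3))
```

In this example the contrast with the distant subject is larger than the contrast with the mixed flow, because the mixed flow here already contains the texts of the topic itself.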

Statistical indexing can be algorithmized and automated, and tools for the automated statistical analysis of texts now exist.

However, the method has not found independent practical application in IPS. It is used as an auxiliary tool in combination with semantic analysis of document texts.

4. The fourth type comprises indexing systems controlled by a given dictionary (thesaurus).

The indexing algorithm reduces to comparing each word of the text with the dictionary by its stem; the words that match are written into the AML.

In some systems the dictionary serves as an aid to the specialist who indexes the text. UDC is an example of such a system.

In others, the dictionary is an element of the indexing algorithm: a word that occurs both in the text and in the dictionary is written into the AML. In a descriptor-based IPN, it is not the word itself that is written into the AML (POS) but the corresponding descriptor.
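Dictionary-controlled indexing with descriptor substitution can be sketched as follows. The suffix-stripping function `crude_stem` is a deliberately simple stand-in for real stemming, and the thesaurus entries, descriptor names, and sample sentence are hypothetical.

```python
import re

SUFFIXES = ("ing", "es", "ed", "s")  # toy suffix list, longest first


def crude_stem(word):
    """Toy stemmer: lower-case and strip one common English suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def controlled_index(text, thesaurus):
    """Return the sorted descriptors whose dictionary terms occur in `text`.

    `thesaurus` maps dictionary terms to descriptors; matching is done on stems.
    """
    stem_to_descriptor = {crude_stem(term): desc for term, desc in thesaurus.items()}
    words = re.findall(r"[A-Za-z]+", text)
    found = {stem_to_descriptor[s] for s in map(crude_stem, words) if s in stem_to_descriptor}
    return sorted(found)


thesaurus = {"index": "INDEXING", "descriptor": "DESCRIPTOR", "request": "QUERY"}
text = "Indexing assigns descriptors to documents and to user requests."

print(controlled_index(text, thesaurus))  # ['DESCRIPTOR', 'INDEXING', 'QUERY']
```

Words such as "documents" and "user" have no counterpart in the dictionary and therefore leave no trace in the result, which is what distinguishes this type from free indexing.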

A promising approach is to index documents using specially designed hierarchical classifications that reflect the purposes for which documents are searched for and used.

Such classifiers can serve as the IPN in information systems that provide normative and methodological support for management: a hierarchical classifier that brings together the normative and methodological documents is developed from the structure of the goals (main lines of activity) and functions of the enterprise.

A hierarchical classifier used as the IPN can also be the basis of a system for the selective distribution of information (IRI): in that case, a classifier of the needs of the categories of staff who use the IRI system is developed.

