Documentary Information Systems on the Computer...

Computer-based documentary IPS

In the 1950s and 1960s, documentary information retrieval systems (DIPS) were actively developed for various purposes. Although the first DIPS were built on technology that is now outdated, the theoretical ideas and design principles behind them remain useful for developing and comparing modern DIPS. We therefore characterize them briefly (Table 6.7).

Table 6.7

Examples of computer-based DIPS



Creation date and purpose

Brief description

Uniterm (or the uniterm system)

Proposed by Taube in 1951.

Its subject area is chemistry and chemical technology

The information retrieval language of the system (whose alphabet is the 26 Latin letters) consisted of specialized keywords, called uniterms, that denote the concepts of the subject area.

A uniterm is a keyword (usually a single word) that could be supplemented by a cross-reference or an explanatory note to eliminate synonymy, polysemy, and homonymy. Uniterms included proper names, geographical names, company names, and special terms (chemical terms in Taube's version).

The first version of the system had no dictionary. Later versions introduced an analysis of set phrases against the dictionary, in which keywords already present in the dictionary could not be entered again.

The morphological rules of the information retrieval language followed the word-formation rules of English. The language had no syntactic means. The indexing system was of the free-indexing type: to translate a document into the information retrieval language, the words of the indexed document were replaced with keywords one by one.

The criterion of semantic correspondence (KCC) is based on occurrence.

Uniterm became synonymous with the simplest grammar-free IPS built on specialized terms. Such systems suit subject areas in which the search image of a document (POD) and the search image of a query (POS) can be composed of the special terms of that area (as is the case, for example, in chemistry, radio engineering, and areas of new specialized technology).
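A grammar-free search of this kind can be sketched in a few lines. The sketch below assumes that each document is already indexed by a list of uniterms and that a document is issued when every query uniterm occurs in its search image; the function names and the sample documents are hypothetical and are not taken from Taube's system.

```python
from collections import defaultdict


def build_index(documents):
    """Map each uniterm to the set of document numbers indexed by it."""
    index = defaultdict(set)
    for doc_id, uniterms in documents.items():
        for term in uniterms:
            index[term.lower()].add(doc_id)
    return index


def search_uniterms(index, query_terms):
    """Return documents whose search image contains every query uniterm."""
    postings = [index.get(term.lower(), set()) for term in query_terms]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical search images of three documents on chemistry.
documents = {
    1: ["benzene", "oxidation"],
    2: ["benzene", "catalyst"],
    3: ["oxidation", "catalyst", "benzene"],
}
index = build_index(documents)
print(sorted(search_uniterms(index, ["benzene", "oxidation"])))  # [1, 3]
```

Because there is no grammar, the order of the uniterms in the query and in the document plays no role; only their joint occurrence matters.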

Empty-Nonempty (versions PNP-2 and PNP-4)

Developed at Informelektro. The system's subject area was electrical engineering.

Service modes: selective dissemination of information (IRI) and retrospective search.

The alphabet of the original version of the information retrieval language consisted of the 10 Arabic numerals, and the morphological rules for constructing descriptors were the rules for forming decimal numbers from digits. The main element of the language was a pair of dictionaries, Russian-descriptor and English-descriptor, which contained single words of the natural languages and, as an exception, word combinations. The system provided algorithmic recognition of homonymy.

The system was designed to search and process secondary documents (abstracts, bibliographic descriptions, annotations) written in Russian and English.

Indexing is a word-by-word translation from Russian and English into the language of the system.

The KCC used is "occurrence with basic relations taken into account": a document is issued if, for each query descriptor, its search image contains either that descriptor or a descriptor linked to it by a basic relation.

To implement these relations, PNP-2 formulates the KCC in terms of the "emptiness" and "nonemptiness" of two sets (which gave this DIPS its name):

M1, the set of query descriptors that match none of the descriptors of the document being compared (neither coinciding with them nor linked to them by any basic relation);

M2, the set of query descriptors that are linked by inverse relations to document descriptors.

Each of the sets is associated with a parameter m.

For any POD-POS pair, one can form combinations of binary numbers, each of which characterizes the degree of semantic correspondence between the document and the query. From these, the combinations presumed to contain more relevant than irrelevant documents are selected, and echelons are formed so that the probability of issuance is higher in the first echelon than in the last.

In PNP-2, a document is either issued in one of two echelons, "Yes" and "Maybe", or not issued. In PNP-4, four sets are considered (i = 1, 2, 3, 4), and their combinations determine four echelons of issuance.
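The idea of the two sets can be illustrated with a minimal sketch. Several details here are assumptions rather than facts about PNP-2: the relations are given as plain dictionaries, a descriptor that matches only through an inverse relation is placed in M2 and not in M1, each parameter is taken to be 0 for an empty set and 1 for a nonempty one, and the mapping of the binary combinations to the echelons is chosen purely for illustration.

```python
def empty_nonempty_sets(query, document, basic, inverse):
    """Build the sets M1 and M2 for one query/document pair.

    basic and inverse map a descriptor to the set of descriptors linked
    to it by basic or inverse relations (hypothetical representation).
    """
    doc = set(document)
    m1, m2 = set(), set()
    for descriptor in query:
        # Direct occurrence or occurrence through a basic relation.
        if descriptor in doc or basic.get(descriptor, set()) & doc:
            continue
        if inverse.get(descriptor, set()) & doc:
            m2.add(descriptor)   # matched only through an inverse relation
        else:
            m1.add(descriptor)   # not matched at all
    return m1, m2


def echelon(m1, m2):
    """Map the emptiness of (M1, M2) to an echelon (assumed mapping)."""
    code = (int(bool(m1)), int(bool(m2)))  # 0 = empty, 1 = nonempty
    return {(0, 0): "Yes", (0, 1): "Maybe"}.get(code, "Not issued")


# Hypothetical relations between descriptors from electrical engineering.
basic = {"motor": {"induction motor"}}
inverse = {"induction motor": {"motor"}}

print(echelon(*empty_nonempty_sets({"motor", "cable"},
                                   {"induction motor", "cable"},
                                   basic, inverse)))  # Yes
```

With four sets, as in PNP-4, the same scheme would yield a four-bit code from which four echelons are selected.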

System Crystal

Designed for light industry. It is intended for storage of secondary documents

The information array of the system is divided into 8 thematic sub-arrays, which are assigned numbers included in the code of the input documents.

Service modes: selective dissemination of information (IRI), differentiated service for management (DOR), and retrospective search.

The KCC belongs to the type of criteria based on weighting factors. Output is echeloned in three echelons, determined by the total weight of the terms.

The information retrieval language provides 4 role indicators.
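A weight-based criterion with echeloned output might look as follows. This is only a sketch: the term weights, the three thresholds, and the function names are hypothetical, since the actual values used in the Crystal system are not given here.

```python
def total_weight(query_weights, document_terms):
    """Sum the weights of the query terms that occur in the document."""
    return sum(w for term, w in query_weights.items() if term in document_terms)


def echelon_by_weight(weight, thresholds=(8, 5, 3)):
    """Return echelon 1, 2 or 3, or None if the document is not issued.

    The thresholds are assumed values, listed from the first echelon down.
    """
    for number, threshold in enumerate(thresholds, start=1):
        if weight >= threshold:
            return number
    return None


# Hypothetical query for light industry with integer term weights.
query = {"cotton": 5, "dyeing": 3, "loom": 2}
print(echelon_by_weight(total_weight(query, {"cotton", "dyeing"})))  # 1
```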

SYNTOL system (SYNTOL = SYNTagmatic Organization Language, i.e. a language with syntagmatic organization)

The information retrieval language SYNTOL was created in 1960-1962 by J.-C. Gardin et al. (National Center for Scientific Research of France and the Computing Center of the House of Human Sciences in Paris).

According to the authors' design, the SYNTOL system could work in various modes: without grammar or with grammar (simple or advanced).

SYNTOL is a family of information retrieval languages of differing semantic power.

The languages in this family were designed so that a language of greater semantic power wholly includes the languages of lesser semantic power.

The system could convert a query into logical form using the functions "not", "and", and "or".

The minimal syntactic unit is the syntagma, a two-place predicate x R_i y, where x and y are lexical units of SYNTOL, each belonging to one of the 4 quasi-grammatical categories of this information retrieval language, and R_i is one of the four main syntagmatic relations.

Quasi-grammatical categories of words: predicates are concepts used with words that denote physical properties and states, shape, size, time, etc.; entities are beings, bodies, and objects; states are the passive properties of entities; actions are the dynamic properties of entities.

Syntagmatic relations: predicative is an asymmetric (i.e. directed) relation between two words, each of which belongs to the category of predicates; associative is an asymmetric static relation between two concepts (subject to action, action to its object or circumstances, belonging, inclusion, etc.); consecutive relations are asymmetric relations of a dynamic type that hold between two concepts when the presence of one affects the state or position of the other (relations of the type "cause-effect", "subject-object", etc.); coordinative relations are symmetric (i.e. undirected) relations (equivalence, comparison, differentiation, etc.).

In addition to these 4 main syntagmatic relations, there are 7 syntactic operators, each of which attaches to one member of a syntagma to clarify its logical role. Of these operators, 4 are intended for terms linked by associative relations (instrument, location, goal, and tag) and 3 for terms linked by coordinative relations (comparison, identification, and differentiation).
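The syntagma and the difference between searching with and without grammar can be sketched as a small data structure. The class and function names below are hypothetical, the syntactic operators and quasi-grammatical categories are left out, and the matching rules are an assumption: asymmetric relations must match in the same direction, the symmetric coordinative relation matches in either direction, and the grammar-free mode compares lexical units only.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    PREDICATIVE = 1
    ASSOCIATIVE = 2
    CONSECUTIVE = 3
    COORDINATIVE = 4   # the only symmetric (undirected) relation


@dataclass(frozen=True)
class Syntagma:
    """Two-place predicate x R_i y over lexical units x and y."""
    x: str
    relation: Relation
    y: str

    def matches(self, other):
        if self.relation is not other.relation:
            return False
        if (self.x, self.y) == (other.x, other.y):
            return True
        # Direction is ignored only for the symmetric relation.
        return (self.relation is Relation.COORDINATIVE
                and (self.x, self.y) == (other.y, other.x))


def search_with_grammar(query, document):
    """Every query syntagma must match some document syntagma."""
    return all(any(q.matches(d) for d in document) for q in query)


def search_without_grammar(query, document):
    """Only the lexical units are compared; relations are ignored."""
    units = {u for s in document for u in (s.x, s.y)}
    return all(q.x in units and q.y in units for q in query)


document = [Syntagma("heating", Relation.CONSECUTIVE, "expansion")]
query = [Syntagma("expansion", Relation.CONSECUTIVE, "heating")]
print(search_without_grammar(query, document))  # True
print(search_with_grammar(query, document))     # False
```

The example shows why a grammar raises precision: without it, a query about expansion causing heating retrieves a document about heating causing expansion.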

SMART system (SMART = Salton's Magical Automatic Retriever of Text)

The automated document retrieval system SMART was developed by G. Salton in the 1960s and was implemented at Harvard and Cornell Universities on IBM 7094 and IBM 360 computers. It was the first fully automated system that processed the texts of documents and queries (in English) and issued the documents that matched the search queries as closely as possible.

The SMART system included various types of information retrieval languages and was used as an experimental tool for assessing the effectiveness of the semantic tools built into it. The system had a set of tools for analyzing content from different points of view: word matching methods, stored dictionaries that reduce discrepancies in vocabulary, statistical and syntactic methods for establishing links between words and concepts, and methods for constructing and analyzing word combinations. With these tools, search queries that received unsatisfactory answers could be processed again under slightly modified conditions. The result was analyzed and, where necessary, further changes were made until the required information was provided.

In terms of the principles of document analysis, the SMART system provides the following tools:

1. A system for splitting English words into stems and affixes. It can be used to reduce input texts to word stems.

2. A dictionary of synonyms, or thesaurus, used to replace significant words with concept numbers, each of which represents a class of word stems that are close in meaning.

3. A hierarchical structure of the concepts included in the thesaurus, which makes it possible to find, for any concept number, its "parent", "sons", "brothers", and many possible cross-references.

4. Methods of statistical association, applied to calculate similarity coefficients between words, word stems, or concepts.

5. Methods of syntactic analysis, which make it possible to recognize phrases consisting of several words or concepts linked by certain syntactic relations and to use them as characteristics of a document's content.

6. Methods of statistical recognition of word combinations, used in the same way as the preceding syntactic methods, on the basis of a previously compiled dictionary of word combinations.

7. Correlation methods for comparing documents and queries. A number of different correlation methods were used, including ones that take into account the weights of concepts and the lengths of the texts of the analyzed documents. The KCC can take the form of an analytic function representing the cosine of the angle between the POD vector and the POS vector.
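Tools 1, 2, and 7 from the list above can be combined into a small sketch: words are reduced to stems, stems are replaced by concept numbers, and the resulting vectors are compared by the cosine of the angle between them. The suffix list, the thesaurus, and the function names are toy assumptions made for illustration and do not reproduce the actual SMART dictionaries or programs.

```python
import math
from collections import Counter

# Toy suffix list; a real system would use a full stem/affix dictionary.
SUFFIXES = ("ing", "ers", "er", "es", "s")

# Hypothetical thesaurus: word stem -> concept number.
THESAURUS = {
    "comput": 101,
    "retriev": 202, "search": 202,
    "document": 303, "text": 303,
}


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def concept_vector(text):
    """Count concept numbers for the words of a text found in the thesaurus."""
    stems = (stem(word) for word in text.lower().split())
    return Counter(THESAURUS[s] for s in stems if s in THESAURUS)


def cosine(a, b):
    """Cosine of the angle between two sparse concept vectors."""
    dot = sum(weight * b[concept] for concept, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = concept_vector("computing searches text")
full_match = concept_vector("computers retrieving documents")
partial_match = concept_vector("documents")
print(round(cosine(query, full_match), 3), round(cosine(query, partial_match), 3))
```

Although the query and the first document share no word forms, stemming and the thesaurus map both to the same concepts, so their cosine is close to 1; the second document covers only one of the three concepts and scores lower.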

Documents entered into the computer memory and search queries are processed without any preliminary manual analysis, using one of hundreds of methods of automatic content analysis. As a result, the documents most relevant to the given search query are identified.

The input data of the system consisted of three main classes:

• Dictionaries, grammars, and hierarchies. These define the relationship between the features of the input English texts and the concepts that are ultimately used to represent the content of documents and queries.

• Specifications. These indicate which content analysis programs are applicable and which dictionaries should be used in each particular case. Specifications are also needed to define the array of documents to be processed, to determine the exact algorithm for comparing documents with search queries, to set the weighting coefficients of the concepts obtained by the various analysis methods, to determine the type of output data, etc.

• Documents and search queries. These are presented in various forms (the title only, abstracts and summaries, or the full text).

The output data obtained as a result of the system's operation takes the form of printed lists (including, for example, the texts of the documents in the array), lists of words not found in the dictionaries, lists of document vectors, correlation data, and the answers the system gives to search queries.

Because the SMART system became the most widely known, we list its main properties.

• It is believed that the information analysis operations in the system are sufficiently complete and perfect to ensure that most of the relevant materials are located in response to most searches.

• The diverse needs of individual users are taken into account by allowing them to choose among a number of different ways of processing text and a corresponding sequence of search methods in order to achieve satisfactory results. A search can be performed as a single process or repeated under the user's control as several partial searches in the required area.

• The system can be used as a tool for evaluating the effectiveness of various methods of automatic document analysis; in this case, the results can be compared for the same queries on the same array of documents but with different search methods.

• The system can operate in real time, i.e. in such a way that different users have simultaneous access to the array of documents.

In our country, documentary information retrieval systems were developed for all levels of the State System of Scientific and Technical Information (GSNTI). At the state level, the integrated information system Assistant was created (see paragraph 6.12).
