Informetric patterns and their application to the investigation of information flows

The concept of informetrics

The term informetrics was introduced in the early 1980s, by analogy with scientometrics and bibliometrics, as a brief name for quantitative methods of studying scientific and technical information. The term was most fully elaborated by V. I. Gorkova [6].

Informetric patterns describe the distribution of information in documentary information flows (DIP), the quantitative and qualitative parameters of the organization of frequency dictionaries, and the use of words in the texts of documents. DIPs consist of official, periodical and serial publications, together with other published and unpublished documents of scientific and technical information.

The first results in the study of the linguistic regularities of natural language were obtained by J. B. Estoup (1916) and A. J. Lotka (1926).

The qualitative properties of frequency dictionaries were characterized in 1916 by J. B. Estoup, who found that the frequency with which a word is used in a text is inversely proportional to its rank in the frequency dictionary.

The laws of informetrics have been studied most thoroughly by G. Zipf, B. Mandelbrot, S. Bradford and B. Vickery.

Zipf's Laws

In the early 1930s, George K. Zipf (G. K. Zipf) derived the following pattern from statistical studies.

Suppose there is a text of length $N$ words and a dictionary of $m$ words, each with its frequency of occurrence in the text. The words in the dictionary are arranged in descending order of frequency and ranked from 1 to $m$. Rank 1 is assigned to the word with the highest frequency of occurrence, and rank $m$ to the least common word. Then:

$$p_{r_i} = \frac{f_{r_i}}{N}$$

where $p_{r_i}$ is the relative frequency of occurrence of the word in the text; $f_{r_i}$ is the absolute frequency of occurrence of the word of rank $r_i$ in a text of a given length; $N$ is the number of words in the text; $r_i$ is the rank of the word, where $1 \le i \le m$.
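The ranking procedure described above can be sketched in a few lines of Python. This is only an illustration: the function name `rank_frequency`, the whitespace tokenization and the sample sentence are assumptions, not part of the original text. Ties in frequency are broken alphabetically so that the ranks are reproducible.

```python
from collections import Counter

def rank_frequency(text):
    """Return rows of (rank, word, absolute frequency, relative frequency)."""
    words = text.lower().split()   # naive whitespace tokenization (assumption)
    n = len(words)                 # N: number of words in the text
    counts = Counter(words)        # absolute frequency of every distinct word
    # Descending frequency; alphabetical order breaks ties deterministically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq, freq / n)
            for rank, (word, freq) in enumerate(ordered, start=1)]

sample = "the cat and the dog and the bird"   # hypothetical sample text
table = rank_frequency(sample)
for row in table:
    print(row)
```

The relative frequencies in the last column sum to 1, and the last row carries rank m, the size of the dictionary.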

If we multiply the probability, or relative frequency, of finding a word in the text by the rank $r_i$ of that word, we get:

$$p_{r_i} \cdot r_i = k$$

where $k$ is a constant and $1 \le r_i \le m$.

Rearranging the formula gives:

$$p_{r_i} = \frac{k}{r_i}$$

that is, a function of the type $y = k/x$, whose graph is an equilateral hyperbola.
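One way to check the constancy of the product of relative frequency and rank is to compute it for every rank. The sketch below uses hypothetical counts chosen to follow 120/r exactly, so the product is the same at every rank; real texts only approximate this. The helper names `zipf_products` and `estimate_k` are illustrative assumptions.

```python
def zipf_products(frequencies):
    """Return p_r * r for every rank r, given absolute word frequencies."""
    n = sum(frequencies)                        # N: total number of words
    ordered = sorted(frequencies, reverse=True) # rank 1 = most frequent word
    return [(f / n) * r for r, f in enumerate(ordered, start=1)]

def estimate_k(frequencies):
    """Estimate the constant k as the mean of the products p_r * r."""
    products = zipf_products(frequencies)
    return sum(products) / len(products)

counts = [120, 60, 40, 30, 24]   # hypothetical counts, exactly 120 / r
products = zipf_products(counts)
k = estimate_k(counts)
print(products)
print(k)
```

For a real text the products scatter around some value, and the mean is only one of several reasonable estimates of k.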

Thus, from an analysis of the dependences obtained, Zipf proposed an empirical formula that relates the frequency of occurrence of a word in the text to its rank in the dictionary:

$$p_{r_i} = k \, r_i^{-1}$$

where $k$ is an empirically determined constant that varies from text to text.

Here $1 \le r_i \le m$; $p_1$ is the frequency of the most commonly used word; $p_m$ is the frequency of the least used word; $p_{r_i} = \varphi(r_i)$ is a "hyperbolic ladder", because the rank distribution is stepwise (several words occur with the same frequency), although to a good approximation the Zipf distribution can be treated as a hyperbola (Figure 4.9).

Fig. 4.9. Zipf's first law

The value of the constant differs between languages, but within one language group it remains unchanged, whatever text we take. Studies show, for example, that for English texts the Zipf constant is approximately 0.1, and for Russian about 0.06-0.07.

Zipf therefore also gave this law in the form

$$p_{r_i} = \frac{k}{r_i}$$

where $k = 0.1$ (for natural languages).

Based on experimental data from statistical studies of many texts in various languages, Zipf also found that the distribution of words in a natural language obeys a single simple law, which he called the principle of least effort. When we express thoughts in language, we are subject to two opposing forces, the force of unification and the force of diversification, which manifest themselves, on the one hand, in the need to be understood and, on the other, in the desire to express a thought more briefly.

Zipf also found that a frequency and the number of words that occur in the text with that frequency are related. If we plot the number of words having a given frequency against that frequency, we obtain a curve similar to that in Fig. 4.8, which keeps its parameters for all texts created by humans, with some deviations between natural languages (Figure 4.10).

Fig. 4.10. Zipf's second law

This regularity is sometimes called Zipf's second law.
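The dependence behind the second law, the number of distinct words that occur with a given frequency, can be tabulated as follows. This is a minimal sketch under the same assumptions as any toy example: the function name `frequency_spectrum`, the whitespace tokenization and the sample sentence are illustrative only.

```python
from collections import Counter

def frequency_spectrum(text):
    """Return sorted (frequency, number of distinct words with that frequency)."""
    counts = Counter(text.lower().split())   # word -> absolute frequency
    spectrum = Counter(counts.values())      # frequency -> number of words
    return sorted(spectrum.items())

sample = "the cat and the dog and the bird"   # hypothetical sample text
spectrum = frequency_spectrum(sample)
for freq, n_words in spectrum:
    print(freq, n_words)
```

Even in this tiny sample, words that occur once are the most numerous group, which is the shape the text attributes to the curve in Figure 4.10.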

Studies have shown that the most significant words lie in the middle part of the hyperbola (see Figure 4.9). Words that occur too often are mostly prepositions, pronouns and, in English, articles. Words that occur rarely also carry no decisive semantic weight in most cases.

How the range of significant words is chosen depends on the properties of the information retrieval system.

If the range is set too wide, the needed terms drown in a sea of auxiliary words; if it is set too narrow, meaningful terms may be lost. Each search engine solves this problem in its own way, taking into account the total volume of text, special dictionaries, and so on.
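A very simple way to express such a range is a band of ranks. The sketch below keeps only the words whose rank falls between two thresholds; the function name `significant_words` and the parameters `low_rank` and `high_rank` are hypothetical, and a real retrieval system would combine this with stop-word lists, text volume and other signals.

```python
from collections import Counter

def significant_words(text, low_rank, high_rank):
    """Return the words whose rank lies in [low_rank, high_rank]."""
    counts = Counter(text.lower().split())   # naive tokenization (assumption)
    # Descending frequency; alphabetical order breaks ties deterministically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for rank, (word, _) in enumerate(ordered, start=1)
            if low_rank <= rank <= high_rank]

# Hypothetical sample text and thresholds, chosen only for illustration.
sample = "the flow of the documents and the flow of the documents in a stream"
selected = significant_words(sample, low_rank=2, high_rank=3)
print(selected)
```

Widening the band in this sample to ranks 1 through 8 returns every word, including "the", "of" and "a", which illustrates how a range that is too wide lets auxiliary words back in.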

Thus, Zipf's laws reflect a property common to different languages: every text in any natural language contains a certain number of most commonly used words, and that number is much smaller than the total number of words used in the text.

Zipf's laws are universal. In principle, they apply not only to texts.

For example, the dependence of the number of cities on the number of their inhabitants takes a similar form. The popularity of nodes on the Internet also follows Zipf's laws.

Zipf's laws also appear in the study of documentary information flows (DIP). In this case, Zipf's first law is expressed through the absolute frequency of occurrence of words:

$$f_i = \frac{C}{r_i}$$

where $f_i$ is the absolute frequency of occurrence of the word in the texts of the documentary flow; $r_i$ is the rank of the word in the rank distribution; $C$ is the frequency of occurrence of the word of rank 1, which for a given DIP can be treated as an empirical constant.
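To see how closely a documentary flow follows this form, the observed absolute frequencies can be set beside the values predicted from the rank-1 frequency C. The counts below are hypothetical, and the function name `compare_with_zipf` is an assumption made for this sketch.

```python
def compare_with_zipf(observed):
    """Return rows of (rank, observed frequency, predicted frequency C / rank)."""
    ordered = sorted(observed, reverse=True)   # rank 1 = most frequent word
    c = ordered[0]                             # C: frequency of the rank-1 word
    return [(rank, freq, c / rank)
            for rank, freq in enumerate(ordered, start=1)]

observed = [500, 260, 160, 130, 95]   # hypothetical word counts in a DIP
rows = compare_with_zipf(observed)
for rank, freq, predicted in rows:
    print(rank, freq, round(predicted, 1))
```

The gap between the second and third columns gives a rough picture of how far a particular flow departs from the idealized hyperbola.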
