Mathematical methods of data classification: the general idea of cluster analysis

Another method that allows a researcher to organize a large array of data is cluster analysis. In practice, this term designates a whole family of mathematical procedures for classifying the material available to the researcher. The groups of objects or variables identified by such procedures are called clusters. The input data for cluster analysis are the same similarity and confusion matrices as in multidimensional scaling. In fact, the method is akin to multidimensional scaling, but it is somewhat simpler.

To date, a large number of different cluster-analysis algorithms have been developed. They differ not only in their mathematical procedures but also in their general methodology (I. D. Mandel [12]).

Some of these algorithms identify groups of objects, clusters, by a predefined set of properties. This approach is usually called heuristic; it defines the procedures of direct classification. Heuristic procedures also include a more flexible approach, with algorithms that allow a separate, unique set of properties for each cluster. Such algorithms are called procedures of combined direct classification.

Distinct from these two groups of the heuristic approach are the cluster-analysis procedures known as the optimization direction. This direction treats the problem of identifying clusters as a mathematical optimization problem proper.

Finally, an approximation direction is also distinguished. Here the clustering algorithm aims to provide the best approximation of the data to a predetermined classification.

Within the framework of these approaches, which in practice overlap considerably, a number of different algorithms have been developed. The most developed and the most in demand in psychological research are hierarchical algorithms of cluster analysis. These algorithms are usually classified as direct-classification algorithms. The result of applying a hierarchical cluster analysis algorithm is the so-called dendrogram, a tree of hierarchical classification. One example of such a classification is shown in Fig. 10.2.

Fig. 10.2. Dendrogram

Hierarchical clustering is a step-by-step procedure. First, the objects or variables that are most similar to each other are identified. As shown in Fig. 10.2, these are objects 6 and 5, and also 1 and 2. Then, at each subsequent step, other objects or variables are added to the selected clusters until all objects are merged into one cluster. Objects do not necessarily join the previously formed groups one at a time: as Fig. 10.2 shows, the last step merges the two clusters identified at the first step.
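For readers who want to experiment, here is a minimal sketch of such a procedure in Python with SciPy. The six objects and their two features are invented purely for illustration and are not the data behind Fig. 10.2.

```python
# A minimal sketch of hierarchical (agglomerative) clustering with SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Toy data: six objects described by two features (invented for illustration).
objects = np.array([
    [1.0, 1.1], [1.2, 0.9],    # objects 1 and 2 - very similar
    [3.0, 4.0], [5.0, 5.2],    # objects 3 and 4
    [8.0, 8.1], [8.2, 7.9],    # objects 5 and 6 - very similar
])

distances = pdist(objects)                    # pairwise distances between objects
merge_tree = linkage(distances, "average")    # step-by-step agglomeration

# The dendrogram - the tree of hierarchical classification.
dendrogram(merge_tree, labels=[str(i) for i in range(1, 7)])
plt.title("Tree of hierarchical classification")
plt.show()
```

The most similar pairs merge first, and the remaining objects and clusters join step by step until a single cluster remains, exactly the logic described above.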

Practical examples

Factor Structure of the Semantic Differential

Measuring reaction time in semantic decision tasks, an example of which was considered above (see paragraph 9.5), is one of the popular methods for studying the system of meanings. It was developed in the 1960s and became particularly popular in the 1970s and 1980s in the context of research on semantic memory. Another method of investigating meanings appeared somewhat earlier, in the 1940s and 1950s, and in Soviet psychology it came into wide use in the 1970s, in connection with the development of the field of the psychology of thinking called experimental psychosemantics. This is the method of the semantic differential, developed by C. Osgood and his colleagues [25]. An important part of processing the data obtained with this method is the factor analysis procedure.

The method of the semantic differential is quite simple. The subject is asked to evaluate a number of objects, which can be both verbal and non-verbal stimuli, on a set of scales. These scales are, as a rule, defined by pairs of qualitative adjectives, most often antonyms such as cold - warm, large - small, hard - soft. A variant of this technique is the personal semantic differential, in which personal characteristics serve as the scales and specific people serve as the objects; the presence of the corresponding personal traits is assessed by experts in the course of observation or conversation.

The result of this work is the already familiar raw data matrix. Using the factor analysis procedure, C. Osgood [25] showed that the diversity of subjects' ratings can, as a rule, be described by three basic factors: evaluation, potency (strength) and activity. However, owing to various circumstances, the factor structure of the semantic differential in a particular case may differ.

As an example of the use of factor analysis in evaluating the structure of the semantic differential, let us consider a student project carried out within the general psychology practicum at the L. S. Vygotsky Institute of Psychology of the Russian State University for the Humanities (RSUH).

The students were asked to choose the objects of assessment themselves, while the set of semantic-differential scales was specified by the instructor, following the recommendations of A. G. Shmelev presented on the site of the Humanitarian Technologies laboratory. Here is the list:

1) cold - warm;

2) light - heavy;

3) slow - fast;

4) ugly - beautiful;

5) soft - hard;

6) quiet - noisy;

7) bitter - sweet;

8) small - large;

9) listless - cheerful;

10) nasty - nice;

11) flexible - elastic;

12) dim - bright.

In one of the projects, items of household utensils were chosen as the objects of assessment: an iron, a frying pan, a comb, a ruler, a saucepan, a saucer, a sponge and a thread. The subject had to rank these eight objects on each of the 12 preset scales of the semantic differential. The result was the raw data matrix presented in Table 10.4.

Table 10.4

Results of ranking eight items of household utensils on the 12 scales of the semantic differential

Object      | Semantic differential scales
            |  1   2   3   4   5   6   7   8   9  10  11  12
------------+-----------------------------------------------
Iron        |  8   8   1   1   6   6   1   6   8   1   8   2
Frying pan  |  7   7   2   2   8   8   4   7   4   2   7   1
Comb        |  5   4   8   4   3   5   2   3   3   6   3   6
Ruler       |  1   3   6   7   4   3   3   4   7   4   4   4
Saucepan    |  6   6   3   3   5   7   7   8   6   3   6   8
Saucer      |  3   5   4   8   7   4   8   2   5   5   5   7
Sponge      |  4   2   7   6   1   2   5   1   1   8   1   3
Thread      |  2   1   5   5   2   1   6   5   2   7   2   5

A brief glance at these results is enough to see that ratings on a number of scales largely coincide across objects. Thus, on scales 1, 2, 9 and 11 the iron receives rank 8, i.e. this object ends up at the very end of the ranking list, while on the same scales the ranks for the thread put this object almost at the very top. It can therefore be assumed that the variance of the data on these four scales is in fact determined by the significant weight, shared by all of them, of a single feature hidden from direct observation. As we know, such hidden characteristics are called factors.
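Before turning to the statistical packages, it is convenient to have the table in machine-readable form. Here is a sketch in Python (NumPy); the correlation check at the end merely illustrates the observation just made about scales 1, 2, 9 and 11.

```python
# Table 10.4 as a NumPy array: rows are the eight objects,
# columns are the twelve semantic-differential scales.
import numpy as np

data = np.array([
    [8, 8, 1, 1, 6, 6, 1, 6, 8, 1, 8, 2],   # iron
    [7, 7, 2, 2, 8, 8, 4, 7, 4, 2, 7, 1],   # frying pan
    [5, 4, 8, 4, 3, 5, 2, 3, 3, 6, 3, 6],   # comb
    [1, 3, 6, 7, 4, 3, 3, 4, 7, 4, 4, 4],   # ruler
    [6, 6, 3, 3, 5, 7, 7, 8, 6, 3, 6, 8],   # saucepan
    [3, 5, 4, 8, 7, 4, 8, 2, 5, 5, 5, 7],   # saucer
    [4, 2, 7, 6, 1, 2, 5, 1, 1, 8, 1, 3],   # sponge
    [2, 1, 5, 5, 2, 1, 6, 5, 2, 7, 2, 5],   # thread
])

# Scales 1, 2, 9 and 11 (0-based columns 0, 1, 8, 10) should correlate
# highly if they indeed reflect one hidden factor.
print(np.round(np.corrcoef(data[:, [0, 1, 8, 10]], rowvar=False), 2))
```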

Let us try to reconstruct the factor structure of the semantic differential for the results presented in Table 10.4. Clearly, doing this by hand is extremely difficult, so we again use the statistical package IBM SPSS Statistics. In its standard configuration this package contains a module that provides factor analysis procedures.

As usual, we begin by defining the variables. This step is no different from the one described in paragraph 9.6, where we considered how variables for regression analysis are defined in IBM SPSS Statistics. Go to the Variables view and enter the names of our 12 semantic-differential scales. Since the package imposes restrictions on variable names, in particular a space is not allowed, specify short names in the Name field and the full designations in the Labels field (Figure 10.3).

Fig. 10.3. Defining variables for factor analysis in IBM SPSS Statistics

Now return to the Data view and enter the results of evaluating our objects on all variables from Table 10.4. We thus obtain the raw data matrix whose content will be subjected to factor analysis (Figure 10.4).

Fig. 10.4. Data for factor analysis

Now in the main menu select "Analyze", then "Dimension Reduction" and "Factor...". A window for setting up the factor analysis appears (Figure 10.5).

Fig. 10.5. The Factor Analysis Settings window in IBM SPSS Statistics

In the left part of this window there is a list of variables available for selection. To the right of it is the Variables field, into which you transfer the variables to be used in the factor analysis; let us transfer all our variables there. Below it is a field for a variable whose values can be used to select observations. Since we want to factorize all the data, we leave this field empty. Note also the possibilities for additional fine-tuning of the factor analysis.

Thus, the Descriptives... button enables us to output additional statistics, in particular the correlation matrix.

The Extraction... button allows us to choose a variant of the factoring procedure and thereby define the model of our data. By default the principal components procedure is set here, which, as we recall, defines the basic full-component model of factor analysis. This option is used most often, and unless you have special reasons to use another procedure, you should keep it. The same button also lets you specify the criterion for extracting factors: based on eigenvalues (latent roots) or on a predetermined number of factors. By default the first option is used: factors whose eigenvalues, i.e. the values of the latent roots, turn out to be less than one are discarded. This criterion can be redefined.
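This default criterion (the Kaiser criterion) is easy to reproduce outside SPSS. A minimal sketch in Python, reusing the `data` matrix entered above; the exact SPSS output may differ in rounding.

```python
# A sketch of the Kaiser criterion: keep the components whose eigenvalues
# (latent roots) of the correlation matrix exceed one.
import numpy as np

R = np.corrcoef(data, rowvar=False)           # 12 x 12 correlation matrix
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]         # largest roots first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print("latent roots:", np.round(eigenvalues, 3))
print("components with roots > 1:", int((eigenvalues > 1.0).sum()))
```

The leading roots should approximate the first column of Table 10.5 below.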

More important is the Rotation... button, which allows you to set the procedure for the subsequent rotation of factors. As an example let us choose varimax, the rotation most often used in psychological research.
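Varimax itself is not complicated. Below is a minimal NumPy sketch of the standard orthomax algorithm, of which varimax is the special case gamma = 1. It omits the Kaiser normalization that SPSS applies by default, so its loadings will differ slightly from those in Table 10.7.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthomax rotation of a loading matrix; gamma = 1 gives varimax.
    A textbook SVD-based algorithm, without Kaiser normalization."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the orthomax criterion at the current rotation.
        target = rotated ** 3 - (gamma / p) * rotated @ np.diag(
            (rotated ** 2).sum(axis=0))
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        if s.sum() < criterion * (1 + tol):   # criterion stopped growing
            break
        criterion = s.sum()
    return loadings @ rotation
```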

When the factor analysis is configured, click OK. An output window appears; its contents depend on the settings we have chosen. The most important results are presented in the tables "Total Variance Explained", "Component Matrix" and the similar "Rotated Component Matrix".

The Total Variance Explained table is important for understanding how many factors have been identified as significant and what percentage of the variance they explain individually and collectively. Let us consider it in more detail (Table 10.5).

The first column of the table lists the extracted components; there are 12 of them, by the number of variables under investigation. Next come the initial eigenvalues, otherwise called latent or characteristic roots. The column containing these data is divided into three parts. The leftmost part gives the values of the latent roots themselves. Since the full component model is used, the sum of these values equals the number of variables, i.e. 12. To the right are the percentages of variance described by each of the components identified in the course of the factor analysis, followed by the cumulative values of these percentages. As can be seen, the principal component analysis identifies three significant factors whose latent roots are greater than one. Together these three factors account for about 86% of the variance of the data. The column "Extraction Sums of Squared Loadings" summarizes this result.

Table 10.5

Latent root values and explained variance in the factor analysis output of IBM SPSS Statistics (Total Variance Explained)

Component | Initial eigenvalues                 | Extraction sums of squared loadings | Rotation sums of squared loadings
          | Total        % var        Cum. %    | Total    % var     Cum. %           | Total    % var     Cum. %
----------+-------------------------------------+-------------------------------------+----------------------------------
1         | 7.457        62.139       62.139    | 7.457    62.139    62.139           | 4.558    37.983    37.983
2         | 1.789        14.905       77.044    | 1.789    14.905    77.044           | 3.849    32.075    70.058
3         | 1.041        8.677        85.721    | 1.041    8.677     85.721           | 1.880    15.663    85.721
4         | 0.729        6.077        91.798    |                                     |
5         | 0.646        5.380        97.178    |                                     |
6         | 0.269        2.240        99.418    |                                     |
7         | 0.070        0.582        100.000   |                                     |
8         | 6.037E-016   5.031E-015   100.000   |                                     |
9         | 1.569E-016   1.307E-015   100.000   |                                     |
10        | -1.019E-016  -8.492E-016  100.000   |                                     |
11        | -3.031E-016  -2.526E-015  100.000   |                                     |
12        | -6.007E-016  -5.006E-015  100.000   |                                     |

Extraction method: principal component analysis.
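The percentage columns of this table are easy to verify: with standardized variables the total variance equals the number of variables, so each latent root divided by 12 and multiplied by 100 gives its share of the variance. A short sketch reusing `eigenvalues` from the snippet above:

```python
# Cross-check of the "% of variance" and "Cumulative %" columns.
import numpy as np

percent = 100 * eigenvalues / eigenvalues.size     # eigenvalues.size == 12
print(np.round(percent[:3], 3))                    # close to 62.139, 14.905, 8.677
print(np.round(np.cumsum(percent[:3]), 3))         # cumulative, about 85.7% in total
```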

The last columns reflect the same results after applying the varimax rotation procedure. As can be seen, the values of the latent, or characteristic, roots have been redistributed: the weights of the second and third factors have somewhat increased, and that of the first has decreased. Note also that while before rotation the first factor described more than 62% of the total variance, the second about 15% and the third less than 9%, after rotation the first and second factors carry comparable weights, describing about 38% and 32% of the total variance respectively, while the third factor accounts for a little more than 15.5% of the variance of the data.
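This redistribution can be reproduced with the snippets above: the variance attributed to a factor is simply the sum of its squared loadings, and an orthogonal rotation reshuffles these sums without changing their total. A sketch reusing `eigenvalues`, `eigenvectors` and `R` from the Kaiser-criterion snippet and the `varimax` function defined earlier:

```python
# How the "sums of squared loadings" columns arise.
import numpy as np

loadings = eigenvectors[:, :3] * np.sqrt(eigenvalues[:3])   # unrotated loadings
rotated = varimax(loadings)

for label, L in (("before rotation:", loadings), ("after rotation: ", rotated)):
    ssl = (L ** 2).sum(axis=0)                  # variance carried by each factor
    print(label, np.round(100 * ssl / R.shape[0], 1), "% of total variance")
```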

The factor matrix obtained by applying the principal components procedure, for seven variables and the three factors whose latent roots exceeded one, is presented in Table 10.6. The remaining variables proved redundant for the factor analysis and were therefore discarded, so that the principal component analysis was in fact carried out only for these seven variables. This is clearly reflected in Table 10.5: seven components describe almost 100% of the total variance, and the contribution of the remaining components is at the level of computational rounding error.

Table 10.6

The matrix of factor loadings before rotation

Component Matrix*

Variable             | Component 1 | Component 2 | Component 3
---------------------+-------------+-------------+------------
Cold - warm          |  -0.862     |  -0.194     |  -0.035
Light - heavy        |  -0.309     |   0.797     |   0.392
Soft - hard          |   0.804     |   0.379     |  -0.155
Quiet - noisy        |   0.975     |   0.168     |  -0.133
Small - large        |   0.757     |   0.097     |   0.305
Listless - cheerful  |   0.694     |   0.275     |  -0.554
Dim - bright         |  -0.306     |   0.728     |   0.252

Extraction method: principal component analysis. * Components extracted: 3.

Let us look at the third factor in this table of factor loadings. As is evident from Table 10.6, all the loadings on this factor are quite low. A relatively high loading can be noted only for the variable "listless - cheerful". However, this variable simultaneously has an even higher loading on the first factor, which makes interpretation difficult. It is therefore worth looking at the results obtained after applying the varimax rotation procedure. These data are presented in Table 10.7.
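When reading such matrices it helps to group each variable under the factor on which it loads most heavily. Below is a small helper sketch of our own, not part of the SPSS output; note that our NumPy run keeps all 12 scales, while the SPSS tables show only the seven retained ones, so the two outputs are not directly comparable.

```python
# Assign every variable to the factor with its largest absolute loading.
import numpy as np

scale_names = [
    "cold-warm", "light-heavy", "slow-fast", "ugly-beautiful",
    "soft-hard", "quiet-noisy", "bitter-sweet", "small-large",
    "listless-cheerful", "nasty-nice", "flexible-elastic", "dim-bright",
]

def dominant_loadings(loadings, names):
    best = np.argmax(np.abs(loadings), axis=1)   # strongest factor per variable
    for j in range(loadings.shape[1]):
        print(f"Factor {j + 1}:")
        for i in np.where(best == j)[0]:
            print(f"  {names[i]}: {loadings[i, j]:+.2f}")

dominant_loadings(rotated, scale_names)          # `rotated` from the sketch above
```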

Table 10.7

The matrix of factor loadings after rotation

Rotated Component Matrix*

Variable             | Component 1 | Component 2 | Component 3
---------------------+-------------+-------------+------------
Cold - warm          |  -0.691     |  -0.552     |  -0.010
Light - heavy        |  -0.087     |  -0.087     |   0.932
Soft - hard          |   0.831     |   0.338     |   0.101
Quiet - noisy        |   0.851     |   0.510     |  -0.109
Small - large        |   0.428     |   0.698     |   0.065
Listless - cheerful  |   0.919     |  -0.002     |  -0.143
Dim - bright         |  -0.039     |  -0.171     |   0.810

Extraction method: principal component analysis.
Rotation method: varimax with Kaiser normalization. * The rotation converged in 6 iterations.

As can be seen, after rotation the first factor has its highest loadings on the variables "soft - hard" (0.83), "quiet - noisy" (0.85) and "listless - cheerful" (0.92). In other words, at one pole of this factor are such characteristics as "soft", "quiet" and "listless", and at the other "hard", "noisy" and "cheerful". Apparently this is the activity factor assumed in the standard procedure of the semantic differential. The second factor is evidently set by the opposition "small - large"; obviously, this is a size factor. Finally, the third, least significant factor has its maximum loadings on the scales "light - heavy" (0.93) and "dim - bright" (0.81). It thus opposes "light" and "dim", on the one hand, to "heavy" and "bright", on the other. Taking into account the metaphorical character of rating objects by the semantic differential method, we can say that this is a weight factor. It can also be interpreted as a potency (force) factor, which is likewise assumed by the basic propositions of C. Osgood's theory.

Thus, after applying the factor rotation procedure we obtained three factors, which in decreasing order of importance can be interpreted as Activity, Size and Weight (Potency). This only partially coincides with the standard factor structure of the semantic differential described by C. Osgood [25]. This fact can be explained by the particular objects that were chosen for assessment. It is possible, however, that it is also due to peculiarities of the factor extraction procedure.

The point is that, historically, an earlier factor analysis procedure is known: the centroid method, proposed by L. Thurstone [27]. This procedure belongs to the multiple-group procedures of factor analysis, which are extensions of the diagonal procedure. It was this method that C. Osgood originally used in constructing the semantic differential.

The centroid method can give interesting results in a situation where we have a large number of weakly correlated variables. And although in our case we are dealing with only 12 variables, the correlations between them are indeed not very high.
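For the curious, the first step of centroid extraction is simple enough to sketch by hand: the loading of each variable on the first centroid factor is its column sum in the correlation matrix divided by the square root of the grand total. The sketch below omits the sign reflection and the iterative estimation of communalities that the full method requires for further factors.

```python
# A minimal sketch of the first step of Thurstone's centroid extraction,
# assuming no sign reflection is needed.
import numpy as np

def first_centroid_loadings(R):
    return R.sum(axis=0) / np.sqrt(R.sum())

R = np.corrcoef(data, rowvar=False)      # `data` is the matrix from Table 10.4
print(np.round(first_centroid_loadings(R), 2))
```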

Unfortunately, the statistical package IBM SPSS Statistics does not give us the ability to use the centroid method. We therefore turn to the statistical package STATISTICA.

Working with this package differs slightly from working with IBM SPSS Statistics. After starting the program, you need to create a data file or import one from another program, for example from a file created in MS Excel or IBM SPSS Statistics.

By default, on startup the program creates a data file containing ten variables and ten observations. This does not quite suit us: remove two rows of observations and add two variables. Then double-click the name of the first variable, which by default is called Var1. This takes us to the variable definition window (Figure 10.6).

Specify the name of our first variable in the Name field and its full designation in the Long name field; the latter is similar to the Labels field in IBM SPSS Statistics. Next, click the >> button to go to the definition of the next variable, and in this way give names to all 12 variables. After that, click OK.

Now we can enter the results of our measurements on 12 scales of the semantic differential (Figure 10.7).

Fig. 10.6. The variable definition window in STATISTICA

Fig. 10.7. Raw data matrix for 12 variables and eight objects in STATISTICA

After all the data are entered, go to the Statistics tab, select Mult/Exploratory and, in the list that opens, Factor. The first factor analysis setup window appears. At this step we need to specify the variables for which the correlation matrix will be computed (see Figure 10.8). To select the variables, press the Variables button and, in the window that appears, click Select All. Then close the variable selection window by clicking OK and return to the first setup window. In the corresponding field we indicate what our data represent: raw data (Raw Data) or a correlation matrix (Correlation Matrix). In our case we choose raw data. Click OK.

Fig. 10.8. Selecting variables for factor analysis in STATISTICA

The second factor analysis setup window appears (see Figure 10.9). This is the main window: it allows us to choose the factor analysis procedure, set the criterion for extracting factors and output the correlation matrix.

To select a factor analysis procedure other than the default principal components method, go to the Advanced tab, as shown in Fig. 10.9, and select Centroid method. We then set the maximum number of factors we want to extract and the minimum eigenvalue. It is advisable to increase the maximum number of factors, since STATISTICA displays results only for the number of factors specified in this window. In addition, when the centroid method is selected, the fields for the number of iterations for estimating communalities and for the minimum change of the communalities at the next iteration step become active. As a rule, the default values are quite acceptable and sufficient.

Fig. 10.9. The main window for setting factor analysis in STATISTICA

After we click OK, a warning may appear that our data are not entirely suitable for factor analysis and that the correlation matrix has therefore been adjusted. This is exactly what we observe in the case of our data. Click OK again to reach the results window (Figure 10.10).

This window contains five tabs. The Quick tab provides the main results of the analysis, and the possibilities it offers are usually quite sufficient. The researcher can view the eigenvalues (Eigenvalues), i.e. the values of the latent roots, the matrix and two-dimensional plots of factor loadings for the selected factors, and can also specify the factor rotation procedure.

First, let us look at the eigenvalues by clicking the corresponding button. In fact, this is not even necessary, since the values of the latent roots are already shown in the brief summary at the top of the results window.

Fig. 10.10. The factor analysis results window in STATISTICA

It can be seen that two factors are extracted. The latent root for the first factor is approximately 6.98, for the second 1.69. If we nevertheless press the eigenvalues button, we see that the first factor describes slightly more than 58% of the total variance and the second slightly more than 14%. Together these two factors account for 72.28% of the variance. As can be seen, all these values are somewhat smaller than those we obtained with the principal components procedure, both in the number of factors and in the eigenvalues and the variance they explain.

Now we can look at the factor loadings of all our variables before rotation (Figure 10.11) and after rotation, applying, as last time, the varimax procedure (Figure 10.12).

First, consider the results of the original factorization. Nine variables show high loadings on the first factor and one on the second.

Since the first factor carries high loadings for a fairly large number of variables, to interpret it we select the variables with the largest loadings on this factor. These are "nasty - nice" and "flexible - elastic", and they have opposite signs, so that an opposition is formed: "nice" and "flexible" versus "nasty" and "elastic". Most likely, this is a factor of pleasantness (evaluation).

Fig. 10.11. Loadings on the two factors identified by the centroid method before rotation

Fig. 10.12. Loadings on the two factors identified by the centroid method after rotation (varimax method)

Evidently, the only variable with a high loading on the second factor is "bitter - sweet". It would appear, then, that this is a taste factor. However, it should be remembered that we are dealing with items of household utensils, which we do not usually taste. Let us therefore turn to the ratings of our objects in the raw data matrix (see Table 10.4). It can be seen that the "sweetest" objects were the saucepan and the saucer. These items are related to food and hence to taste sensations. Thus, our assumption is confirmed.

Applying the varimax rotation procedure does not in fact significantly redistribute the weights of the two factors. The first factor remains dominant, describing about 52% of the total variance; the second now describes about 20%. It is not surprising that the overall factor structure remains practically unchanged. The first factor, as before, is defined by the variables "nasty - nice" and "flexible - elastic". The second factor again receives its maximum loading from the variable "bitter - sweet". However, another variable, "ugly - beautiful", is now added to it, which somewhat complicates the interpretation of the resulting factor structure. One should therefore either return to the original solution or try some other rotation procedure.

Figure 10.13 shows the matrix of factor loadings after applying the equamax rotation procedure. As we recall, this procedure combines the quartimax and varimax rotations. We see that the conclusions concerning the initial factor structure are confirmed.
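Equamax belongs to the same orthomax family as quartimax and varimax, so in terms of the `varimax` sketch given earlier it amounts simply to a different weight gamma. The call below merely illustrates the parameter on our three-factor PCA loadings; it does not reproduce the two-factor centroid solution obtained in STATISTICA.

```python
# Orthomax family: gamma = 0 gives quartimax, gamma = 1 varimax,
# gamma = k/2 (k = number of factors) equamax.
k = loadings.shape[1]
rotated_equamax = varimax(loadings, gamma=k / 2)
```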

Weights for two factors identified by the centroid method after rotation (equamax method)

Fig. 10.13. Loadings on the two factors identified by the centroid method after rotation (equamax method)

To summarize, one should note how different the solutions obtained by the two varieties of factor analysis turned out to be. This indicates that exploratory factor analysis is, first of all, simply a convenient tool that allows the researcher to present the available data in a more compact form; it does not by itself reveal a new reality of psychic organization, as is sometimes assumed.
