Hierarchical clustering
Consider the agglomerative (merging) methods of hierarchical clustering.
Recall that in the first step of an agglomerative clustering algorithm, each object is treated as a separate cluster. Each subsequent step of the algorithm merges some pair of clusters, until in the end all objects form a single cluster. The output of the algorithm contains information whose analysis allows the researcher to choose the partition that is most rational from a given point of view.
As already noted, the disadvantage of agglomerative methods is their computational complexity: the running time grows in proportion to the cube of the number of objects in the data matrix. In addition, these methods are very demanding of the computer's RAM. Therefore, in our experience, SPSS (at least up to and including version 18, PASW Statistics) clusters no more than 12,000 objects (rows of the data matrix).
The variety of methods of this type stems from two circumstances: first, which measure is taken as the distance between points of the space and, second, by what rule the distances between clusters are determined once a cluster contains two or more objects. The measures used to determine distances between points of the space will be considered somewhat later; first we discuss the rules for determining the distance between clusters.
Rules for determining the distance between clusters. In the first step, when each object is a separate cluster, the distances between these single-object clusters are determined by the selected metric: a measure of distance or of similarity between objects in the space of variables. Then, based on considerations discussed below, some objects are combined into one cluster, and clusters containing two or more objects appear. A question arises: what is the distance between such clusters? There are various possibilities. SPSS offers the following three groups of methods.
1. Linkage methods:
a) between-groups linkage;
b) within-groups linkage;
c) nearest neighbor;
d) furthest neighbor.
2. Methods based on distances between cluster centers:
a) centroid clustering;
b) median clustering.
3. The variance-based method: Ward's method.
When the methods of the first group are used, after an object is included in a cluster, the distances from that object to all the others continue to be taken into account. When the methods of the second group are used, after an object is included in a cluster, the distance is calculated from a central point that, in one sense or another, characterizes the cluster as a whole, and only that distance is subsequently taken into account. The method of the third group follows a different logic: it merges not the clusters that are closest in some sense, but those whose merging yields the smallest increase in intracluster variance, i.e. that "loosens" to the least degree the clusters identified in the previous steps of the procedure.
Let us consider these groups of methods in more detail.
Linkage methods
Between-groups linkage. In this method, the distance between clusters is calculated by averaging all possible distances between an object of one cluster and an object of the other. This is the default method in SPSS.
Within-groups linkage. In this method, the distance between clusters is calculated as the average distance between all possible pairs of objects in the merged cluster, including pairs located within the same original cluster.
Nearest neighbor. The distance between two clusters is the distance between the two closest points belonging to different clusters. This method works well if the clusters in reality have the form of elongated chains.
Furthest neighbor. The distance between two clusters is the distance between the two most distant points belonging to different clusters. This method works well when the real clusters are compact clouds of points far apart from each other. If the clusters are elongated or naturally chain-like, this method is not suitable.
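The linkage rules of this first group have direct counterparts in open-source tools. As an illustrative sketch (not the SPSS implementation, and using synthetic data), scipy's hierarchy module names them as follows; scipy has no equivalent of the within-groups rule:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 7))  # 20 synthetic objects in a 7-variable space

# scipy's names for the linkage rules described above:
#   "average"  ~ between-groups linkage
#   "single"   ~ nearest neighbor
#   "complete" ~ furthest neighbor
for method in ("average", "single", "complete"):
    Z = linkage(X, method=method, metric="euclidean")
    print(method, Z.shape)  # (n - 1, 4): one row per merge step
```

Each row of the returned matrix records which two clusters were merged at that step and at what distance, i.e. the same information as the SPSS agglomeration schedule.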
Cluster-center methods
Centroid clustering. In the first step, each object forms a separate cluster, and the coordinates of the object are the center (centroid) of the cluster. When two clusters merge, the centroid of the new cluster is calculated as the average of the centroids of the original clusters, weighted by the number of objects in each cluster. Thus, greater importance is attached to larger clusters. As a result, at each step of the algorithm the centroid of every cluster is located at the point of the mean values of all the cluster's objects.
Median clustering. When two clusters merge, the center of the new cluster is calculated as the unweighted average of the centroids of the two merged clusters. The number of objects in these clusters is not taken into account, i.e. small and large clusters are considered equally important and enter with equal weights.
Variance-based method
Ward's method. The method minimizes the sum of squared Euclidean distances from the cluster objects to their cluster centers. At each step, the two clusters are merged that give the smallest increase in the total intracluster variance.
Practice with agglomerative clustering shows that the best results are usually obtained by the average linkage method and by Ward's method.
Distance measures between points in space. When we say that hierarchical clustering methods combine close objects into one cluster, we must agree on what "close" and "far" mean: by what rules, by what measure, are the distances between points in the space of variables determined? The points in question may be those at which individual objects are located, as well as those at which the centers of clusters are located.
Measures, i.e. the specific formulas used to compute distances, depend on the type of scale on which the variables are measured.
Interval (metric) scales
The Euclidean distance between two points x and y is the shortest distance between them. If the space is two- or three-dimensional, this measure is geometrically the length of the straight-line segment connecting the points. In the case of n variables, the Euclidean distance is calculated by the formula
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}   (13.3)
The squared Euclidean distance is calculated by the formula
d^2(x, y) = \sum_{i=1}^{n} (x_i - y_i)^2   (13.4)
Compared with the Euclidean distance, this measure attaches greater importance to large distances. It is usually recommended when the centroid, median, or Ward method is used. However, in the authors' experience, if the data contain binary variables treated as interval ones, the Euclidean distance gives higher-quality clustering results.
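A minimal sketch of the two measures just defined, on arbitrary example points, makes the difference in emphasis concrete:

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance, formula (13.3)."""
    return np.sqrt(np.sum((x - y) ** 2))

def sq_euclidean(x, y):
    """Squared Euclidean distance, formula (13.4)."""
    return np.sum((x - y) ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])
print(euclidean(x, y))     # 5.0
print(sq_euclidean(x, y))  # 25.0 — large differences weigh more heavily
```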
Block distance (city-block, or Manhattan, distance). This distance is simply the sum of the absolute values of the coordinate differences. It is calculated by the formula
d(x, y) = \sum_{i=1}^{n} |x_i - y_i|   (13.5)
If the Euclidean distance is the shortest distance between two points (in the two-dimensional case, the hypotenuse of a right triangle), the block distance is the sum of the lengths of the legs of that triangle. This measure is also called the Manhattan distance: it is jokingly said to be the path a Manhattan taxi driver must cover to get from one house to another along streets intersecting at right angles.
In most cases this distance measure leads to the same clustering results as the ordinary Euclidean distance. Note, however, that since the differences are not squared here, the influence of individual large differences (outliers) is reduced.
Chebyshev distance. The distance between two points is defined as the maximum absolute difference of their coordinates in the given space of variables:
d(x, y) = \max_{i} |x_i - y_i|   (13.6)
This distance can be useful when two objects should be considered very far apart if they differ strongly on even a single coordinate.
Minkowski distance. This distance is calculated by the formula
d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^r \right)^{1/r}   (13.7)
As is easy to see, the Minkowski distance is a generalization of the Euclidean distance: for r = 2 they coincide. By varying the parameter r, one can assign different importance to distant points compared with relatively close ones. Note that in this measure one and the same parameter r serves both to weight the differences on individual coordinates and to weight the distances between objects.
The customized measure, or power distance, is calculated by the formula
d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/r}   (13.8)
This measure is a generalization of the Minkowski measure: it uses not one but two parameters, the parameter p to weight the differences on individual coordinates and the parameter r to weight the distances between objects. If the researcher wants to increase or decrease the weight attached to dimensions on which the objects differ greatly, this can be achieved with the power distance. Naturally, if both parameters r and p equal two, this distance coincides with the Euclidean distance. It is recommended to choose the values of both parameters in the range from 1 to 4.
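The whole family of metric-scale measures can be sketched in a few lines; the example points are arbitrary, chosen so that the values are easy to check by hand:

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance, formula (13.7)."""
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

def power_distance(x, y, p, r):
    """Power distance, formula (13.8): p weights the per-coordinate
    differences, r weights the distances between objects."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / r)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
print(minkowski(x, y, 1))          # 7.0  (Manhattan / block distance)
print(minkowski(x, y, 2))          # 5.0  (Euclidean)
print(np.max(np.abs(x - y)))       # 4.0  (Chebyshev)
print(power_distance(x, y, 2, 2))  # 5.0  (p = r = 2 gives the Euclidean case)
```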
Nominal scales
To perform hierarchical cluster analysis with scales of this type, the data must be prepared differently from what was described above. Previously, each row of the data table corresponded to a particular respondent. Now the rows of the data table must correspond to the categories of one nominal variable, and the columns to the categories of the other. The categories of the first variable are treated as the objects to be divided into clusters, and the categories of the second as variables. The cells of this table contain frequencies: the numbers of respondents whose answers combine the corresponding categories. Thus, cluster analysis of this type is performed not on the original data matrix, but on the contingency table.
Accordingly, to compare two objects, i.e. two rows x and y of the contingency table, and determine the distance between them, the frequency measures mentioned in subsection 12.5 are used: the square root of the χ2 criterion and the φ2 measure.
Measure χ2. To calculate this measure, Pearson's formula is used, in which the sum of squared standardized residuals is computed over all cells of the contingency table belonging to the two particular row objects x and y. The square root of the χ2 value is used as the distance between the categories:
d(x, y) = \sqrt{\sum_{k} \frac{(f_k^{\mathrm{obs}} - f_k^{\mathrm{exp}})^2}{f_k^{\mathrm{exp}}}}   (13.9)
where k is the cell number in the contingency table; f_k^{obs} is the observed frequency in the k-th cell (e.g. the number of respondents who chose this combination of answers); f_k^{exp} is the expected frequency in the k-th cell.
Cells with larger standardized residuals contribute more to the numerical value of the χ2 criterion, and hence to the distance between the two row objects x and y. Thus, the larger the standardized residuals, the greater the distance between the rows.
Measure φ2. Here, when the distance between two rows of the contingency table is calculated, the χ2 measure is normalized: before the square root is taken, it is divided by the total sum of observed frequencies, i.e. by the total number of respondents in the two rows of the contingency table:
d(x, y) = \sqrt{\frac{1}{N} \sum_{k} \frac{(f_k^{\mathrm{obs}} - f_k^{\mathrm{exp}})^2}{f_k^{\mathrm{exp}}}}   (13.10)
In contrast to the χ2 measure, which can take arbitrarily large positive values, the φ2 measure varies from zero to one. Note that although, generally speaking, this criterion is recommended for contingency tables of two rows and two columns, in hierarchical clustering meaningful results are also obtained for a larger number of columns.
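The two frequency measures can be sketched directly from formulas (13.9) and (13.10): for two rows of a contingency table, build the 2 × k subtable, compute the expected frequencies under independence, and normalize for φ2. The example frequencies are invented:

```python
import numpy as np

def chi2_distance(x, y):
    """Square root of the chi-square statistic for the 2 x k subtable
    formed by rows x and y of a contingency table (formula 13.9)."""
    table = np.vstack([x, y]).astype(float)
    n = table.sum()
    # expected frequency under independence: row total * column total / n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    return np.sqrt(np.sum((table - expected) ** 2 / expected))

def phi2_distance(x, y):
    """Chi-square measure normalized by the total frequency of the two
    rows (formula 13.10); the result lies between zero and one."""
    n = float(np.sum(x) + np.sum(y))
    return chi2_distance(x, y) / np.sqrt(n)

x = np.array([30, 10])  # invented row of a contingency table
y = np.array([10, 30])
print(chi2_distance(x, y))  # ≈ 4.472
print(phi2_distance(x, y))  # 0.5
```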
Binary variables
Hierarchical cluster analysis can be used to cluster not only objects (e.g. respondents) but also variables. Although, as mentioned above, this application of SPSS is not considered in the book, we note that for clustering binary variables the program offers 27 similarity measures, which can be used at the discretion of the researcher.
Recommendations for ensuring the reliability and validity of hierarchical cluster analysis results. Hierarchical methods of cluster analysis (like the K-means method) are designed to detect heterogeneities existing in the space of variables. However, such heterogeneities need not exist in reality, whereas any method of cluster analysis always produces some result. If there are no real heterogeneities in the data, the method will detect small random irregularities and, on the basis of these fluctuations, divide the objects into clusters.
To understand whether a real structure of objects has been identified, or the method simply produced a partition because it had no other choice, we advise following four recommendations.
1. Perform cluster analysis using various methods of measuring distances. Compare how much the results match.
2. Perform cluster analysis using various methods of clustering. Compare results.
3. If the size of the data matrix allows, randomly split the set of classified objects into two equal parts. Perform cluster analysis separately for each half. Compare the cluster centroids of the two subsamples.
4. A very important criterion for the quality of clustering is a meaningful interpretation of the results.
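Recommendation 3, the split-half stability check, can be sketched as follows; the data and the number of clusters (four) are hypothetical stand-ins:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 7))  # synthetic stand-in for the data matrix
k = 4                          # hypothetical number of clusters

# Randomly split the objects into two equal halves
perm = rng.permutation(len(X))
halves = (X[perm[:100]], X[perm[100:]])

# Cluster each half separately and extract its centroids for comparison
for i, half in enumerate(halves, start=1):
    labels = fcluster(linkage(half, method="ward"), t=k, criterion="maxclust")
    centroids = np.array([half[labels == c].mean(axis=0)
                          for c in range(1, k + 1)])
    print(f"half {i} centroids shape:", centroids.shape)  # (4, 7)
```

If the two sets of centroids are close (after matching clusters between halves), the partition is more likely to reflect a real structure.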
Example 13.10
Using hierarchical clustering
Recall that the purpose of this algorithm is a step-by-step merging of objects (rows of the data matrix) into clusters using some measure of similarity (distance) between objects. The clustering can be carried out, for example, by one of the seven methods considered above. In the first step, each object is placed in a separate cluster. Let us consider how this algorithm works using the survey of students about the degree of their agreement with each of 22 statements (see Section 13.3). Clustering is carried out in the space of only seven variables that correlate strongly with their factors:
q4 - "I like having fun in good company";
q14 - "I'm an adult and have to help my family financially";
q16 - "I try to do everything as well as possible";
q17 - "I absolutely cannot stand queues; it is better to overpay";
q20 - "I can always easily explain myself in English";
q23 - "I cannot stand it when someone tries to order me around";
q24 - "I like to find a solution and get my bearings in an uncertain situation".
Fig. 13.16 shows how to select, in SPSS 14 and in SPSS 17, the variables in whose space the clustering will be performed. Note that the text variable Surname was entered in the "Label Cases by" field so that the diagram shows clearly which students fall into the same cluster.
Fig. 13.16. Selection of variables in the space of which clustering will be performed
Fig. 13.17. Selecting the clustering method and distance measure
Fig. 13.18. The order of issuing the table of the sequence of agglomeration of objects into clusters
Figure 13.17 illustrates the choice of the clustering method and the distance measure. (Ward's method is chosen, and the squared Euclidean distance is used as the measure.) Fig. 13.18 shows how to order the output of a table with the merge sequence and information on the growth of the clustering quality criterion (in this case, recall, the intracluster variance).
Fig. 13.19 shows how to order, as the result of clustering, a special tree diagram (Dendrogram), which is very convenient for a relatively small number of clustered objects, as in this case. With a large number of objects this graph becomes unreadable.
Fig. 13.19. Order for issuing a tree diagram ( Dendrogram )
After the general statistical summary of the objects, the Agglomeration Schedule table is displayed (Table 13.6). From this table one can, in principle, trace the order in which the clusters were merged in the course of the algorithm.
Table 13.6. Agglomeration Schedule

| Stage | Cluster Combined: Cluster 1 | Cluster Combined: Cluster 2 | Coefficients | Stage Cluster First Appears: Cluster 1 | Stage Cluster First Appears: Cluster 2 | Next Stage |
|-------|------------------------------|------------------------------|--------------|----------------------------------------|----------------------------------------|------------|
| 1     | 17 | 19 | 0.500   | 0  | 0  | 3  |
| 2     | 12 | 18 | 2.000   | 0  | 0  | 13 |
| 3     | 1  | 17 | 3.500   | 0  | 1  | 14 |
| 4     | 14 | 15 | 5.000   | 0  | 0  | 7  |
| 5     | 3  | 11 | 6.500   | 0  | 0  | 11 |
| 6     | 5  | 7  | 8.000   | 0  | 0  | 11 |
| 7     | 4  | 14 | 9.833   | 0  | 4  | 12 |
| 8     | 9  | 10 | 12.833  | 0  | 0  | 9  |
| 9     | 6  | 9  | 15.833  | 0  | 8  | 13 |
| 10    | 2  | 16 | 19.333  | 0  | 0  | 15 |
| 11    | 3  | 5  | 23.333  | 5  | 6  | 14 |
| 12    | 4  | 8  | 27.750  | 7  | 0  | 16 |
| 13    | 6  | 12 | 33.050  | 9  | 2  | 17 |
| 14    | 1  | 3  | 39.764  | 3  | 11 | 17 |
| 15    | 2  | 20 | 48.931  | 10 | 0  | 18 |
| 16    | 4  | 13 | 59.581  | 12 | 0  | 19 |
| 17    | 1  | 6  | 75.650  | 14 | 13 | 18 |
| 18    | 1  | 2  | 94.667  | 17 | 15 | 19 |
| 19    | 1  | 4  | 124.850 | 18 | 16 | 0  |
However, in our experience, the Coefficients column is the most informative part of this table, because with its help one can determine the number of clusters that is rational for the given case. This column contains the value of the clustering criterion at each step of the algorithm.
Fig. 13.20 shows the growth of the optimization criterion. Recall that in our case Ward's method is used with the squared Euclidean distance in the space of the original (non-standardized) variables. Therefore, the agglomeration coefficient at each step of the procedure has the meaning of the total intracluster variance of the resulting partition.
We see that in the first step of the algorithm (when each object still forms essentially a separate cluster) the intracluster variance is close to zero. Then it gradually begins to grow, since the algorithm glues ever more distant objects together into one cluster.
The optimal number of clusters of a partition is determined as follows: find the moment from which an abrupt increase in the criterion values begins. Obviously, this is the moment from which the algorithm is forced to "combine the uncombinable", collecting into a single cluster very distant, relatively compact sets of objects.
Fig. 13.20. Growth graph of the optimization criterion
In this case, judging by the graph, the jump-like change in the coefficients begins at the 16th stage of the algorithm. Therefore, the optimal number of clusters for our example is four (20 - 16 = 4).
With a large sample size, graphs of the type shown in Fig. 13.20 cannot be built for all stages of the agglomeration process. It is enough to analyze the last few dozen points in order to see at which of them an abrupt change of the criterion begins. Along with this graph, a graph of the differences in the criterion values between neighboring points can be useful.
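The graph of differences between neighboring criterion values mentioned here is easy to build programmatically. A sketch using the Coefficients column of Table 13.6 (the coefficients are typed in by hand from the table):

```python
import numpy as np

# The "Coefficients" column of Table 13.6, one value per agglomeration stage
coef = np.array([0.500, 2.000, 3.500, 5.000, 6.500, 8.000, 9.833, 12.833,
                 15.833, 19.333, 23.333, 27.750, 33.050, 39.764, 48.931,
                 59.581, 75.650, 94.667, 124.850])

# Increment of the criterion between neighboring stages; the analyst looks
# for the stage at which the increments start growing sharply
diffs = np.diff(coef)
for stage, d in enumerate(diffs, start=2):
    print(f"stage {stage:2d}: increment {d:7.3f}")
```

In this example the increments grow noticeably from about stage 16 onward, which matches the visual reading of Fig. 13.20 and the conclusion of four clusters.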
SPSS also has other tools designed to make it easier for the researcher to choose the optimal number of clusters. By default, in addition to the table of cluster formation from which we determine their rational number, SPSS displays a special icicle plot (Icicle), which, according to the intent of the program's creators, also helps to determine the rational number of clusters; the plots are ordered with the Plots button (see Figure 13.19). However, the analysis of this plot is very laborious even with a relatively small data file, so we do not show it in the book.
In addition to the icicle plot, in the Plots window we can choose a tree diagram (Dendrogram), which is, in our opinion, more convenient (Figure 13.21). It visually demonstrates the distances between objects and groups of objects (clusters). With a small number of respondents (up to 50-100), this chart really helps to make a rational decision on the number of clusters. However, in almost all real market research the sample is much larger, which makes the dendrogram largely useless.
Fig. 13.21. The tree diagram ( Dendrogram ) (The names of the students whose answers to the questionnaire questions were used for clustering are hidden.)
The tree diagram allows the following observations. The three points corresponding to students No. 17, 19 and 1 are very close to each other in the variable space. The pair of points No. 3 and 11 lie close together, as does the pair No. 5 and 7; the distance between these pairs is somewhat larger. The distance between the first three points and the next four is larger still. At the same time, these seven points taken together are very far from the five points No. 12, 18, 9, 10 and 6. Since the jump in distances here is very significant, it makes sense to conditionally cut the tree diagram at approximately the 10 mark of the conventional scale at the top. Then, as is easy to see, four clusters are formed:
o first: No. 17, 19, 1, 3, 11, 5 and 7;
o second: No. 12, 18, 9, 10 and 6;
o third: No. 2, 16 and 20;
o fourth: No. 14, 15, 4, 8 and 13.
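Cutting the tree into a prescribed number of clusters can be sketched with scipy's fcluster; the data here are synthetic stand-ins, not the students' actual answers:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 7))  # stand-in for 20 students x 7 statements

Z = linkage(X, method="ward")
# "Cutting" the dendrogram so that exactly four clusters remain, as done
# visually above with the conventional 10-point mark
labels = fcluster(Z, t=4, criterion="maxclust")
print(sorted(set(labels)))  # [1, 2, 3, 4]
for c in sorted(set(labels)):
    members = np.where(labels == c)[0] + 1  # 1-based object numbers
    print(f"cluster {c}:", members.tolist())
```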
The above example shows how laborious the analysis of tree diagrams becomes at industrial sample sizes of many thousands of respondents. Therefore, our experience favors choosing the optimal number of clusters on the basis of graphs built in MS Excel from the Coefficients column of the Agglomeration Schedule table.
After the optimal number of clusters has been chosen, the hierarchical clustering procedure is run again, but first, on the Save tab, the output of the clustering results for the number of clusters recognized as rational (Figure 13.22) is ordered into the file of initial data.
Fig. 13.22. The order to save the column with the cluster numbers to the file with the original data
After that, a column will be added to the file containing, for each respondent, the number of the cluster to which he or she is assigned.
In conclusion, we recall that these methods, unlike the K-means method, first, at least up to and including version 18 of SPSS, do not allow processing input data matrices of more than 12 thousand rows and, second, do not make it possible to take weight coefficients into account. The first problem is usually overcome by randomly selecting the rows to be analyzed. In the authors' experience, if the rows are selected randomly with probability directly proportional to the weight coefficients, the second problem is overcome as well.
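The weighted row-selection workaround described above can be sketched as follows; the number of rows and the weights are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n_rows = 50_000
weights = rng.uniform(0.5, 2.0, size=n_rows)  # hypothetical case weights

# Draw at most 12,000 distinct rows with probability proportional to the
# weights, so the subsample can then be clustered without weighting
sample_idx = rng.choice(n_rows, size=12_000, replace=False,
                        p=weights / weights.sum())
print(len(sample_idx))  # 12000
```

Rows with larger weights are more likely to enter the subsample, which approximately reproduces the effect of weighting in the subsequent unweighted cluster analysis.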