Statistics Laboratory, Mathematics Department,
The Faculty of Sciences, University of Jember

Topic: Cluster and Cluster Validation

By: IM. Tirta & D. Anggraeni (2015), Revised and Updated 2022

Objectives

After completing this tutorial, readers should be able to:

1. use validation graphics and validation scores to choose among the available clustering methods and obtain better analysis results;

2. carry out a more detailed cluster analysis with the chosen method;

3. check more easily whether two clustering dendrograms really differ from each other;

4. produce helpful visualizations of the clustering results.

Subtopics

1. Clustering & Various Clustering Methods

2. Cluster Validation

3. Practices with R

4. Reading Sources

Clustering & Various Clustering Methods

In general, the aim of cluster analysis is to identify unobserved groupings in data. It involves various combinations of algorithms, distance measures, and linkage methods, which frequently lead to different results. Further criteria (validation measures) are therefore needed to select a better, or the best, clustering result among these combinations.

R (R Development Core Team, 2022) offers a wide variety of clustering algorithms in the base distribution and in add-on packages. At least nine (9) algorithms are available: K-means, PAM, model-based, hierarchical, Diana, Agnes, Fanny, SOM, and SOTA. They are spread across several R packages, including cluster (Maechler et al., 2013; Kaufman & Rousseeuw, 1990), kohonen (Wehrens & Buydens, 2007), mclust (Fraley & Raftery, 2002; Fraley et al., 2012), and clValid (Brock et al., 2008). The algorithms and their availability are described in those sources and summarized below (for detailed explanations and examples of most of the methods, see Wehrens, 2011).

K-means

K-means minimizes the within-class sum of squares for a given number of clusters (Hartigan & Wong, 1979). The algorithm starts with an initial guess for the cluster centers, and each observation is placed in the cluster to which it is closest. The cluster centers are then updated, and the entire process is repeated until the cluster centers no longer move. K-means is implemented in the function kmeans(), included in the base distribution of R. Partitional clustering algorithms of this kind are considered better suited to large document datasets than hierarchical ones because of their relatively low computational requirements (Cutting et al., 1992; Larsen and Aone, 1999; Steinbach et al., 2000).
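As a minimal sketch of the procedure described above, the following uses kmeans() from base R. The iris data, the object names, and the choice of three clusters are illustrative assumptions, not part of the tutorial's own application.

```r
# Illustrative data (assumption): the four numeric columns of the built-in iris set
x <- scale(iris[, 1:4])

set.seed(123)                 # k-means starts from randomly chosen centres
km <- kmeans(x,
             centers = 3,     # number of clusters, fixed in advance
             nstart  = 25)    # repeat from 25 random starts and keep the best

km$centers                    # final cluster centres
km$tot.withinss               # total within-cluster sum of squares (the minimised criterion)
table(km$cluster)             # cluster sizes
```

Using several random starts (nstart) is a common precaution, because a single start can end in a poor local minimum of the within-cluster sum of squares.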

PAM (Partitioning Around Medoids)

Partitioning around medoids (PAM) is similar to K-means, but is considered more robust because it admits the use of other dissimilarities besides Euclidean distance. Like K-means, the number of clusters is determined in advance, and an initial set of cluster centers is required to start the algorithm. PAM is available in the cluster package as function pam().
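A short sketch of PAM with a non-Euclidean dissimilarity, assuming the cluster package is installed. The data set, the value k = 3, and the Manhattan metric are illustrative choices only.

```r
library(cluster)

# Illustrative data (assumption): standardised numeric columns of iris
x <- scale(iris[, 1:4])

pm <- pam(x, k = 3, metric = "manhattan")  # number of clusters chosen in advance

pm$id.med              # row indices of the observations chosen as medoids
pm$medoids             # the medoids themselves (actual observations)
table(pm$clustering)   # cluster sizes
```

Unlike K-means centres, each medoid is an observed data point, which is one reason PAM is less sensitive to outliers.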

Diana (DIvisive ANAlysis)

Diana is a divisive hierarchical algorithm that initially starts with all observations in a single cluster, and successively divides the clusters until each cluster contains a single observation. Along with SOTA, Diana is one of a few representatives of the divisive hierarchical approach to clustering. Diana is available as the function diana() in package cluster.
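The divisive approach can be tried with the following sketch, which assumes the cluster package and uses the iris data and a cut into three groups purely for illustration.

```r
library(cluster)

# Illustrative data (assumption): standardised numeric columns of iris
x <- scale(iris[, 1:4])

dv <- diana(x, metric = "euclidean")   # divisive hierarchical clustering
dv$dc                                  # divisive coefficient (closer to 1 = clearer structure)

# Cut the divisive tree into three groups
grp <- cutree(as.hclust(dv), k = 3)
table(grp)
```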

Fanny (Fuzzy ANalYsis)

This algorithm performs fuzzy clustering, where each observation can have partial membership in each cluster (Kaufman & Rousseeuw, 1990). Thus, each observation has a vector which gives its partial membership in each of the clusters. A hard clustering can be produced by assigning each observation to the cluster where it has the highest membership. Fanny is available in the cluster package (function fanny()).
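The sketch below illustrates the membership vectors and the derived hard clustering, assuming the cluster package. The data, k = 3, and the membership exponent 1.5 are illustrative assumptions.

```r
library(cluster)

# Illustrative data (assumption): standardised numeric columns of iris
x <- scale(iris[, 1:4])

fz <- fanny(x, k = 3, memb.exp = 1.5)   # memb.exp controls how fuzzy the result is

head(round(fz$membership, 2))           # one membership vector per observation (rows sum to 1)

# Hard clustering: assign each observation to its highest-membership cluster
hard <- apply(fz$membership, 1, which.max)
table(hard)
```

The fitted object also stores a crisp assignment in fz$clustering.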

SOM (Self-Organizing Map)

The self-organizing map (SOM) is an unsupervised learning technique that is popular among computational biologists and machine learning researchers. SOM is based on neural networks, and is highly regarded for its ability to map and visualize high-dimensional data in two dimensions. SOM is available as the som() function in package kohonen (see also Kohonen, 1997).
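A minimal sketch of fitting a SOM, assuming the kohonen package is installed. The 4 x 4 hexagonal grid and the iris data are illustrative assumptions.

```r
library(kohonen)

# Illustrative data (assumption): standardised numeric columns of iris
x <- scale(iris[, 1:4])

set.seed(123)   # the map is initialised at random
sm <- som(x, grid = somgrid(xdim = 4, ydim = 4, topo = "hexagonal"))

table(sm$unit.classif)   # number of observations mapped to each map unit
```

Each of the 16 map units acts as a small cluster; neighbouring units hold similar observations, which is what makes the two-dimensional display informative.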

Model based clustering

Under this approach, a statistical model consisting of a finite mixture of Gaussian distributions is fit to the data (Fraley & Raftery, 2002). Each mixture component represents a cluster, and the mixture components and group memberships are estimated using maximum likelihood (EM algorithm). The function Mclust() in package mclust implements model-based clustering.
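A short sketch of model-based clustering, assuming the mclust package is installed. The data and the candidate range of one to five components are illustrative assumptions.

```r
library(mclust)

# Illustrative data (assumption): numeric columns of iris
x <- iris[, 1:4]

mc <- Mclust(x, G = 1:5)      # fit Gaussian mixtures with 1 to 5 components

mc$G                          # number of components selected by BIC
mc$modelName                  # covariance structure of the selected model
head(round(mc$z, 3))          # estimated membership probabilities
table(mc$classification)      # hard assignment by highest probability
```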

SOTA

The self-organizing tree algorithm (SOTA) is an unsupervised network with a divisive hierarchical binary tree structure. It was originally proposed by Dopazo and Carazo (1997) for phylogenetic reconstruction, and later applied to clustering microarray gene expression data (Herrero et al.). It uses a fast algorithm and hence is suitable for clustering a large number of objects. SOTA is included in the clValid package as the function sota().
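The following sketch shows one way to call sota(), assuming the clValid package is installed. The data and the value maxCycles = 2 (two splits, hence at most three clusters) are illustrative assumptions.

```r
library(clValid)

# Illustrative data (assumption): standardised numeric columns of iris
x <- as.matrix(scale(iris[, 1:4]))

st <- sota(x, maxCycles = 2)   # each cycle splits one cell of the tree

table(st$clust)                # cluster assignment of the observations
st$totals                      # cluster sizes as stored in the fitted object
```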

Cluster Validation

The most common cluster validation techniques are based on one of three types of criteria: external, internal, and relative indices. These are associated with the respective clustering structures, namely partition-based, hierarchical, and individual clustering (Oksanen, 2010). The clValid package offers three types of cluster validation: "internal", "stability", and "biological". At this stage, only two of them, internal and stability, are implemented in the web GUI as indicators of the "goodness" of the clustering methods.

Internal Validation

Internal validation consists of three measures which are Connectivity, Silhouette width, and Dunn index. The internal measures reflect the compactness, connectedness, and separation of the cluster partitions. Connectedness relates to what extent observations are placed in the same cluster as their nearest neighbors in the data space, and is here measured by the connectivity. Compactness assesses cluster homogeneity, while separation quantifies the degree of separation between clusters. Since compactness and separation demonstrate opposing trends (compactness increases with the number of clusters but separation decreases), popular methods combine the two measures into a single score.

The Silhouette width is the average of the observations' Silhouette values. The Silhouette value measures the degree of confidence in the clustering assignment of a particular observation, with well-clustered observations having values near 1 and poorly clustered observations having values near -1. The Dunn index is the ratio of the smallest distance between observations in different clusters to the largest intra-cluster distance. The silhouette() function is available in package cluster.
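To make the two definitions concrete, the sketch below computes the Silhouette width with silhouette() and the Dunn index directly from its definition. The data, the average-linkage clustering, and k = 3 are illustrative assumptions.

```r
library(cluster)

# Illustrative data and clustering (assumptions)
x  <- scale(iris[, 1:4])
d  <- dist(x)                                      # Euclidean distances
cl <- cutree(hclust(d, method = "average"), k = 3)

# Silhouette width: mean of the per-observation silhouette values
sil <- silhouette(cl, d)
sil_width <- mean(sil[, "sil_width"])
sil_width

# Dunn index: smallest distance between observations in different clusters
# divided by the largest distance between observations in the same cluster
dm   <- as.matrix(d)
same <- outer(cl, cl, "==")                        # TRUE where both are in the same cluster
dunn_index <- min(dm[!same]) / max(dm[same])
dunn_index
```

Both scores are to be maximized, so a larger value indicates a more compact and better separated partition.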

Stability validation

The stability validation measures compare the results from clustering based on the full data to clustering based on removing each column (variable), one at a time. The included measures are the average proportion of non-overlap (APN), the average distance (AD), the average distance between means (ADM), and the figure of merit (FOM). In all cases the average is taken over all the deleted columns, and all measures should be minimized.

The APN measures the average proportion of observations not placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed. The APN is in the interval [0, 1], with values close to zero corresponding with highly consistent clustering results. The AD measure computes the average distance between observations placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed.

The ADM measure computes the average distance between cluster centers for observations placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed. Currently, ADM only uses the Euclidean distance. It takes values between zero and $\infty$, and again smaller values are preferred. The FOM measures the average intra-cluster variance of the observations in the deleted column, where the clustering is based on the remaining (undeleted) samples. The measures of cluster validation are summarized in Table 1.
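As an illustration of the leave-one-column-out idea, the sketch below computes the APN by hand in base R. It is a simplified sketch, not the clValid implementation; the data, the average-linkage clustering, and k = 3 are assumptions.

```r
# Illustrative data and clustering function (assumptions)
x <- scale(iris[, 1:4])
k <- 3
cluster_fun <- function(m) cutree(hclust(dist(m), method = "average"), k = k)

full <- cluster_fun(x)                      # clustering based on the full data

# APN contribution of deleting column j: for each observation i, the share of
# its full-data cluster that is NOT in its cluster after the deletion
apn_one <- function(j) {
  del <- cluster_fun(x[, -j, drop = FALSE])
  overlap <- sapply(seq_len(nrow(x)), function(i) {
    in_full <- full == full[i]
    in_del  <- del == del[i]
    sum(in_full & in_del) / sum(in_full)
  })
  mean(1 - overlap)
}

# Average over all deleted columns
apn <- mean(sapply(seq_len(ncol(x)), apn_one))
apn
```

A value near zero means that removing any single variable hardly changes the cluster memberships.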

Biological validation

Biological validation is provided especially for microarray data, where observations correspond to genes. Two biological validation measures are available: the biological homogeneity index (BHI), which measures the homogeneity of the clusters, and the biological stability index (BSI), which measures the stability of the clusters. To annotate the data we use three main packages from Bioconductor: Biobase, annotate, and GO.

Practices with R

In this session, readers are given the opportunity to practice with R (without having to worry about programming). Attention: readers should make an appropriate choice at every stage marked with an orange background. Please consult the references in the related field when interpreting the results.

Data Activation

In this section you can choose among several available data sets and display them either in full or as summary statistics only.

Data Choice

Data sets labelled (Bio) are 'microarray' genetic data, which are suitable for biological validation. For Import Data, specify the file:
Header: , Separator: , Quote:
Representation of Data Output

1. Complete/Summary of Data

2. Validation Exploration

After inspecting the data, we can choose variables for further analysis. We explore cluster validation using the clValid package (Brock et al., 2008). At this stage the methods examined are "kmeans", "pam", "model", and "hierarchical". For the detailed analyses we also provide other methods, namely Diana, Agnes, Fanny, SOTA, and SOM, so altogether nine clustering methods are available here.

Choosing variable & Methods of Validation

Most cluster analyses handle only numerical variables, so choose variables measured on a numerical scale (not factors). Brock et al. (2008) provide three validation methods, "internal", "stability", and "biological", all of which can be applied here. Choose one at a time and inspect the results (numerical and graphical). The components and the criteria for a good clustering are given in Table 1.
Note: Computing the biological validation takes longer than the others (more than 3 $\times$ the time for internal and stability), so users are advised to examine "internal" and "stability" first, and to use the "biological" measures only for specific types of data (e.g. 'microarray' genetic data).

Variables (numeric)

Validation Method

Table 1. Summary of Validation Criteria using clValid

Types of Validation	Components	Value	Criteria
Internal	Connectivity	$[0,+\infty)$	MINimized
	Silhouette	$[-1,+1]$	MAXimized
	Dunn Index	$[0,+\infty)$	MAXimized
Stability	APN (average proportion of non-overlap)	$[0,1]$	MINimized
	AD (average distance)	$[0,+\infty)$	MINimized
	ADM (average distance between means)	$[0,+\infty)$	MINimized
	FOM (figure of merit)	$[0,+\infty)$	MINimized
Biological	BHI	$[0,1]$	MAXimized
	BSI	$[0,1]$	MAXimized
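A validation run of the kind summarized in Table 1 might be sketched as follows, assuming the clValid package is installed. The data, the range of two to six clusters, and the three methods shown are illustrative assumptions ("model" can be added to clMethods when mclust is installed).

```r
library(clValid)

# Illustrative data (assumption): standardised numeric columns of iris
x <- as.matrix(scale(iris[, 1:4]))
rownames(x) <- seq_len(nrow(x))          # give the observations explicit row names

cv <- clValid(x,
              nClust     = 2:6,                                # numbers of clusters to compare
              clMethods  = c("hierarchical", "kmeans", "pam"),
              validation = c("internal", "stability"))

summary(cv)                  # all scores, followed by the optimal ones
best <- optimalScores(cv)    # best score, method and number of clusters per measure
best
```

Each row of the optimalScores() result corresponds to one measure, so different measures may point to different methods or numbers of clusters.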

Visualization of Validation Results

The validation results are presented in the form of graphics and numerical results.

Graphical visualization of clValid results

Fig. 1. Validation of four (4) cluster algorithms

Validation scores

Two types of output are available: the complete output, and a summary that gives only the name and score of the suggested (indicated) 'best' clustering result.

The type of Validation results

The summary of the various validation scores suggests the most appropriate method and number of clusters. We can then proceed to a detailed analysis with the chosen methods. If the suggested method and number of clusters are not unique, we use the graphical visualization of the various methods to settle on the most appropriate number of clusters for the current data.

3. Some More Detail Analyses with R

Once an appropriate clustering method and number of clusters have been identified, users can carry out a further analysis focused on those methods. For the theory of clustering, see Everitt et al. (2011).

KMean

With K-means, the number of clusters must be specified before the data are analysed. K-means uses only Euclidean distance, and the computation can use one of three (3) algorithms (see Maechler et al., 2015).

Algorithms

Number of Cluster

Output (KMeans Centers)

Graphics

PAM (Robust K-Mean)

PAM (Partitioning Around Medoids) is sometimes called robust K-means; in addition to Euclidean distance, it can also use other distances (such as Manhattan).

Distance

Output (Robust KMeans Medoids)

Graphics

Model Based

Selected Output

Select the required output

Selected output

Graphics

Hierarchical Clustering (H-clust, Diana, Agnes)

For Agnes, the only valid distances are 'euclidean' and 'manhattan'.
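The sketch below fits an agglomerative tree with agnes() and with hclust() and cross-tabulates the resulting groups, which is one simple way to check whether two dendrograms really differ. It assumes the cluster package; the data, Manhattan distance, average linkage, and k = 3 are illustrative choices.

```r
library(cluster)

# Illustrative data (assumption): standardised numeric columns of iris
x <- scale(iris[, 1:4])

ag <- agnes(x, metric = "manhattan", method = "average")
hc <- hclust(dist(x, method = "manhattan"), method = "average")

ag$ac                                                # agglomerative coefficient
plot(as.dendrogram(as.hclust(ag)), horiz = TRUE)     # dendrogram with horizontal layout

# Cross-tabulate the three-group cuts of both trees
agreement <- table(agnes  = cutree(as.hclust(ag), k = 3),
                   hclust = cutree(hc, k = 3))
agreement
```

If every row and every column of the cross-table has a single non-zero cell, the two cuts define the same partition.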

Algorithm for hierarchical clustering

Distance:

linkage methods

Output: Ordered

Fig. Cluster Dendrogram

Fig. Cluster Dendrogram with Horizontal Layout
For hierarchical clustering with Agnes, the analysis can be combined with PCA using the function HCPC in the FactoMineR package (Husson et al., 2015).
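A hedged sketch of combining PCA with hierarchical clustering, assuming the FactoMineR package is installed. The data and the choice of three clusters are illustrative assumptions.

```r
library(FactoMineR)

# PCA on the illustrative iris measurements (assumption), without drawing graphs
pc <- PCA(iris[, 1:4], scale.unit = TRUE, graph = FALSE)

# Hierarchical clustering on the principal components, cut into three clusters
hp <- HCPC(pc, nb.clust = 3, graph = FALSE)

table(hp$data.clust$clust)   # cluster sizes; 'clust' is appended to the original data
plot(hp, choice = "tree")    # dendrogram of the clustering on the components
```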

Fig. Cluster Dendrogram with PCA

Self Organizing Map (SOM), Fanny & SOTA

Select new settings for the number of clusters and the variables. Number of clusters and variables for SOM

Fanny (Fuzzy ANalYsis)

Fig Example of Fanny Graphics

SOTA (SELF ORGANIZING TREE ALGORITHM)

We can choose one of the two distance measures provided.
Using the above choices (number of clusters and distance), the SOTA results are summarized as follows:

Fig Example of SOTA Graphics

From the graphics it is easy to see whether a cluster is already homogeneous enough or should be divided into more clusters. The most homogeneous cluster is the one with the smallest diversity.

SOM (SELF ORGANIZING MAP)

Two types of SOM are provided. The first is SOM() from the class package (distributed alongside MASS), and the second is som() from the kohonen package. If SOM raises an error, reduce the number of variables involved (the sample size must be much larger than the number of variables).

Graphics of Simple SOM

Fig Example of Simple Graphics of SOM

Graphics of kohonen SOM

Fig Example of kohonen SOM Graphics

Some available SOM Graphical Types

Detail of Selected som Output

We can also print the details of some of the output from the som analysis.
Select the required output

The details of the selected output are printed below:

Reading Sources & References:

R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2022. URL http://www.R-project.org/.
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K.(2013). cluster: Cluster Analysis Basics and Extensions. R package version 1.14.4.
Kaufman, L. and P. J. Rousseeuw. Finding Groups in Data. An Introduction to Cluster Analysis. Wiley, New York, 1990.
Wehrens, R. and L.M.C. Buydens. Self- and Super-organising Maps in R: the kohonen package. Journal of Statistical Software, 21(5), 2007.
Fraley, C. and Adrian E. Raftery (2002) Model-based Clustering, Discriminant Analysis and Density Estimation. Journal of the American Statistical Association 97:611-631
Fraley, C., Adrian E. Raftery, T. Brendan Murphy, and Luca Scrucca (2012) mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report No. 597, Department of Statistics, University of Washington.
Brock, G., Pihur, V., Datta, S. and Datta, S. (2008). clValid: An R Package for Cluster Validation. Journal of Statistical Software, 25(4), 1-22. URL http://www.jstatsoft.org/v25/i04/.
Hartigan, J.A., and M.A. Wong. A k-means clustering algorithm. Applied Statistics, 28:100-108, 1979.
Cutting,D.R., J. O. Pedersen, D. R. Karger, and J. W. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the ACM SIGIR, 1992.
Larsen B. and C. Aone. Fast and effective text mining using linear-time document clustering. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
Steinbach,M., G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
Kohonen, T., Self-Organizing Maps. Springer-Verlag, second edition, 1997.
Dopazo, J. and J. M. Carazo. Phylogenetic reconstruction using a growing neural network that adopts the topology of a phylogenetic tree. Journal of Molecular Evolution, pages 226-233, 1997.
Herrero, J., A. Valencia, and J. Dopazo. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17(2):126-136, 2001.
Kindt, R. & Coe, R. 2005. Chapter 4. Analysis of Species Richness. Tree Diversity Analysis. World Agroforestry Centre.
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K.2015. cluster: Cluster Analysis Basics and Extensions. R package version 2.0.1.
Jari Oksanen, F. Guillaume Blanchet, Roeland Kindt, Pierre Legendre, Peter R. Minchin, R. B. O'Hara, Gavin L. Simpson, Peter Solymos, M. Henry H. Stevens and Helene Wagner. 2015. vegan: Community Ecology Package. R package version 2.2-1. http://CRAN.R-project.org/package=vegan
Husson, F., Josse,J., Le,S. and Mazet, J. 2015. FactoMineR: Multivariate Exploratory Data Analysis and Data Mining. R package version 1.29. http://CRAN.R-project.org/package=FactoMineR
Everitt, B.S., Landau, S., Leese, M., Stahl, D. 2011. Cluster Analysis (5th ed.). Wiley.
Le, S. Pages, J. 2011. Exploratory Multivariate Analysis by Example Using R. CRC Press
Wehrens, R. 2011. Chemometrics with R: Multivariate Data Analysis in the Natural Sciences and Life Sciences. Springer-Verlag: Berlin Heidelberg

Appendices: Distance Measures

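The distances used are defined as follows (Oksanen, J., Smith, T., and Bedward, M., in the vegan package, function vegdist() in R). In the binary versions, $A$ and $B$ are the numbers of species on the two compared sites and $J$ is the number of species that occur on both.

euclidean: $$ d[jk] = \sqrt{\sum (x[ij]-x[ik])^2}, \text{ binary: } \sqrt{A+B-2J} $$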

manhattan: $$ d[jk] = \sum\left(|x[ij] - x[ik]|\right), \text{ binary:} A+B-2J $$

gower $$ d[jk] = (1/M) \sum\left(\frac{|x[ij]-x[ik]|}{\max(x[i])-\min(x[i])}\right), \text{ binary:} (A+B-2J)/M,$$ where $M$ is the number of columns (excluding missing values)

altGower $$ d[jk] = (1/Nz) \sum(|x[ij] - x[ik]|), \text{ binary:} (A+B-2J)/(A+B-J)$$ where $Nz$ is the number of non-zero columns excluding double-zeros (Anderson et al. 2006).

canberra $$ d[jk] = (1/Nz) \sum \frac{|x[ij]-x[ik]|}{|x[ij]|+|x[ik]|}, \text{ binary: } (A+B-2J)/(A+B-J)$$ where $Nz$ is the number of non-zero entries.

Bray-Curtis (bray) $$ d[jk] = \frac{\sum |x[ij]-x[ik]|}{\sum (x[ij]+x[ik])}, \text{ binary: } (A+B-2J)/(A+B) $$

kulczynski $$ d[jk] = 1 - 0.5\left(\frac{\sum \min(x[ij],x[ik])}{\sum x[ij]} + \frac{\sum \min(x[ij],x[ik])}{\sum x[ik]}\right), \text{ binary: } 1-(J/A + J/B)/2 $$

morisita $$ d[jk] = 1 - \frac{2\sum (x[ij]\times x[ik])}{(\lambda[j]+\lambda[k]) \times \sum x[ij] \times \sum x[ik]}, $$ where $$ \lambda[j] = \frac{\sum (x[ij]\times (x[ij]-1))}{\sum x[ij] \times \left(\sum x[ij] - 1\right)}, \text{ binary: cannot be calculated} $$

horn: like morisita, but $\lambda[j] = \sum(x[ij]^2)/(\sum x[ij])^2$, binary: $(A+B-2J)/(A+B)$

cao $$ d[jk] = (1/S) \sum(\log(n[i]/2) - (x[ij]\log(x[ik]) + x[ik]\log(x[ij]))/n[i]),$$ where $S$ is the number of species in compared sites and $n[i] = x[ij] + x[ik]$
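Some of these definitions can be checked numerically with a few lines of base R. The two abundance vectors below are hypothetical values chosen for illustration; the Euclidean and Manhattan results are compared with dist(), while Bray-Curtis and Canberra are computed directly from the formulas above.

```r
# Hypothetical abundances of four species at two sites j and k (assumption)
xj <- c(3, 0, 5, 2)
xk <- c(1, 4, 5, 0)

euclid    <- sqrt(sum((xj - xk)^2))
manhattan <- sum(abs(xj - xk))                        # 8
bray      <- sum(abs(xj - xk)) / sum(xj + xk)         # 0.4

nz        <- sum(xj + xk > 0)                         # columns that are not double zeros
canberra  <- sum(abs(xj - xk) / (abs(xj) + abs(xk))) / nz   # 0.625

# Base R's dist() reproduces the first two measures
m <- rbind(xj, xk)
c(by_hand = euclid,    dist = as.numeric(dist(m, method = "euclidean")))
c(by_hand = manhattan, dist = as.numeric(dist(m, method = "manhattan")))
```

Because the example has no column in which both sites are zero, the Canberra terms never involve a division by zero.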
