Statistics Laboratory, Mathematics Department,
The Faculty of Sciences, University of Jember
Topic: Cluster and Cluster Validation
By: IM. Tirta & D. Anggraeni, 2015
The objectives of this tutorial are so that readers are able to:
1. utilize validation graphics and validation scores to choose available clustering methods to get a better results of the analyses;
2. do more detail clustering analysis of the chosen method;
3. check more easily if two clustering dendogram are really different each other
4. give helpful visualization of the clustering results
1. Clustering & Various Clustering Methods
2. Cluster Validation
3. Practices with R
4. Reading Sources
Clustering & Various Clustering Methods
In general, the aim of cluster analyses are to identify unobserved grouping in data. They involve various combination of algorithms, distance measures and linkage methods, which frequently lead to different results. Therefore further criteria (validations) are needed to get a better or the best clustering results among the combinations.
R by RDCT (2015) has a wide variety of clustering algorithms available in the base distribution and various add-on packages. There are at least a total of nine (9) algorithms (namely Kmean, PAM, Model based, hirarchical, diana, agnes, fanny, SOM and SOTA). These algorithms spread out in various R packages including cluster (Maechler et al.,2013;Kauffmann & Rousseeuw, 1990) kohonen (Wehrens & Buydens, 2007), mclust (Fraley & Raftery, 2002; Fraley et al), and clValid package (Brock et al., 2008). Description of each clustering algorithms and their availability area described there and are sumarized
below (For detail explanations and examples of most of the methods see Wehrens 2011).
K-means minimizes the within-class sum of squares for a given number of clusters (Hartigan & Wong, 1979). The algorithm starts with an initial guess for the cluster centers, and each observation is placed in the cluster to which it is closest. The cluster centers are then updated, and the entire process is repeated until the cluster centers no longer move. K-means is implemented in the function kmeans(), included in the base distribution of R. This kind of partitional clustering algorithms have been recognized to be better suited for handling large document datasets than hierarchical ones, due to their relatively low computational requirements (Cutting, et al. 1992; Larsen and Aone, 1999; Steinbach, et al. 1997)
PAM (Partitioning Around Medoids)
Partitioning around medoids (PAM) is similar to K-means, but is considered more robust because it admits the use of other dissimilarities besides Euclidean distance. Like K-means, the number of clusters is determined in advance, and an initial set of cluster centers is required to start the algorithm. PAM is available in the cluster package as function pam().
Diana (DIvisive ANAlysis)
Diana is a divisive hierarchical algorithm that initially starts with all observations in a single cluster, and successively divides the clusters until each cluster contains a single observation. Along with SOTA, Diana is one of a few representatives of the divisive hierarchical approach to clustering. Diana is available in function diana() in package cluster .
Fanny (Fuzzy ANalYsis
This algorithm performs fuzzy clustering, where each observation can have partial membership in each cluster (Kauffmann & Rousseeuw, 1990). Thus, each observation has a vector which gives the partial membership to each of the clusters. A hard cluster can be produced by assigning each observation to the cluster where it has the highest membership. Fanny is available in the cluster package (function fanny()).
Self-organizing maps (SOM) is an unsupervised learning technique that is popular among computational biologists and machine learning researchers. SOM is based on neural networks, and is highly regarded for its ability to map and visualize high-dimensional data in two dimensions. SOM is available as the som() function in package kohonen (See also Kohonen, 1997).
Model based clustering
Under this approach, a statistical model consisting of a finite mixture of Gaussian distributions is fit to the data (Fraley & Rafter, 2002). Each mixture component represents a cluster, and the mixture components and group memberships are estimated using maximum likelihood (EM algorithm). The function Mclust() in package mclust implements model based clustering.
Self-organizing tree algorithm (SOTA) is an unsupervised network with a divisive hierarchical binary tree structure. It was originally proposed by Dopazo and Carazo (1997) for phylogenetic reconstruction, and then applied to cluster microarray gene expression data in (Herrero, et al.) It uses a fast algorithm and hence is suitable for clustering a large number of objects. SOTA is included with the clValid package as function sota().
The most common cluster validation techniques are based on one of the following three criteria: external indices, internal indices and relative indices, which are associated with the respective clustering structures, known as partition based, hierarchical and individual clustering (Oksanen, 2010). The clValid package offers three types of cluster validation, namely "internal", "stability", and "biological". At this stage,only two types, i.e. the internal and stability, are implemented in the GUI-web sas indicator to measure the "goodness" of clustering methods.
Internal validation consists of three measures which are Connectivity, Silhouette width, and Dunn index. The internal measures reflect the compactness, connectedness, and separation of the cluster partitions. Connectedness relates to what extent observations are placed in the same cluster as their nearest neighbors in the data space, and is here measured by the connectivity. Compactness assesses cluster homogeneity, while separation quantifies the degree of separation between clusters. Since compactness and separation demonstrate opposing trends (compactness increases with the number of clusters but separation decreases), popular methods combine the two measures into a single score.
The Silhouette width is the average of each observation's Silhouette value. The Silhouette value measures the degree of confidence in the clustering assignment of a particular observation, with well-clustered observations having values near 1 and poorly clustered observations having values near -1.The Dunn index is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. Silhouette() function is available in package cluster.
The stability validation compare the results from clustering based on the full data to clustering based on removing each column (variable), one at a time. The included measures are the average proportion of non-overlap (APN), the average distance (AD), the average distance between means (ADM), and the figure of merit (FOM). In all cases the average is taken over all the deleted columns, and all measures should be minimized.
The APN measures the average proportion of observations not placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed. The APN is in the interval [0, 1], with values close to zero corresponding with highly consistent clustering results. The AD measure computes the average distance between observations placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed.
The ADM measure computes the average distance between cluster centers for observations placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed. Currently, ADM only uses the Euclidean distance. It also has a value between zero and , and again smaller values are preferred. The FOM measures the average intra-cluster variance of the observations in the deleted column, where the clustering is based on the remaining (undeleted) samples. The measure of cluster validation are summarized in Table 1.
Biological validation is provided especially for microarray data where observations correspond to gene. Two biological validation measures available, the biological homogeneity index (BHI), measures the homogeneity of the clusters and biological stability index (BSI), measures the stability of the cluster. For annotation of the data we apply 3 main packages from bioconductor, namely: Biobase, annotate, and GO.
Practices with R
In this session, readers are given oppurtunities to practice using R (without worrying the program). Attention: Readers should make proper choices in every stage with
orange background. Please consult the referrences in related field for interprating the results.
Pada bagian ini anda dapat memilih beberapa data yang tersedia, serta menampilkannya secara lengkap atau hanya statistika ringkasnya.
Data (Bio) are type of 'microarray' genetic data which are suitable for
validation using biological. For Import Data, spesify the file:
Representation of Data
Output 1. Complete/ Summary of Data
2. Validation Eksploration
After seeing the information about data, we can choose variables for further analyses. We explore cluster validation based on clValid package from Brock et al., 2008). In this stage the examined methods are "kmeans", "pam", "model", "hierarchical". However for detail analyses, we also provide other methods namely: diana, agnes, fanny, sota, SOM. So all together there are 9 methods of clusterings can be done here.
Choosing variable & Methods of Validation
The majority of cluster analyses only analize numerical variables (not factors).
Choose variable with numerical scale (not factors).
Brock et al., (2008) provide 3 validation methods, all can be applied here, these are "internal", "stability","biological". Choose 1 in turn and see the results (numerical and graphical). Components and criteria for good cluster ara given in Table 1. Notes:Computation of biological validation take longer than others (more than 3 $\times$ those of internal & Stability), so it is advized that user examine "internal" and "stability" first before examining the "biological" measure and "biological" measures only for specific type of data (eg 'microarry', genetic data)
Table 1. Summary of Validation Criteria using clValid
APN (average proportion non-overlap)
AD (average distance)
ADM (average distance between means )
Visualization of Validation Results
The validation results are presented in the form of grahics and numerical results.
Graphical visualization of clValid results
Fig. 1. Validation of four (4) cluster algorithms
Two type of output are available the complete output and the summary which only give the
name and the score of suggested (indicated) 'best' clustering results.
The summary of various validation scores, suggest the most appropriate method and number of
cluster. We then can proceed to the detail of some chosen methods. If the suggested method and number of cluster is not
unique, then we use graphical visualization of the various method to ensure
us the most appropriate number of cluster for the current data.
3. Some More Detail Analyses with R
After having idea about appropriate cluster method and number of cluster,
user can analyse further for focusing on the methods.
For theory of clustring user can read Everitt et al. (2011).
In using KMean, user must determine the number of cluster before analysing data.
KMean only use Euclidean distance,
for calculation KMean can use one of 3 (three ) algorithms (see Maechler et. al, 2015).
Output (KMeans Centers)
PAM (Robust K-Mean)
PAM (Partition Arround Medoid) is sometimes called Robust K-Mean
and in addition to euclidean distance it can also utilizes other distance (such as manhattan)
Output (Robust KMeans Medoids)
Select the required output
Hierarchical Clustering (H-clust, Diana, Agnes)
Valid distance for Agnes only 'euclidean' or 'manhattan'
Fig. Cluster Dendrogram
Fig. Cluster Dendrogram with Horizontal Layout
For hierarchical clustering with Agnes, the analysis can be combined with PCA applying
function HCPC in FactoMineR package (Husson, 2015).
Fig. Cluster Dendogram with PCA
Self Organizing Map (SOM), Fanny & SOTA
Select new setting for number of cluster and