Adaptive cluster analysis, classification and multivariate graphics

 

Methods of cluster analysis, classification and multivariate graphics can be used to extract hidden knowledge from huge data sets containing numerical and non-numerical information. Usually this task can be done better by methods based on adaptive distance measures (H.-J. Mucha (1992): Clusteranalyse mit Mikrocomputern. Akademie Verlag, Berlin).

 

 

Core-based Clustering for Data Mining

 

Clustering techniques based on cores (representative points) are appropriate tools for data mining of large data sets. In this way, a huge number of values can be analysed efficiently. Moreover, the influence of outliers is reduced. Simulation studies are carried out in order to compare these new clustering techniques with well-known model-based ones.

 

Some Details on Core-based Clustering

 

Starting from model-based Gaussian clustering, new methods are developed. Parameterizations of the covariance matrix in the Gaussian model and their geometric interpretation are discussed in detail in Banfield and Raftery (1993): Model-Based Gaussian and Non-Gaussian Clustering. Biometrics, 49, 803–821.

The new methods are based on clustering of cores (representative points, mean vectors). Cores represent regions of high density. More generally, a core represents a set of observations with small pairwise distances between one another.
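The exact construction of cores by distance restrictions is detailed in Mucha (1992); the following is only a minimal sketch of the idea, assuming a greedy rule in which each core collects all still unassigned observations within a fixed squared distance of a seed point. The function name build_cores and the radius parameter are illustrative, not taken from ClusCorr98.

```python
import numpy as np

def build_cores(X, radius):
    """Greedily group the rows of X into cores: each core collects all
    not-yet-assigned observations within squared Euclidean distance
    `radius` of a seed point and is represented by their mean vector.
    Illustrative heuristic only, not the exact rule of Mucha (1992).
    Returns the core mean vectors and the core weights (sizes)."""
    unassigned = list(range(len(X)))
    cores, weights = [], []
    while unassigned:
        seed = unassigned[0]
        d2 = np.sum((X[unassigned] - X[seed]) ** 2, axis=1)
        members = [unassigned[i] for i in np.flatnonzero(d2 <= radius)]
        cores.append(X[members].mean(axis=0))   # core = mean vector
        weights.append(len(members))            # weight = #observations
        unassigned = [i for i in unassigned if i not in members]
    return np.array(cores), np.array(weights)
```

Each core can then be treated as a single weighted observation in the clustering criteria below.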

Let X=(xij) be a data matrix. In the following, as a simple example, the focus is on the case where the covariance matrix of each cluster is constrained to be diagonal but otherwise allowed to vary between the groups Ck. This is a step toward a generalization of the sum-of-squares criterion VK, which can be written without using mean vectors:

V_K = \sum_{k=1}^{K} \operatorname{tr}(W_k) = \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i \in C_k} \sum_{j \in C_k} d_{ij} ,                     (1)

where

W_k = \frac{1}{2 n_k} \sum_{i \in C_k} \sum_{j \in C_k} (x_i - x_j)(x_i - x_j)^{\mathsf T}                                      (2)

is the sample cross-product matrix for the kth cluster, and dij = ||xi − xj||² are the pairwise squared Euclidean distances between observations (Mucha, 1992). Here nk denotes the number of observations in the kth cluster. An equivalent distance-based formulation also holds for the logarithmic sum-of-squares criterion, which likewise has to be minimized:

\sum_{k=1}^{K} n_k \log \frac{\operatorname{tr}(W_k)}{n_k} .                     (3)
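The distance-based formulations rest on a standard identity that eliminates the mean vectors. Expanding the squared norms gives

\operatorname{tr}(W_k) = \sum_{i \in C_k} \|x_i - \bar{x}_k\|^2 = \sum_{i \in C_k} \|x_i\|^2 - n_k \|\bar{x}_k\|^2 = \frac{1}{2 n_k} \sum_{i \in C_k} \sum_{j \in C_k} d_{ij} ,

so both (1) and (3) can be evaluated from the pairwise distances dij alone, without ever computing the mean vectors.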

(2) and (3) can be generalized by using non-negative weights of observations instead of nk. This is one of the key ideas behind working with cores instead of individual observations.
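As a minimal sketch of this weighted generalization, assuming the weights enter in the natural way (nk becomes the sum of the weights in Ck and each pair (i, j) is weighted by wi·wj); the function name criteria and its arguments are illustrative, not the ClusCorr98 interface:

```python
import numpy as np

def criteria(D, labels, w=None):
    """Evaluate the clustering criteria (1) and (3) using only the matrix D
    of pairwise squared Euclidean distances and a label per observation.
    `w` holds non-negative observation weights (e.g. core sizes); with
    w = 1 for every observation the unweighted formulas are recovered."""
    labels = np.asarray(labels)
    w = np.ones(len(labels)) if w is None else np.asarray(w, dtype=float)
    V, L = 0.0, 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        wk = w[idx]
        nk = wk.sum()                      # generalizes n_k in (2) and (3)
        # tr(W_k) via pairwise distances, cf. (1) and (2):
        trWk = (wk[:, None] * wk[None, :] * D[np.ix_(idx, idx)]).sum() / (2 * nk)
        V += trWk                          # sum-of-squares criterion (1)
        L += nk * np.log(trWk / nk)        # logarithmic criterion (3);
                                           # requires positive scatter per cluster
    return V, L
```

Setting w to the core weights and D to the distances between core mean vectors evaluates the same criteria on cores instead of observations.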

 

Simulation Experiments

 

One example is given here. 250 artificial Gaussian samples of size 250 are generated with equal class probabilities (K=2 clusters). Each class is drawn from a multivariate normal distribution with unit covariance matrix. Class 1 has mean (a,a,…,a) and class 2 has mean (-a,-a,…,-a), with a=(4/J)^(1/2). The samples are analysed in parallel by the following seven partitioning (p) and hierarchical (h) cluster analysis techniques: K-Means (criterion (2), p), Ward ((2), h), WardV ((3), h), DistEx ((2), p), DistVEx ((3), p), DistExCore ((2), p), and DistVExCore ((3), p). Partitioning methods are based on exchanging observations between clusters. All partitioning methods except K-Means use only pairwise distances between observations. The last two methods work with 40 cores instead of the 250 observations; the cores are determined from each sample by distance restrictions. Figure 1 shows that K-Means, DistExCore, and DistVExCore perform best. However, the last two give the most stable results.
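A single sample of this design can be reproduced as follows; scikit-learn's KMeans stands in for the K-Means technique of the study, since the DistEx/DistVEx and core-based variants are not available as a public library:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
J, n, K = 20, 250, 2
a = np.sqrt(4 / J)

# One sample of the simulation design: two J-variate normals with unit
# covariance and means (a,...,a) and (-a,...,-a), equal class probabilities.
y = rng.integers(0, K, size=n)
X = rng.standard_normal((n, J)) + np.where(y[:, None] == 0, a, -a)

pred = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
print("adjusted Rand index:", adjusted_rand_score(y, pred))
```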

 

 

Fig 1. Summary of simulation results of clustering two normals N(a,1) and N(-a,1) with a=(4/J)^(1/2), where J=20 is the number of dimensions.

 

These results, as well as other simulation experiments (especially those with samples containing outliers), confirm that the core-based clustering methods perform very well and are stable against outliers. Moreover, they give good results in practical applications.

 

Statistical Software

 

Moreover, we offer software tools: the statistical software ClusCorr98 is written in Visual Basic for Applications. Internal and external databases are accessed from the Excel environment; see H.-J. Mucha, H. H. Bock (1996): Classification and Multivariate Graphics: Models, Software and Applications. WIAS Report No. 10, Berlin.

Various kinds of visualisation of the data and of the results make it easier to understand the data and the methods used, and they facilitate the formulation of hypotheses; see, for example, Mucha, H.-J., Simon, U., and Brüggemann, R. (2002): Model-based Cluster Analysis Applied to Flow Cytometry Data of Phytoplankton. Technical Report 5, WIAS, Berlin.

Fig 2. Fingerprint of the distance matrix of Roman bricks (extract).

 

 

An Application

Cluster analysis attempts to divide a set of objects into smaller, homogeneous and practically useful subsets (clusters). Objects that are similar to one another form a cluster, whereas dissimilar ones belong to different clusters. Here, similar means that the characteristics of any two objects of the same cluster should be close to each other in a well-founded sense. Once a distance measure between objects is derived, (multivariate) graphical techniques can be used to give insight into the data under investigation.

Figure 2 presents a graphical output of a distance matrix. Figure 3 shows the result of the core-based clustering of Roman bricks and tiles.
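A fingerprint like the one in Figure 2 can be approximated as a heat map of the reordered distance matrix; the following is a minimal matplotlib sketch, assuming the objects are simply sorted by cluster label, and is not the actual ClusCorr98 graphics:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

def fingerprint(X, labels):
    """Heat-map 'fingerprint' of the pairwise distance matrix with the
    objects reordered by cluster label, so clusters show up as dark
    blocks (small distances) along the diagonal."""
    order = np.argsort(labels)
    D = squareform(pdist(X[order], metric="sqeuclidean"))
    plt.imshow(D, cmap="gray")
    plt.title("Fingerprint of the distance matrix")
    plt.colorbar(label="squared Euclidean distance")
    plt.show()
```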

 

Fig 3. Principal components plot of the result of core-based clustering of 613 observations (Roman bricks).

 

 

See also the website http://www.homepages.ucl.ac.uk/~ucakche/agdank/agdankht2012/MuchaDANKBonn.pdf for more details and applications.

Last change 2012-12-14        mucha@wias-berlin.de