Adaptive cluster analysis, classification and multivariate graphics

Methods of cluster analysis, classification and multivariate
graphics can be used to extract hidden knowledge from huge data sets
containing numerical and non-numerical information. Usually this task can be
done better by using methods based on adaptive distance measures
(H.-J. Mucha (1992): *Clusteranalyse mit Mikrocomputern.* Akademie Verlag,
Berlin).

Core-based Clustering for Data Mining

Clustering techniques based on cores (representative points) are appropriate tools for data mining of large data sets. In this way, huge amounts of data can be analysed efficiently. Moreover, the influence of outliers is reduced. Simulation studies are carried out in order to compare these new clustering techniques with well-known model-based ones.


Some Details on Core-based Clustering

Starting from model-based Gaussian clustering new methods are developed. Parameterizations of the covariance matrix in the Gaussian model and their geometric interpretation are discussed in detail in Banfield and Raftery (1993): Model-Based Gaussian and Non-Gaussian Clustering. Biometrics, 49, 803–821.

The new methods are based on clustering of cores (representative points, mean vectors). Cores represent regions of high density. More generally, a core represents a set of observations with small pairwise distances between one another.
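The idea of forming cores can be sketched in a few lines of Python. This is a minimal greedy illustration under assumed conventions (the function name, the single-radius distance restriction, and the greedy seeding are assumptions, not the ClusCorr98 implementation):

```python
import numpy as np

def extract_cores(X, radius):
    """Greedy sketch of core building: each core collects all
    not-yet-assigned observations within `radius` of a seed
    observation; the core is represented by their mean vector
    and carries its size as a weight."""
    remaining = list(range(len(X)))
    cores, weights = [], []
    while remaining:
        seed = remaining[0]
        d = np.linalg.norm(X[remaining] - X[seed], axis=1)
        members = [remaining[i] for i in np.flatnonzero(d <= radius)]
        cores.append(X[members].mean(axis=0))  # representative point
        weights.append(len(members))           # core weight = core size
        remaining = [i for i in remaining if i not in members]
    return np.array(cores), np.array(weights)
```

Each core then enters a subsequent clustering step as a single weighted observation, which is what reduces both the computational burden and the influence of single outlying points.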

Let **X** = (*x_{ij}*) be a data matrix of *n* observations and *J* variables. In the following, as a quite simple example, the focus is directed on the case when the covariance matrix of each cluster is constrained to be diagonal, but otherwise allowed to vary between groups:

**Σ**_{k} = diag(λ_{k1}, …, λ_{kJ}), k = 1, …, K. (1)

An appropriate sum-of-squares criterion to be minimized is

V_{K} = ∑_{k=1}^{K} tr(**W**_{k}) = ∑_{k=1}^{K} (1/n_{k}) ∑_{i<j ∈ C_{k}} d_{ij}, (2)

where **W**_{k} = ∑_{i ∈ C_{k}} (**x**_{i} − **x̄**_{k})(**x**_{i} − **x̄**_{k})^{T} is the sample
cross-product matrix for the *k*th cluster and *d_{ij}* are
the pairwise squared Euclidean distances between observations (Mucha, 1992).
Here a logarithmic variant that allows for different cluster volumes is

U_{K} = ∑_{k=1}^{K} n_{k} log(tr(**W**_{k})/n_{k}). (3)

(2) and (3) can be generalized by using non-negative
weights of observations instead of the cluster sizes *n_{k}*.
That is one of the key ideas for working with cores instead of observations.
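As a sketch of how a sum-of-squares criterion and its weighted generalization can be evaluated from pairwise squared distances alone, consider the following Python illustration (the function name and exact weighting convention are assumptions; for unit weights it reduces to summing d_ij over pairs within each cluster, divided by the cluster size):

```python
import numpy as np

def weighted_ss_criterion(D, labels, w=None):
    """Sum-of-squares criterion from pairwise squared Euclidean
    distances D. For unit weights this is sum_k (1/n_k) * sum of
    d_ij over pairs i<j in cluster k; non-negative observation
    weights w replace the counts n_k, which is the key step for
    clustering weighted cores instead of raw observations."""
    labels = np.asarray(labels)
    w = np.ones(len(D)) if w is None else np.asarray(w, dtype=float)
    total = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        wk = w[idx]
        # weight each pair (i, j) in cluster k by w_i * w_j
        pair = np.outer(wk, wk) * D[np.ix_(idx, idx)]
        total += np.triu(pair, 1).sum() / wk.sum()
    return total
```

With integer weights the weighted value coincides with the unweighted criterion computed on data in which each observation is replicated according to its weight, which is why cores with size weights can stand in for their member observations.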

**Simulation experiments**

One example is given here. There are 250 artificially generated Gaussian samples of size
250, drawn with equal class probabilities (*K*=2 clusters). Each class is drawn
from a multivariate normal distribution with unit covariance matrix. Class 1
has mean (*a*,*a*,…,*a*) and class 2 has mean (−*a*,−*a*,…,−*a*)
with *a*=(4/*J*)^{1/2}. The samples are analysed in a
parallel fashion by the following seven partitioning (p) and hierarchical (h) cluster
analysis techniques: *K-Means* (criterion (2), p), *Ward* ((2), h), *WardV*
((3), h), *DistEx* ((2), p), *DistVEx* ((3), p), *DistExCore*
((2), p), and *DistVExCore* ((3), p). Partitioning methods are based on
exchanging observations between clusters. All partitioning methods except *K-Means*
use pairwise distances between observations only. The last two methods
work with 40 cores instead of 250 observations. The cores are determined from
each sample by using distance restrictions. Figure 1 shows that *K-Means*,
*DistExCore*, and *DistVExCore* perform best. However, the last two
give the most stable results.
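The simulation design above can be reproduced as follows (a minimal sketch; the function name and the fixed random seed are assumptions):

```python
import numpy as np

def draw_sample(n=250, J=20, seed=0):
    """One artificial sample as in the experiment: n observations,
    equal class probabilities, unit covariance matrix, class means
    (a,...,a) and (-a,...,-a) with a = (4/J)**0.5.  The squared
    distance between the two class means is then J*(2a)^2 = 16,
    independent of the dimension J."""
    rng = np.random.default_rng(seed)
    a = (4.0 / J) ** 0.5
    y = rng.integers(0, 2, size=n)            # class labels, P = 1/2 each
    means = np.where(y[:, None] == 0, a, -a)  # (a,...,a) or (-a,...,-a)
    X = means + rng.standard_normal((n, J))   # add unit-covariance noise
    return X, y
```

Keeping the between-mean distance fixed at 4 while varying *J* makes the separation of the two classes comparable across dimensions.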


Fig 1. Summary of
simulation results of clustering two normals N(*a*,1) and N(-*a*,1)
with *a*=(4/*J*)^{1/2}, where *J*=20 is the number of
dimensions.

These results, as well as other simulation experiments (especially those with samples containing outliers), confirm that core-based clustering methods perform very well and are robust against outliers. Moreover, they give good results in practical applications.

**Statistical Software**

Moreover, we offer software tools: the statistical software
ClusCorr98 is written in Visual Basic for Applications. Internal and external
databases are accessed from the Excel environment; see H.-J. Mucha, H. H. Bock
(1996): *Classification and multivariate graphics: models, software and
applications.* WIAS Report No. 10, Berlin.

Various kinds of
visualisation of the data and results make it easier to understand the data and
the methods used, and facilitate the formulation of hypotheses; see, for
example, Mucha, H.-J., Simon, U., and Brüggemann, R. (2002): *Model-based
Cluster Analysis Applied to Flow Cytometry Data of Phytoplankton.* Technical
Report 5, WIAS, Berlin.

Fig 2. Fingerprint of the distance matrix of Roman bricks (extract).

**An Application**

Cluster analysis attempts to divide a set of objects into smaller, homogeneous and practically useful subsets (clusters). Objects that are similar to one another form a cluster, whereas dissimilar ones belong to different clusters. Here, similar means that the characteristics of any two objects of the same cluster should be close to each other in a well-founded sense. Once a distance measure between objects is derived, (multivariate) graphical techniques can be used to give some insights into the data under investigation.
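One such graphical technique, a fingerprint of the distance matrix, rests on computing all pairwise distances and reordering objects by cluster membership so that homogeneous clusters appear as blocks along the diagonal. A minimal sketch of the computation behind it (an assumed illustration, not the ClusCorr98 graphic itself):

```python
import numpy as np

def distance_fingerprint(X, labels):
    """Pairwise squared Euclidean distance matrix with rows and
    columns reordered by cluster label, so that tight clusters
    show up as low-distance blocks on the diagonal."""
    sq = (X ** 2).sum(axis=1)
    D = sq[:, None] + sq[None, :] - 2 * X @ X.T  # squared distances
    order = np.argsort(labels, kind="stable")    # group by cluster
    return D[np.ix_(order, order)], order
```

Rendering the reordered matrix as a grey-scale image then gives the kind of fingerprint shown in Figure 2.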

Figure 2 presents a graphical output of a distance matrix. Figure 3 shows the result of the core-based clustering of Roman bricks and tiles.

Fig 3. Principal components plot of the result of core-based clustering of 613 observations (Roman bricks).


See also the website

http://www.homepages.ucl.ac.uk/~ucakche/agdank/agdankht2012/MuchaDANKBonn.pdf

for more details and applications.

*Last change 2012-12-14 mucha@wias-berlin.de*