
WORKING PAPER
ALFRED P. SLOAN SCHOOL OF MANAGEMENT

A kth Nearest Neighbour Clustering Procedure

M. Anthony Wong and Tom Lane

WP#1213.81                                                May 1981

MASSACHUSETTS INSTITUTE OF TECHNOLOGY
50 MEMORIAL DRIVE
CAMBRIDGE, MASSACHUSETTS 02139






A kTH NEAREST NEIGHBOUR CLUSTERING PROCEDURE 

M. Anthony Wong and Tom Lane

Sloan School of Management 

Massachusetts Institute of Technology 

Cambridge, MA 02139 

SUMMARY 
Due to the lack of development in the probabilistic and statistical 
aspects of clustering research, clustering procedures are often 
regarded as heuristics generating artificial clusters from a given 
set of sample data. In this paper, a clustering procedure that is 
useful for drawing statistical inference about the underlying 
population from a random sample is developed. It is based on the 
uniformly consistent kth nearest neighbour density estimate, and is 
applicable to both case-by-variable data matrices and case-by-case 
dissimilarity matrices. The proposed clustering procedure is shown 
to be asymptotically consistent for high-density clusters in several 
dimensions, and its small-sample behavior is illustrated by 
empirical examples. A real application is also included to 
demonstrate the practical utility of this clustering method. 
Keywords: CLUSTERING PROCEDURE; HIGH-DENSITY CLUSTERS; kTH NEAREST NEIGHBOUR
DENSITY ESTIMATION; SET-CONSISTENCY.







1. INTRODUCTION 

1.1 Shortcomings of Clustering Procedures

A recent study by Blashfield and Aldenderfer (1978) shows that numerous clustering methods have been developed in the past two decades. A review of many of these techniques can be found in Cormack (1971), Anderberg (1973), Sneath and Sokal (1973), Everitt (1974), Hartigan (1975), and Spath (1980). However, hardly any of the originators of these methods have approached the clustering problem from within a theoretical framework. More often than not, the concept of a real population cluster is vague and is left undefined. Since no statistical evaluation of the sample clusters can be performed under these circumstances, the validity of the clusters obtained by these methods is always questionable. Consequently, the existing clustering procedures are often regarded as heuristics generating artificial clusters from a given set of sample data, and there is a need for clustering procedures that are useful for drawing statistical inference about the underlying population from a sample. In this paper, a clustering procedure based on the kth nearest neighbour density estimate is proposed, and it is shown to be set-consistent for high-density clusters in several dimensions. The set-consistency property of a hierarchical clustering procedure is defined next.

1.2 A Theoretical Approach to Evaluating Hierarchical Clustering Methods

In order to evaluate the sampling properties of a clustering method, it is necessary to have population clusters defined on the population probability density function from which the observations are obtained, and to have some way of judging how the sample clusters deviate from the population clusters. Let the observations x_1, x_2, ..., x_N in p-dimensional space be sampled from a population with density f, taken with respect to Lebesgue measure. Using the



high-density clustering model given in Hartigan (1975, p. 205), the true population clusters can be defined on f as follows: a high-density cluster at level f_0 in the population is defined as a maximal connected set of the form {x : f(x) ≥ f_0}. The family T of such clusters forms a tree, in that A ∈ T, B ∈ T implies either A ⊃ B, B ⊃ A, or A ∩ B = ∅. A hierarchical clustering procedure, which produces a sample clustering tree T_N on the observations x_1, ..., x_N, may then be evaluated by examining whether T_N converges to T with probability one as N approaches infinity. A clustering method (or equivalently, T_N) is said to be strongly set-consistent for high-density clusters (or T) if for any A, B ∈ T with A ∩ B = ∅,

    P{A_N ∩ B_N = ∅ as N → ∞} = 1,

where A_N and B_N are respectively the smallest clusters in the sample tree T_N containing all the sample points in A and in B. Since A ⊂ B implies A_N ⊂ B_N, this limit result means that the tree relationship in T_N converges strongly to the tree relationship in T.

Using this definition of consistency, hierarchical clustering methods can be evaluated by examining whether they are strongly set-consistent for high-density clusters. If a clustering procedure is set-consistent, the sequence of enlarging hierarchical clusters that it produces in the sample consists of groups of points lying within successively lower density contours of the underlying distribution. Hence, these sample high-density clusters are useful in indicating the number of modal regions in the population as well as in identifying their locations in the underlying space. And since it is the geometrical shape of the population density contours that determines the configuration of the sample high-density clusters, a set-consistent clustering method does not impose structure on the clusters it produces. (See Everitt (1974, Chapter 4) for some well-known clustering methods that impose a spherical structure on the clusters they produce.) On the other hand, a clustering procedure that is not set-consistent is not adaptive to the underlying distribution, and is hence not suitable for detecting high-density or "natural" clusters (see Carmichael et al., 1968).

Hartigan (1977a, 1977b, 1979) has examined the set-consistency of many of the best known hierarchical clustering methods for high-density clusters. It was shown that the complete linkage (Sorenson 1948) and average linkage (Sneath and Sokal 1973) methods are not set-consistent, while single linkage (Sneath 1957) is weakly set-consistent in one dimension but not in higher dimensions. Thus most of the relevant evaluative work under the high-density clustering model has been carried out. However, the important problem of developing clustering procedures that are set-consistent for high-density clusters has not received much attention. In Hartigan and Wong (1979) and Wong (1980), a hybrid clustering method is developed which is weakly set-consistent for high-density clusters in one dimension, and there is empirical evidence that similar consistency results hold in several dimensions. However, although the hybrid method has the advantage of being practicable for very large data sets, it is not well-suited for small samples (n < 100) and it is only applicable to case-by-variable data matrices. In this paper, a strongly set-consistent clustering procedure is developed which is applicable to both case-by-variable data matrices and case-by-case distance matrices, and its development is outlined next.

1.3 Development of the kth Nearest Neighbour Clustering Procedure 

Under the high-density clustering model, density estimates can be used to 
generate sample clusters, namely the high-density clusters defined on the 
estimates. And a clustering procedure is expected to be set-consistent for 
high-density clusters if it is based on a uniformly consistent density 



estimate. Single linkage corresponds to nearest neighbour density estimation (Hartigan 1977b), in which the density estimate f_N(x) at a point x is inversely proportional to the volume of the smallest closed sphere including one sample point. This density estimate is not consistent, in the sense that f_N(x) does not approach f(x) in probability. An improved density estimate, and perhaps improved clustering, can be obtained by the kth nearest neighbour density estimate: the estimated density at point x is f_N(x) = k/(N·V_k(x)), where V_k(x) is the volume of the closed sphere centered at x containing k sample points. Such a density estimate is uniformly consistent with probability 1 if f is uniformly continuous and if k = k(N) satisfies k(N)/N → 0 and k(N)/log N → ∞. (See, for example, Devroye and Wagner 1977, and Moore and Yackel 1977.)

Wishart (1969), in an attempt to improve on the single linkage clustering technique, developed a procedure entitled Mode Analysis which is related to the kth nearest neighbour density estimate. However, Wishart's procedure was not designed to obtain the high-density clusters defined on the density estimate, and hence its set-consistency for high-density clusters was never established. Moreover, since its computational algorithm is quite complicated, Mode Analysis did not receive much attention in the clustering literature. In this paper, a clustering algorithm for deriving the tree of sample high-density clusters from the kth nearest neighbour density estimate is developed. A detailed description of this clustering procedure is given in Section 2. In Section 3, it is established that the proposed method is strongly set-consistent for high-density clusters. Empirical examples are given in Section 4 to illustrate the small-sample behavior of kth nearest neighbour clustering. A real example is presented in Section 5 to demonstrate the practical utility of the proposed clustering method.



2. A KTH NEAREST NEIGHBOUR CLUSTERING PROCEDURE 

The proposed nearest neighbour clustering algorithm consists of two stages. At the first stage, the kth nearest neighbour density estimation procedure is used to obtain a uniformly consistent estimate of the underlying density. The tree of sample high-density clusters defined on the estimated density is computed at the second stage of the algorithm. At this latter stage, a distance matrix is first computed in which the distance between two "neighbouring" points (i.e. points with the property that at least one of them is among the k nearest neighbours of the other) is defined to be inversely proportional to a pooled density estimate at the point halfway between them, and the single linkage clustering algorithm (Sneath, 1957) is then applied to this distance matrix to obtain the tree of sample clusters.

2.1 The Density Estimation Stage 

The kth nearest neighbour density estimation procedure is used in this stage of the clustering procedure because it provides a strongly uniformly consistent estimate of the underlying density. Let x_1, ..., x_N be independent, identically distributed random vectors with values in R^p, p ≥ 1, and with a common probability density f. If V_k(x) is the volume of the smallest sphere centered at x and containing at least k of the random vectors x_1, ..., x_N, then the kth nearest neighbour density estimate of f at x is

    f_N(x) = k/(N·V_k(x)).

In Devroye and Wagner (1977), the following strong uniform consistency result for this estimate is shown:

Lemma (Devroye and Wagner, 1977): If f is uniformly continuous on R^p and if k = k(N) is a sequence of positive integers satisfying

    (a) k(N)/N → 0, and
    (b) k(N)/log N → ∞, as N → ∞,

then

    sup_x |f_N(x) − f(x)| → 0 with probability 1.

(For example, k(N) = ⌈√N⌉ satisfies both conditions.)
One purpose of the kth nearest neighbour clustering method is to discover the population high-density clusters given a random sample from some underlying distribution F with density f. In this first step of the proposed procedure, a uniformly consistent estimate of f is obtained. The high-density clusters defined on the estimated density f_N can then be used as sample estimates of the population high-density clusters defined on f. These hierarchical sample high-density clusters are constructed in the second stage of the proposed clustering algorithm.
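
As an illustration of this estimation stage, the following Python sketch computes f_N at every sample point. It is a minimal sketch assuming SciPy is available; the function name knn_density and all variable names are ours, not part of the original paper.

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.special import gamma

    def knn_density(X, k):
        # kth nearest neighbour density estimate f_N at each row of X.
        N, p = X.shape
        # The query over the sample includes each point itself at distance 0,
        # so ask for k + 1 neighbours and keep the last column as d_k(x_i);
        # treating the point itself as excluded is our assumption.
        dk = cKDTree(X).query(X, k=k + 1)[0][:, -1]
        unit_ball = np.pi ** (p / 2) / gamma(p / 2 + 1)  # volume of the unit p-sphere
        Vk = unit_ball * dk ** p                         # V_k(x_i)
        return k / (N * Vk), dk                          # f_N(x_i) and d_k(x_i)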

2.2 The Hierarchical Clustering Stage

In this stage, a distance matrix D(x_i, x_j), 1 ≤ i, j ≤ N, for the N observations is first computed using the following definitions:

Definition 1: Two observations x_i and x_j are said to be neighbours if d*(x_i, x_j) ≤ d_k(x_i) or d*(x_i, x_j) ≤ d_k(x_j), where d* is the Euclidean metric and d_k(x_i) is the kth nearest neighbour distance of the point x_i.

Definition 2: The distance D(·,·) between the observations x_i and x_j is

    D(x_i, x_j) = (1/2)[1/f_N(x_i) + 1/f_N(x_j)] = (N/2k)[V_k(x_i) + V_k(x_j)], if x_i and x_j are neighbours;
                = ∞, otherwise.

Hence, finite distances are defined only for pairs of observations which are in the same neighbourhood in R^p, and the defined distance between a pair of neighbouring observations is inversely proportional to a pooled density estimate at the point halfway between them. The following single linkage clustering technique is then applied to this distance matrix D to obtain the tree of sample high-density clusters.

Given a set of objects x_1, ..., x_N with distances D(x_i, x_j), 1 ≤ i < j ≤ N, single linkage clusters are defined as follows: let x_i and x_j be the closest pair of objects; amalgamate them to form a cluster c and define the distance between that cluster and any object x_l to be D(c, x_l) = min[D(x_i, x_l), D(x_j, x_l)]; repeat the process, treating c as an object and ignoring x_i and x_j. The amalgamation continues until all objects are grouped into one large cluster. All clusters obtained in the course of this hierarchical algorithm are single linkage clusters. (See Gower and Ross (1969), and Hartigan (1975), for computational single linkage algorithms.) Single linkage clustering is used in this step of the proposed procedure because it has the following property: at every stage of the clustering, the single linkage clusters are the maximal linked sets, if objects x_i and x_j are said to be linked whenever D(x_i, x_j) is no greater than a given distance D_0. Now, since the distance D between two "neighbouring" observations is inversely proportional to the density estimate f_N at the midpoint between them, every cluster obtained by applying single linkage to D has the property that the density estimates over the objects in this cluster are greater than a certain density level f_0. Moreover, as the distance measure D is defined only for pairs of "neighbouring" observations, the resultant single linkage clusters correspond to maximal connected sets of the form {x : f_N(x) ≥ f_0}, which are the high-density clusters defined on f_N.
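
A minimal sketch of this hierarchical stage, building on the knn_density sketch above (SciPy assumed; all names are ours, and SciPy's single linkage routine stands in for the computational algorithms of Gower and Ross (1969) and Hartigan (1975)):

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    def hierarchical_stage(X, fN, dk):
        # Single linkage on the Definition 2 distances; returns a linkage matrix.
        N = X.shape[0]
        pair = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # d*(x_i, x_j)
        neighbours = (pair <= dk[:, None]) | (pair <= dk[None, :])     # Definition 1
        D = 0.5 * (1.0 / fN[:, None] + 1.0 / fN[None, :])              # Definition 2
        # SciPy's linkage requires finite distances, so "not neighbours" is
        # encoded as a finite value larger than any defined distance.
        D = np.where(neighbours, D, 2.0 * D[neighbours].max())
        np.fill_diagonal(D, 0.0)
        return linkage(D[np.triu_indices(N, 1)], method='single')

Because single linkage always merges at the minimum inter-cluster distance, every merge among neighbouring pairs occurs before any merge at the stand-in value, so the tree below that level is the same as it would be with infinite distances.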

2.3 The Computational Algorithm

Since high-density clusters are invariant to monotone transformations of the density function, the kth nearest neighbour distances d_k(x_i), i = 1, ..., N, are used instead of the V_k(x_i)'s in the following computational algorithm of the kth nearest neighbour clustering procedure:

STEP 1: For i = 1, 2, ..., N, compute d_k(x_i), the kth nearest neighbour distance of x_i. (For a computationally efficient algorithm to find the kth nearest neighbour distances, see Friedman et al., 1975.)

STEP 2: Compute the distance matrix D as follows:

    D(x_i, x_j) = (1/2)[d_k(x_i) + d_k(x_j)], if d*(x_i, x_j) ≤ d_k(x_i) or d*(x_i, x_j) ≤ d_k(x_j), where d* is the Euclidean metric;
                = ∞, otherwise.

STEP 3: Apply the single linkage clustering algorithm to the computed distance matrix D to obtain the sample tree of high-density clusters.

The computational requirements for STEP 1 and STEP 3 are O(pN log N) and O(Nk) respectively. Hence, unlike the hybrid clustering method (Hartigan and Wong, 1979, and Wong, 1980), this procedure is not practicable for very large data sets; but it is better suited for small samples, and is applicable to both case-by-variable data matrices and case-by-case dissimilarity matrices.
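
Assembled into one routine, the three steps might look as follows. This is a sketch under the same assumptions as the earlier ones, not the authors' implementation, and kth_nn_clustering is our own name.

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.cluster.hierarchy import linkage

    def kth_nn_clustering(X, k):
        # STEPs 1-3: sample tree of high-density clusters as a linkage matrix.
        N = X.shape[0]
        # STEP 1: kth nearest neighbour distances via a k-d tree
        # (cf. Friedman et al., 1975).
        dk = cKDTree(X).query(X, k=k + 1)[0][:, -1]
        # STEP 2: D(x_i, x_j) = (1/2)[d_k(x_i) + d_k(x_j)] for neighbouring
        # pairs; a large finite value stands in for infinity elsewhere.
        pair = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        neighbours = (pair <= dk[:, None]) | (pair <= dk[None, :])
        D = np.where(neighbours, 0.5 * (dk[:, None] + dk[None, :]), 2.0 * dk.max())
        np.fill_diagonal(D, 0.0)
        # STEP 3: single linkage on D.
        return linkage(D[np.triu_indices(N, 1)], method='single')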

3. STRONG SET-CONSISTENCY OF kTH NEAREST NEIGHBOUR CLUSTERING

The asymptotic consistency of the kth nearest neighbour clustering method for high-density clusters in R^p, p ≥ 1, is given in the following theorem:

Theorem: Let f denote a positive, uniformly continuous function on R^p such that {x : f(x) ≥ f_0} is the union of a finite number of compact subsets of R^p for every f_0 > 0. Let T be the tree of population high-density clusters defined on f. Suppose that A and B are any two disjoint high-density clusters in T with connected interiors. Let x_1, ..., x_N be a random sample from f and let T_N be the hierarchical clustering specified by the kth nearest neighbour clustering algorithm. Then, provided that k = k(N) satisfies

    (a) k(N)/N → 0, and
    (b) k(N)/log N → ∞,

as N → ∞, there exist A_N, B_N ∈ T_N with A_N ⊇ A ∩ {x_1, ..., x_N}, B_N ⊇ B ∩ {x_1, ..., x_N}, and A_N ∩ B_N = ∅ with probability 1.

Proof: Since T_N is the tree of high-density clusters for f_N, this theorem is a direct consequence of the Lemma, which states that

    sup_x |f_N(x) − f(x)| → 0, with probability 1.    (3.1)

By definition, for any two disjoint high-density clusters A and B in T, there exist δ > 0, ε > 0 and λ > 0, such that

    (i) f(x) ≥ λ for all x ∈ A ∪ B, and    (3.2)

    (ii) each rectilinear path between A and B contains a segment, with length greater than δ, along which the density f(x) < λ − 3ε.    (3.3)

From (3.1), we have for N large,

    sup_x |f_N(x) − f(x)| < ε with probability 1.

Thus, it follows from (3.2) and (3.3) that for N large, with probability 1,

    (iii) f_N(x) > λ − ε for all x ∈ A ∪ B, and    (3.4)

    (iv) each rectilinear path between A and B contains a segment, with length greater than δ, along which the density estimate f_N(x) < λ − 2ε.    (3.5)

Since A and B are disjoint, it follows from (3.4) and (3.5) that high-density clusters of the form {x : f_N(x) > λ − ε} separate the observations in A and B. The theorem follows.






4. EMPIRICAL STUDY OF THE SMALL-SAMPLE BEHAVIOR OF THE kTH NEAREST NEIGHBOUR CLUSTERING PROCEDURE

To illustrate the small-sample behavior of the kth nearest neighbour clustering procedure, an empirical study was performed in which the procedure was applied to various generated data sets. Results of three experiments, in which bivariate data were used, are reported here.

1. Experiment One: 30 observations were generated so that two spherical clusters of observations are present in this data set. The scatter-plot of this sample set is shown in Figure 1a, in which the observation numbers are plotted next to the observations. This data set is useful for illustrating the effectiveness of the proposed procedure in identifying spherical clusters. The dendrogram giving the hierarchical clustering obtained by the kth nearest neighbour method (using k = 4) is shown in Figure 1b. It is clear that, in this experiment, the kth nearest neighbour clustering indicates the presence of two modal regions or clusters.

However, for a choice of k which is much too small, the dendrogram produced by the proposed method would tend to suggest the presence of a few extra modal regions. The reason for this moderate sensitivity of the proposed method to the choice of k is that, if k is too small, extra modes tend to appear in the kth nearest neighbour density estimate, and these bumps in the estimated density function are identified as modal regions in the hierarchical clustering stage of the algorithm.
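
For instance, a data set in the spirit of Experiment One could be generated and clustered as follows. The exact generating mechanism is not stated in the paper, so the cluster centres and spreads below are illustrative guesses; kth_nn_clustering is the sketch from Section 2.3.

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    # Two spherical clusters of 15 points each (centres and spread assumed).
    X = np.vstack([rng.normal(loc=(2.5, 2.5), scale=0.8, size=(15, 2)),
                   rng.normal(loc=(7.0, 7.0), scale=0.8, size=(15, 2))])
    Z = kth_nn_clustering(X, k=4)
    dendrogram(Z)   # two modal regions should dominate the tree
    plt.show()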

2. Experiment Two: 58 observations were generated so that two elongated, elliptical clusters of observations are present in this data set. The scatter-plot of this sample set is shown in Figure 2a, in which the observation numbers are plotted next to the observations. This data set is useful for illustrating the effectiveness of the proposed clustering procedure in identifying non-spherical clusters. The dendrogram giving the hierarchical clustering obtained by the kth nearest neighbour method (using k = 4) is shown in Figure 2b. Two disjoint modal regions, corresponding to the two elliptical clusters of observations shown in Figure 2a, can be identified in this dendrogram. However, observations 51, 58, and 37 form a minor modal region within one of the two clusters, and observations 22, 23, and 2 form a minor modal region in the other cluster.

3. Experiment Three: 60 observations were generated so that two spherical clusters of observations are present in this sample, connected by a chain of "noise" observations (see Figure 3a). This data set is useful for demonstrating the effectiveness of the proposed method when a moderate amount of noise is present in the sample. The hierarchical clustering obtained by the kth nearest neighbour method (using k = 4) is shown in Figure 3b. It can be seen that the two spherical clusters are recovered by the proposed method as modal regions, in spite of the presence of the noise observations.

5. A REAL EXAMPLE

In order to illustrate how the kth nearest neighbour clustering method works in practice, it is applied to the well-known Iris data given in Fisher (1936). The data consist of four characteristics for three species of Iris; the species are Iris Setosa, Iris Versicolor and Iris Virginica, and the characteristics are sepal length, sepal width, petal length, and petal width. There are fifty samples from each species, and hence the total sample size is 150. This data set has been used by many authors to test the practical utility of various clustering algorithms (e.g. Friedman and Rubin, 1967). It has been found that there are two distinct clusters of samples in this data set: one corresponding to the samples from Iris Setosa, and the other corresponding to samples from the other two species. Moreover, the samples from Iris Versicolor and Iris Virginica form a somewhat homogeneous group, and there is no clearcut distinction between samples from these two different species (see, for example, Fisher, 1936; Friedman and Rubin, 1967; and Gnanadesikan, 1977).

The hierarchical clustering obtained by applying the kth nearest neighbour method to this data set, using a value of k = 8, is shown in Figure 4. Two distinct modal regions can be identified in this sample tree of high-density clusters: one corresponding to the samples from Iris Setosa, and the other corresponding to the samples from Iris Versicolor and Iris Virginica. Moreover, within the Versicolor-Virginica modal region, there are two sub-modal regions: one consists of samples only from Iris Versicolor, while the other consists of samples only from Iris Virginica. However, it should be pointed out that if k were chosen to be much larger than 8 (say, k = 12 or 15), the two sub-modal regions would not appear in the hierarchical clustering, and only the two well-known distinct clusters would be identified by the kth nearest neighbour method.
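
The corresponding computation, using the kth_nn_clustering sketch from Section 2.3 (scikit-learn's bundled copy of the Iris data is an assumption of this sketch, not something used in the paper):

    from sklearn.datasets import load_iris
    from scipy.cluster.hierarchy import dendrogram
    import matplotlib.pyplot as plt

    X = load_iris().data            # 150 cases by 4 measurements
    Z = kth_nn_clustering(X, k=8)   # k = 8, as in the text
    dendrogram(Z)                   # compare with Figure 4
    plt.show()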



Fig. 1a. Scatter-plot of the generated bivariate sample (N = 30) used in Experiment One. Observation numbers are plotted next to the observations.



Fig. 1b. Tree of sample high-density clusters for data shown in Fig. 1a, derived from the kth nearest neighbour density estimate using k = 4.



Fig. 2a. Scatter-plot of the generated bivariate sample (N = 58) used in Experiment Two. Observation numbers are plotted next to the observations.



Fig. 2b. Tree of sample high-density clusters for data shown in Figure 2a, derived from the kth nearest neighbour density estimate using k = 4.



Fig. 3a. Scatter-plot of the generated bivariate sample (N = 60) used in Experiment Three. Observation numbers are plotted next to the observations.



Fig. 3b. Tree of sample high-density clusters for data shown in Fig. 3a, derived from the kth nearest neighbour density estimate using k = 4.



Fig. 4. Tree of sample high-density clusters for the Iris data, derived from the kth nearest neighbour density estimate using k = 8.

REFERENCES 

Anderberg, M.R. (1973). Cluster Analysis for Applications. New York: Academic Press.

Blashfield, R.K., and Aldenderfer, M.S. (1978). "The literature on cluster analysis". Multivariate Behavioral Research, 13, 271-295.

Carmichael, J.W., George, J.A., and Julius, R.S. (1968). "Finding natural clusters". Systematic Zoology, 17, 144-150.

Cormack, R.M. (1971). "A review of classification". Journal of the Royal Statistical Society, Series A, 134, 321-367.

Devroye, L.P., and Wagner, T.J. (1977). "The strong uniform consistency of nearest neighbour density estimates". Annals of Statistics, 5, 536-540.

Everitt, B.S. (1974). Cluster Analysis. Halsted Press, New York: John Wiley.

Fisher, R.A. (1936). "Use of multiple measurements in taxonomic problems". Annals of Eugenics, 7, 179-188.

Friedman, H.P., and Rubin, J. (1967). "On some invariant criteria for grouping data". Journal of the American Statistical Association, 62, 1159-1178.

Friedman, J.H., Bentley, J.L., and Finkel, R.A. (1975). "An algorithm for finding best matches in logarithmic time". SLAC-PUB-1549, Feb. 1975.

Gnanadesikan, R. (1977). Statistical Data Analysis for Multivariate Observations. New York: John Wiley & Sons.

Gower, J.C., and Ross, G.J.S. (1969). "Minimum spanning trees and single linkage cluster analysis". Applied Statistics, 18, 54-64.

Hartigan, J.A. (1975). Clustering Algorithms. New York: John Wiley & Sons.

Hartigan, J.A. (1977a). "Distributional problems in clustering", in Classification and Clustering, ed. J. Van Ryzin, New York: Academic Press.

Hartigan, J.A. (1977b). "Clusters as modes", in First International Symposium on Data Analysis and Informatics, Vol. 2, IRIA, Versailles.

Hartigan, J.A. (1979). "Consistency of single linkage for high-density clusters". Unpublished manuscript, Department of Statistics, Yale University.

Hartigan, J.A., and Wong, M.A. (1979). "Hybrid clustering". Proceedings of the 12th Interface Symposium on Computer Science and Statistics, ed. Jane Gentleman, University of Waterloo Press, pp. 137-143.

Moore, D.S., and Yackel, J.W. (1977). "Consistency properties of nearest neighbour density function estimators". Annals of Statistics, 5, 143-154.

Sneath, P.H.A. (1957). "The application of computers to taxonomy". Journal of General Microbiology, 17, 201-226.

Sneath, P.H.A., and Sokal, R.R. (1973). Numerical Taxonomy. San Francisco: W.H. Freeman.

Sorenson, T. (1948). "A method of estimating groups of equal amplitude in plant sociology based on similarity of species content". K. Danske Vidensk. Selsk. Skr. (Biol.), 5, 1-34.

Spath, H. (1980). Cluster Analysis Algorithms. Halsted Press, New York: John Wiley.

Wishart, D. (1969). "Mode Analysis", in Numerical Taxonomy, ed. A.J. Cole, New York: Academic Press.

Wong, M.A. (1980). "A hybrid clustering method for identifying high-density clusters". Working Paper #2001-80, Sloan School of Management, Massachusetts Institute of Technology.


