Faculty Working Papers

INTERTECHNIQUE CROSS-VALIDATION IN CLUSTER ANALYSIS

A. Marvin Roscoe, Jr.

Jagdish N. Sheth and

Welling Howell

#208

College of Commerce and Business Administration

University of Illinois at Urbana-Champaign

September 25, 1974

Intertechnique Cross-Validation in Cluster Analysis

A. MARVIN ROSCOE, JR., JAGDISH N. SHETH, and WELLING HOWELL *

Clustering methods are often used in marketing research to define homogeneous market segments, and such studies should establish that the derived clusters represent actual clusters. However, replication or external validation is not always practical. An alternative procedure, cross-validation using intertechnique comparisons, is described in a study of geographical market heterogeneity for the telephone industry.

* A. Marvin Roscoe, Jr. and Welling Howell are Marketing Supervisors in the Market Research Section of the Marketing Department of the AT&T Company. Jagdish N. Sheth is I.B.A. Distinguished Professor and Research Professor at the University of Illinois, Urbana-Champaign.

Cross-validation among techniques seems essential in cluster analysis because most clustering methods tend to be heuristic algorithms instead of analytically optimal solutions. (See Joyce and Channon [6] and Frank and Green [2] for a review of the numerous clustering methods available today.) As heuristic algorithms, they have no sampling theory for statistical inferences about the size and the number of clusters. Also, there are no external validation procedures to ensure that the clusters derived from a specific cluster analysis are in reality the true invariant clusters. The potential statistical problem of obtaining artifacts as clusters is further compounded in some procedures which require a priori assumptions about the size and the number of clusters. Although a number of clustering methods perform statistical tests such as the F ratio or Wilks' Lambda, based on analysis of variance principles, to guard against obtaining random solutions, no procedure exists which will increase the assurance that a nonrandom cluster solution is in fact the true cluster solution.

Because clustering methods are used in marketing research to identify homogeneous market segments for selective marketing efforts, it is critical that the clusters derived from a heuristic algorithm are the true clusters. One procedure to ensure cluster invariance is replication, which, however, is not always practical. Another procedure is the common practice in psychometrics of cross-validating the results by external validation. Surprisingly, there are very few studies in which cross-validation has been utilized to ensure that the derived clusters are indeed invariant. Although several studies have pointed out the dramatic changes in cluster structures as a function of data input [4, 8], there seems to be, to our knowledge, only one published study which has examined the question of intertechnique validation of clusters [3].

The objective of this paper is to describe a cross-validation procedure which utilizes intertechnique comparisons of the clustering results. Although the actual study entailed applications of five different clustering techniques, our discussion is limited to two techniques in this paper due to space limitations. A brief description is provided of the large-scale research project, in which the clustering results were essential to formulating an experimental design for a field experiment.

DESCRIPTION OF THE STUDY

The major research study consisted of a three-factor, 6** cell experiment on survey research methods. The three factors were: first, two different lengths of the questionnaire; second, four different follow-up procedures; and third, the market heterogeneity of geographical areas of the United States with respect to consumer telephone behavior and socioeconomic-demographic characteristics (see [9]). The levels of the first two factors were predetermined based on theory, prior research, and practical implications for the ongoing research on a longitudinal, national panel of telephone customers. For the third factor, it was necessary to determine the heterogeneity of the markets by empirical research which utilized clustering methods.

To define the market heterogeneity, profile data on 30,000 residential telephone customers were used for clustering. These customers are part of a longitudinal consumer panel called the Marketing Research Information System, which is maintained for the Bell System by AT&T. The panel members are selected based on a multistage stratified sample in which the first stage of the sampling procedure consists of 100 Revenue Accounting Offices (RAOs) representing the entire Bell System. The profile consists of essentially three types of information about each panel member: (a) his socioeconomic-demographic status and housing characteristics, determined by a survey conducted in early 1970 and matched with the 1970 Census, (b) his monthly telephone behavior, broken down into several categories as determined by industry practice, and (c) an inventory of his telephone equipment, including number and types of telephones, and additional services.

Since it was required to empirically investigate the geographical heterogeneity of the markets, an average profile of the residential telephone customers was determined for each of the 86 RAOs for which detailed and complete information was available.

A total of 65 customer descriptors were used to represent the total profile of customers. A list of the variables is shown in Table 1. A factor analysis (principal components) solution with orthogonal Varimax rotation was performed on the data for the following reasons: (a) to reduce the multicollinearity among variables so that the profile consisted of orthogonal factor scores, which are geometrically essential to calculate Euclidean distances, (b) to equalize the relative weights of each of the underlying dimensions, which could otherwise be easily changed by arbitrary dropping or adding of profile variables, and (c) to standardize the diverse scales of measurement common across the socioeconomic, demographic and telephone information [7]. Ten significant factors were extracted from the analysis, which summarized 92 percent of the total variance. A brief description of the factors is provided in Table 2.

The number of significant factors was determined using several criteria, both statistical and judgmental, following the recommendations of Rummel [10]. In addition, the stability of the factor structure was also determined by comparing the results with other data analyses to ensure the invariance of the fundamental dimensionality and structure of the profile data.

The standardized rotated factor scores for each RAO were then utilized to compute Euclidean distances between all pairs of RAOs. The resultant 86 × 86 distance matrix became the input to the clustering procedures.
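The distance computation described above can be sketched as follows. This is only an illustrative sketch: the function name `distance_matrix` and the three sets of factor scores are hypothetical and are not data from the study, where each RAO would be represented by its ten standardized factor scores.

```python
import math

def distance_matrix(scores):
    """Return the symmetric matrix of Euclidean distances between rows of `scores`."""
    n = len(scores)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(scores[i], scores[j])))
            dist[i][j] = dist[j][i] = d
    return dist

# Hypothetical factor scores for three areas on two factors (not study data).
example_scores = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
example_dist = distance_matrix(example_scores)
```

With 86 entities the same function would yield the 86 × 86 matrix that serves as clustering input.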

Johnson's Hierarchical Clustering method [5] was chosen as the primary clustering technique for determining the market heterogeneity because of the following distinct advantages. First, it is strictly empirical; second, no prior assumptions are required on the part of the researcher; and third, a hierarchical display is provided of the clusters being formed based on a function minimizing the pairwise distances among entities. While the size of the distance matrix is a limitation of the technique, it was not a problem in our case because of the relatively small number of RAOs to be clustered. Due to the structure of the distance matrix and the presumption of the "ultrametric inequality" [5, pp. 248-9], the diameter method was chosen instead of the connectedness method in the BE-HICLUST solutions. The results are diagrammed in Figure 1.
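The difference between the diameter and connectedness methods can be illustrated with a minimal sketch. It assumes the usual reading of the two rules (diameter: largest pairwise distance between two clusters; connectedness: smallest pairwise distance) and is not the BE-HICLUST program itself. The function name and the four-entity distance matrix are hypothetical.

```python
def hierarchical_merge_distances(dist, method="diameter"):
    """Agglomerate entities from a distance matrix; return (members, distance) per merge.

    method="diameter" scores a pair of clusters by their largest pairwise
    distance; method="connectedness" scores it by the smallest.
    """
    combine = max if method == "diameter" else min
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = combine(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        merges.append((merged, d))
    return merges

# Hypothetical distances among four entities lying on a line at 0, 1, 3 and 7.
points = [0.0, 1.0, 3.0, 7.0]
toy = [[abs(p - q) for q in points] for p in points]
diameter_merges = hierarchical_merge_distances(toy, "diameter")
connected_merges = hierarchical_merge_distances(toy, "connectedness")
```

In this toy case both rules join the entities in the same order, but the diameter method records the later links at larger distances, so the two rules can suggest different cutoff points.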

While the hierarchical clusters from HICLUST were meaningful and had strong face validity, it was necessary to cross-validate the results by at least one other technique which was essentially similar in its input requirements, analytic strategies and output format. For this we chose the cluster analysis program developed as part of the BMDP Series, which is also a hierarchical clustering routine based on sum of squares distances and the amalgamation principle [1]. In short, BMDP2M amalgamates entities based on the criterion of the smallest distance. Once a cluster is formed, consisting of at least two entities, it calculates the average profile of the cluster and treats it as if it were a new entity, which is then clustered with other entities or clusters based on the principle of smallest distances. The process continues until all entities and clusters are hierarchically linked at different levels of distance. The results of the BMDP2M analysis are diagrammed in Figure 2.
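The amalgamation principle just described can be sketched in a few lines. This is an illustration of the general idea only, not the BMDP2M program: the use of plain Euclidean distance between average profiles and the member-weighted averaging are assumptions, and the function name and data are hypothetical.

```python
import math

def centroid_amalgamation(profiles):
    """Repeatedly join the two closest entities or clusters, replacing them
    by their average profile. Returns (members, distance) in linkage order."""
    # Each active cluster is stored as (member indices, average profile).
    clusters = [((i,), list(p)) for i, p in enumerate(profiles)]
    links = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = math.dist(clusters[a][1], clusters[b][1])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        (mem_a, cen_a), (mem_b, cen_b) = clusters[a], clusters[b]
        members = tuple(sorted(mem_a + mem_b))
        # Average profile of the new cluster, weighted by the number of members.
        centroid = [
            (len(mem_a) * x + len(mem_b) * y) / len(members)
            for x, y in zip(cen_a, cen_b)
        ]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append((members, centroid))
        links.append((members, d))
    return links

# Hypothetical one-dimensional profiles (not study data).
toy_links = centroid_amalgamation([[0.0], [2.0], [7.0]])
```

Here the first two entities join at distance 2.0, their average profile becomes 1.0, and the third entity is then linked to that average at distance 6.0.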

As can be seen, the two hierarchical clusterings are similar in their structure and hierarchy, suggesting that there is good cross-validation between the two analyses. In order to quantitatively assess the degree of congruence between the two hierarchical clusterings, two distinct statistical procedures were utilized. The first procedure consisted of calculating the correlation coefficient for the two distributions of distances at which linkages were made between entities or clusters in each hierarchical analysis. Since the number of linkages is not likely to be identical, we selected the maximum number of links of one technique and the corresponding number of the other technique. The correlation coefficient between the sequential linkage distances is 0.994, which is highly positive and indicates extreme closeness of the hierarchical structure of the two cluster analyses.
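A minimal sketch of this first index is given below. It computes an ordinary Pearson correlation between two sequences of linkage distances. Cutting both sequences to the shorter length is one possible reading of the matching rule in the text and is an assumption, as are the function name and the illustrative distances.

```python
import math

def linkage_correlation(dist_a, dist_b):
    """Pearson correlation between two sequences of linkage distances.

    If the sequences differ in length, both are cut to the shorter length
    (an assumed reading of the matching rule described in the text).
    """
    n = min(len(dist_a), len(dist_b))
    x, y = list(dist_a[:n]), list(dist_b[:n])
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical linkage distances from two techniques (not study data).
r = linkage_correlation([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0, 9.5])
```

Because the correlation is insensitive to the scale of the distances, two techniques that link at proportionally spaced distances score near 1 even when their absolute distances differ, as in this toy case.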

Another procedure for cross-validation consisted of examining the clusters developed at some specific levels of distances. Based on the plotting of distances at which linkages were made, for the BE-HICLUST results a distance of 5.0 was indicated as a cutoff point due to the natural break in the curve suggesting a clear truncation.

The linkages for the BMDP2M results were also plotted, and the natural break in the linkages occurred at 3.1. This was the point at which all the clusters had been formed. After this point the BMDP2M analysis indicated 15 unique entities that were not identified with any of the defined clusters. In order to produce comparable results, the cutoff point for the BE-HICLUST diagram was moved to 3.5 for the cross-validation. The clusters could be identified by their geographical orientation and have been labeled Eastern, Southern, Central and Western. Metropolitan has been used for large urban areas not specifically associated with regional areas. The clusters derived from the two techniques are marked in Figures 1 and 2 and are listed in Table 3.
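Reading clusters off a hierarchy at a chosen cutoff distance can be sketched as follows. The sketch assumes the linkage history is available as (members, distance) pairs. The function name and the five-entity history are hypothetical and are not taken from the study.

```python
def clusters_at_cutoff(n_entities, links, cutoff):
    """Split entities into clusters using only links made at or below `cutoff`.

    `links` is a list of (members, distance) tuples in linkage order. Returns
    (clusters, unique), where `unique` holds entities left unlinked at the cutoff.
    """
    groups = {i: {i} for i in range(n_entities)}  # entity -> its current group
    for members, distance in links:
        if distance > cutoff:
            continue
        merged = set()
        for m in members:
            merged |= groups[m]
        for m in merged:
            groups[m] = merged
    distinct = {frozenset(g) for g in groups.values()}
    clusters = sorted(sorted(g) for g in distinct if len(g) > 1)
    unique = sorted(next(iter(g)) for g in distinct if len(g) == 1)
    return clusters, unique

# Hypothetical linkage history for five entities (not study data).
toy_links = [((0, 1), 1.0), ((2, 3), 1.5), ((0, 1, 2, 3), 4.0), ((0, 1, 2, 3, 4), 9.0)]
toy_clusters, toy_unique = clusters_at_cutoff(5, toy_links, cutoff=3.5)
```

Raising the cutoff in this toy history from 3.5 to 5.0 merges the two clusters into one, which shows how the choice of cutoff governs both the number of clusters and the number of unique entities.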

A total of 17 clusters are displayed in Table 4, consisting of 13 regional clusters (Eastern, Southern, Central and Western), three metropolitan city clusters, and a final one representing all the unique RAOs which could not be clustered due to their extreme distances from other RAOs. The cross-tabulation between the HICLUST and BMDP2M clustering results indicates that 62 out of 86 RAOs fell on the diagonal of the crosstab matrix, which represents a hit of 72 percent correct classifications in terms of intertechnique results. Furthermore, most of the off-diagonal elements fall across clusters within the same geographical region. In Table 5, a cross-tabulation at the regional level is provided which shows that 75 out of 86 RAOs could be correctly classified on an intertechnique basis. This represents a hit of 87 percent.
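The second index, the diagonal hit rate of the cross-tabulation, can be sketched as below. The sketch assumes that the clusters from both techniques have already been given matching labels, so that agreement falls on the diagonal. The function name and the six example assignments are hypothetical, while the final two lines simply reproduce the percentages reported above from the stated counts.

```python
from collections import Counter

def crosstab_hit_rate(labels_a, labels_b):
    """Cross-tabulate two cluster assignments and return (table, hit rate).

    Assumes both techniques' clusters carry matching labels, so that
    agreement corresponds to the diagonal of the cross-tabulation.
    """
    table = Counter(zip(labels_a, labels_b))
    hits = sum(count for (a, b), count in table.items() if a == b)
    return table, hits / len(labels_a)

# Hypothetical assignments for six areas (not study data).
first = ["East", "East", "South", "West", "West", "Unique"]
second = ["East", "South", "South", "West", "West", "West"]
table, rate = crosstab_hit_rate(first, second)

# The hit rates reported in the text follow from the diagonal counts.
cluster_level = round(100 * 62 / 86)   # 72
regional_level = round(100 * 75 / 86)  # 87
```

Collapsing cluster labels to regions before cross-tabulating can only keep or raise the hit rate, which is consistent with the move from 72 to 87 percent.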

While the two results are quite comparable, there are differences in the example worth noting. The BE-HICLUST algorithm appears to provide a more logical structure to the clusters, which are grouped by region as indicated in Figure 2. In addition, the BE-HICLUST method seems to work better where large distances are involved, associating 8 of the 14 unique entities with meaningful clusters. Such differences reinforce the need to use several techniques and to understand the advantages of each, especially where the researcher's judgement plays such an important role.

SUMMARY AND CONCLUSIONS

We have pointed out the need for intertechnique cross-validation in cluster analysis due to the heuristic nature of most clustering procedures and the subjective judgements required to interpret the results. In this paper, we have also presented a concrete application of two statistical procedures which enable the researcher to quantitatively measure the congruence of structure and content of clusters across techniques. The first consists of a correlation coefficient index calculated on the distributions of distances at which sequential linkages are made among entities or clusters or both. The second consists of a cross-tabulation of specific clusters derived across two different solutions.

In this paper, the intertechnique cross-validation procedures have been applied with respect to two hierarchical clustering procedures in which the problem was the determination of geographical heterogeneity of markets for the telephone industry. This application considered the general housing and population characteristics along with a complete profile of telephone behavior. However, other uses of the intertechnique cross-validation procedure have been made by the authors for a variety of telephone behavior and markets.

Table 1

LIST OF VARIABLES

Housing
1. Own-rent home
2. Type of residence
3. Number of rooms

Mobility
4. Length of residence

Head of Household
5. Sex
6. Age
7. Education
8. Occupation

Family
9. Income
10. Number in family
11. Average age
12. Life cycle
13. SES status

Telephone Service and Equipment
14. Class of service
15. Grade of service
16. Number of telephones
17. Number of vertical services

Billing Items (12 months)
18-29. Local service
30-41. Local message
42-53. Intrastate long distance
54-65. Interstate long distance

Table 2

FACTOR DIMENSION LABELS

1. Local service billing
2. Local message billing
3. Intrastate long distance
4. Family - housing
5. Interstate long distance
6. Life cycle
7. Service and equipment
8. Interstate long distance 1 *
9. Interstate long distance 2 *
10. Socioeconomic characteristics

* The two factors for interstate long distance represent different seasonal patterns of calling across geographical areas.

[Figure 1 (BE-HICLUST hierarchical clustering diagram), Figure 2 (BMDP2M hierarchical clustering diagram), and Tables 3 through 5 (cluster listings and intertechnique cross-tabulations) appear here in the original; their contents are not recoverable from this copy.]

REFERENCES

1. Dixon, W.J. "BMD P Series Documentation," Health Sciences Computing Facility. Los Angeles: University of California, 1971.

2. Frank, Ronald E. and Paul E. Green. "Numerical Taxonomy in Marketing Analysis: A Review Article," Journal of Marketing Research, 5 (February 1968), 83-98.

3. Golob, Thomas F., Eugene T. Canty, and Richard L. Gustafson. "Classification of Metropolitan Areas for the Study of New Systems of Arterial Transportation," Paper presented at the 1972 Annual Meeting of the Transportation Research Forum, Denver, Colorado, November 8-10, 1972.

4. Green, Paul E., Ronald E. Frank, and Patrick J. Robinson. "Cluster Analysis in Test Market Selection," Management Science, 13 (April 1967), 387-400.

5. Johnson, Stephen C. "Hierarchical Clustering Schemes," Psychometrika, 32 (September 1967), 241-54.

6. Joyce, Timothy and C. Channon. "Classifying Market Segment Respondents," Applied Statistics, 15 (November 1966), 191-215.

7. Morrison, Donald G. "Measurement Problems in Cluster Analysis," Management Science, 13 (August 1967), B775-80.

8. Neidell, Lester. "Comments on Typology and Cluster Analysis," Paper presented at the AMA Workshop on Multivariate Methods in Marketing, Chicago, Illinois, January 1970.

9. Roscoe, A. Marvin, Dorothy Lang, and Jagdish N. Sheth. "Experimental Effects of Follow-up Methods, Questionnaire Length, and Market Heterogeneity in Mail Surveys," Manuscript submitted for publication, 1974.

10. Rummel, R.J. Applied Factor Analysis. Evanston: Northwestern University Press, 1970, Chapter 15.
