

Approved for public release; distribution is unlimited.

KNOWLEDGE DISCOVERY 

USING 
GENETIC PROGRAMMING 

by

Steven L. Smith
Lieutenant, United States Navy
M.B.A., University of California, Los Angeles, 147b

and

Mohammed A. Al-Mahmood
Captain, Bahrain Defense Force
B.S.E., University of Petroleum and Minerals, Saudi Arabia, 1985

Submitted in partial fulfillment
of the requirements for the degree of

MASTER OF SCIENCE IN INFORMATION TECHNOLOGY MANAGEMENT

from the

NAVAL POSTGRADUATE SCHOOL

December 1993



REPORT DOCUMENTATION PAGE (Form Approved OMB No. 0704-0188)

Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington DC 20503.

1. AGENCY USE ONLY (Leave blank)

2. REPORT DATE: 16 December 1993

3. REPORT TYPE AND DATES COVERED: Master's Thesis

4. TITLE AND SUBTITLE: KNOWLEDGE DISCOVERY USING GENETIC PROGRAMMING

5. FUNDING NUMBERS

6. AUTHOR(S): Al-Mahmood, Mohammed A. and Smith, Steven L.

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Naval Postgraduate School, Monterey CA 93943-5000

8. PERFORMING ORGANIZATION REPORT NUMBER

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)

10. SPONSORING/MONITORING AGENCY REPORT NUMBER

11. SUPPLEMENTARY NOTES: The views expressed in this thesis are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

12a. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release; distribution is unlimited.

12b. DISTRIBUTION CODE: *A

13. ABSTRACT

Dramatic growth in database technology has outpaced the ability to analyze the information stored in databases for new knowledge and has created an increasing potential for the loss of undiscovered knowledge. The potential gains from such knowledge discovery are particularly large in the Department of Defense, where millions of transactions, from maintenance to medical information, are recorded yearly. Due to the limitations of traditional knowledge discovery methods in analyzing this data, there is a growing need for new knowledge discovery methods to glean knowledge from vast databases. This research evaluates a new knowledge discovery approach using a genetic program (GP) developed at the Naval Postgraduate School that produces data associations expressed as IF X THEN Y rules. To determine the validity of this GP approach, the program is compared to traditional statistical and inductive methods of knowledge discovery. The results of this comparison indicate the viability of a GP approach to knowledge discovery through three findings. First, the GP discovered interesting patterns in the data set. Second, the GP discovered new relationships not uncovered by the traditional methods. Third, the GP demonstrated a greater ability to focus the knowledge discovery search toward particular relationships, such as producing exact or general rules.

14. SUBJECT TERMS: Genetic Programming, Knowledge Discovery, Datamining, IDIS, Genetic Algorithms, Inductive Learning, Inductive Rules

15. NUMBER OF PAGES: 85

16. PRICE CODE

17. SECURITY CLASSIFICATION OF REPORT: Unclassified

18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified

19. SECURITY CLASSIFICATION OF ABSTRACT: Unclassified

20. LIMITATION OF ABSTRACT: UL

NSN 7540-01-280-5500
Standard Form 298 (Rev. 2-89), Prescribed by ANSI Std. 239-18



ABSTRACT 

Dramatic growth in database technology has outpaced the ability to analyze the information
stored in databases for new knowledge and has created an increasing potential for the loss of
undiscovered knowledge. The potential gains from such knowledge discovery are particularly large
in the Department of Defense, where millions of transactions, from maintenance to medical
information, are recorded yearly. Due to the limitations of traditional knowledge discovery methods
in analyzing this data, there is a growing need for new knowledge discovery methods to glean
knowledge from vast databases.

This research evaluates a new knowledge discovery approach using a genetic program (GP)
developed at the Naval Postgraduate School that produces data associations expressed as IF X THEN
Y rules. To determine the validity of this GP approach, the program is compared to traditional
statistical and inductive methods of knowledge discovery.

The results of this comparison indicate the viability of a GP approach to knowledge
discovery through three findings. First, the GP discovered interesting patterns in the data set.
Second, the GP discovered new relationships not uncovered by the traditional methods. Third, the
GP demonstrated a greater ability to focus the knowledge discovery search toward particular
relationships, such as producing exact or general rules.






TABLE OF CONTENTS 



I INTRODUCTION

   A. BACKGROUND
   B. OBJECTIVES
   C. RESEARCH METHODS
   D. THESIS ORGANIZATION

II KNOWLEDGE DISCOVERY

   A. BACKGROUND
   B. WHAT IS KNOWLEDGE DISCOVERY?
      1. Patterns
         a. Pattern Representation
      2. Certainty
      3. Interesting
      4. Efficiency

III EVOLUTION OF KNOWLEDGE DISCOVERY SYSTEMS

   A. STATISTICAL APPROACHES
      1. Statistical Association Methods
   B. INDUCTIVE LEARNING
      1. CLS and ID3
      2. AQ Algorithms
      3. IXL/IDIS
   C. SUMMARY

IV GENETIC ALGORITHMS / GENETIC PROGRAMMING

   A. GENETIC ALGORITHMS
   B. SAMUEL: A RULE GENERATING GENETIC ALGORITHM
   C. GENETIC PROGRAMMING

V NPS GENETIC PROGRAM (NPSGP)

   A. BACKGROUND
      1. Data Description
   B. CONTROL PARAMETERS
      1. Fitness Measure
      2. Selection Criteria
      3. Population and Generation
   C. EVOLUTION OF RULES
   D. AN EXAMPLE OF THE GP PROCESS
   E. GP EVALUATION RUNS
   F. SUMMARY

VI COMPARISON OF KNOWLEDGE DISCOVERY RESULTS

   A. STATISTICAL ASSOCIATIONS
      1. Results
   B. IDIS
      1. IDIS Results
   C. NPSGP
      1. NPSGP Parameters
      2. Fitness Functions
      3. Fitness Function Results
         a. Example A
         b. Example B
         c. Example C
      4. Function Set Results
         a. Example D
   D. COMPARISON OF RESULTS

VII CONCLUSIONS AND RECOMMENDATIONS

   A. CONCLUSIONS
   B. FUTURE RESEARCH
      1. Fitness Function Applications
      2. Different Control Parameters
      3. Develop Measurements of Rule Fit
      4. Database Objects
   C. FUTURE DOD APPLICATIONS

APPENDIX A TEST DATA

APPENDIX B IDIS PARAMETERS

APPENDIX C NPSGP USER'S GUIDE

LIST OF REFERENCES

INITIAL DISTRIBUTION LIST



I INTRODUCTION 

A. BACKGROUND 

In his book POWERSHIFT, Alvin Toffler sees the acquisition of "knowledge" as the leverage behind the changing political and social structures in today's technologically oriented world. Industrial knowledge, for example, has been used to gain competitive advantage, allowing increased efficiency and the ability to underprice competitors. Similarly, the cornerstone of the United Nations Coalition's success in Desert Storm was the absolute knowledge advantage demonstrated through technological, tactical, and training superiority over Iraqi military forces.

Historically, the acquisition of knowledge has been a time- and labor-intensive undertaking. But leveraged by computer technology, we now live in an era where the velocity of social, economic, and political change has accelerated dramatically. To keep pace with and understand these changes, there is a real need to develop new, automated methods of discovering knowledge that circumvent the traditional time-consuming methods.

In the Department of Defense, for example, there are vast databases composed of millions of computerized files recording logistic and maintenance actions. More specifically, at the Naval Sea Logistics Center (NAVSEALOGCEN) in Mechanicsburg, PA, the Navy's Material Maintenance Management (3M) system has recorded millions of shipboard maintenance actions in a vast database. These recorded transactions give detailed information regarding maintenance actions, such as the types of equipment and system failures, required parts, and action codes representing the steps taken to correct the problem. Across the street on the same Navy base in Mechanicsburg, the Navy's Ships Parts Control Center (SPCC) has created a vast database of repair parts usage data. This database contains information on shipboard parts usage, including historical demand, manufacturer, quality assurance codes, etc.

These databases offer a potential harvest of new knowledge for future applications of knowledge discovery systems. For example, by using both the 3M and SPCC databases, researchers could develop associations between the types of equipment failures, repair parts usage, and mean times between failures to develop predictive associations for the next failure. Instead of waiting for equipment casualties, maintenance personnel could use this information to replace parts at high risk of failure.

This thesis reviews different computer-aided methods for the acquisition of knowledge. First, it discusses the more traditional methods of developing deductive and inductive associations between data leading to the discovery of knowledge. After discussing the limitations of these approaches, the thesis demonstrates that a newer, nontraditional method of knowledge discovery, the genetic program, can be used successfully as an alternative approach to knowledge discovery.

B. OBJECTIVES 

The objective of this research is to determine whether Genetic Programming offers insight into relations among data elements not readily discovered by traditional methods. The ability to develop knowledge is represented by the capability to generate meaningful associations among elements of a database and to express these relationships as patterns in the form of rules. This thesis focuses on inductive methods of learning relationships among attributes from a data set and formalizing these relationships into heuristic rules.

The specific objective is to compare the ability to produce rules from a database using genetic programming with that of traditional statistical and inductive methods.

C. RESEARCH METHODS 

The focus of the research is to produce rules from quantitative data using a genetic program developed at the Naval Postgraduate School (NPS). The GP system initially produces random data associations (rules) and then evolves these rules through an evolutionary adaptation and selection process. The same data set was analyzed using traditional statistical methods and a commercial software program using deductive/inductive methods to develop rule associations. The three techniques are compared according to the types of data associations they produce, rather than by assessing the value of the information produced.

D. THESIS ORGANIZATION 

This document is structured to cover a variety of key concepts prior to comparing the rules generated from the NPS genetic program and other statistical/induction/deduction based methods. First, the general concepts behind knowledge discovery are discussed in Chapter II. The evolution of knowledge discovery systems is covered in Chapter III with a review of the apparent strengths and weaknesses of these traditional systems. Chapter IV discusses the field of genetic algorithms and genetic programs. Chapter V reviews the GP developed at the Naval Postgraduate School for this research, and Chapter VI lays out the results of the data analysis of the different approaches. Chapter VII concludes the research by comparing the rules generated and offers suggestions for the future applicability of rule generating genetic programs.



II KNOWLEDGE DISCOVERY 

A. BACKGROUND 

The recent proliferation of inexpensive computer memory has caused extensive growth in the size and amount of existing data storage capacity. One estimate states that the amount of stored information in the world doubles about every two years [Ref. 1:p. 2]. This vast storehouse of unanalyzed data in turn has created a surge in demand for automated methods of sifting through and compressing this raw data into meaningful information, i.e., useful "knowledge".

In addition to the dramatic increases in mass storage capacity for digitized information, there are three other major technologies affecting the direction of information technology and the information infrastructure that supports the foundation for knowledge discovery: on-line databases, networking, and the digitization of information [Ref. 2:p. 18].

On-line databases are a major technological advancement playing a role in knowledge discovery. Historically, man's knowledge was archived in the form of books and drawings. Digitizing this information offers tremendous advantages in accessibility over printed textual information. On-line databases represent a vast and vital storehouse of information that facilitates the necessary functions of data access and retrieval.

Networking, the connection of computer systems through local and wide area networks, also plays a vital role in developing future knowledge discovery systems. Openness, connectivity, data sharing, and interoperability all depend on the ability to transfer data to remote sites almost instantly.

Finally, most of the world's current data is not in a digitized medium. The challenge is to translate vast amounts of accumulated knowledge into binary formats for future knowledge discovery. The current method for translating this historical information is optical scanning. Although a relatively new technology, optical scanning represents a potential bridge between the written past and current data storage techniques.

Before going further into the methods of knowledge discovery, the salient characteristics of knowledge will be reviewed to gain insight into the nature of knowledge as opposed to unsubstantiated or meaningless associations.

B. WHAT IS KNOWLEDGE DISCOVERY? 

Knowledge discovery is the nontrivial extraction of
implicit, previously unknown, and potentially useful
information from data. [Ref. 1:p. 3]

If this definition seems somewhat vague, that is because it strikes at the core difficulty of defining knowledge. What is trivial to one person is vital to another. What is unknown to one may be common knowledge to others. The problem is the lack of a single defining concept of knowledge against which to determine whether knowledge has actually been discovered. The approach used in this thesis is to envelop the concept of "knowledge" by defining its attributes. The attributes used to surround the concept of knowledge are discussed in detail in the following sections. [Ref. 1:p. 3]

1. Patterns 

One way to look at patterns is as a collection or
class of records sharing something in common...
Discovering pattern classes is a problem of
pattern identification or clustering. [Ref. 1:p. 15]

The first attribute of knowledge and knowledge discovery is the concept of pattern recognition. Frawley, Piatetsky-Shapiro, and Matheus point out two basic approaches to knowledge discovery through pattern recognition: traditional numeric methods and conceptual clustering [Ref. 1:p. 15]. These methods of discovering "patterns" in databases either use statistical correlations or develop heuristic rules, depending on the type of knowledge the researcher is attempting to find. An example of statistical correlation is the regression model, where the dependent variable (output) is calculated from a weighted combination of independent (input) variables. Most regression techniques are parametric; they require the user to specify the functional form of the solution. If the underlying form of the function is unknown, parametric methods tend to break down. As an illustration, a sample database of financial information can be analyzed to determine the correct combination of attributes that maximizes income, I, given constraints C_1...C_n; this problem lends itself to a statistical optimizing approach. The results of this analysis are often in the following form:

    INCOME = a*Var_1 + b*Var_2 + ... + c*Var_n - d*C_1 - ... - d*C_n

(where INCOME is the dependent variable, a...d are constants, Var_1...Var_n are income independent variables, and C_1...C_n are constraint (cost) independent variables.)

Such approaches are termed compensatory. They are based upon the assumption that trade-offs between relevant attributes will maximize (or minimize) the overall evaluation [Ref. 3:p. 217]. There are, however, two basic concerns with this approach, despite evidence of strong predictive performance. The first concern is that compensatory methods may not be good models for developing data associations, particularly when dealing with noncompensatory associations. The second concern is that numerical coefficients provide limited insight into relationships among the other attributes [Ref. 3:p. 218].



Conceptual clustering, the alternative approach, works with both nominal and structured data. This is a definite advantage over linear models when analyzing databases. Coupled with logical representations, such as production systems (for example, IF X THEN Y), clustering offers representations that overcome the restrictive nature of compensatory statistics. For example, using the same financial database to identify inductive relationships results in clustering the data into groups and measuring the validity of these associations by hypothesis testing. From these methods heuristic rules are produced to explain data associations in production contexts:



IF
    a_1 < Var_1 < a_2
AND
    b_1 < C_2 < b_2
THEN
    INCOME = HIGH

(INCOME is the dependent variable, Var_1...Var_n and constraints C_1...C_n are the independent variables, and a_1, a_2, b_1, b_2 are constants.)

a. Pattern Representation 

Once patterns have been recognized, they must be represented. The intent is to convey statistical associations to a variety of users who may or may not have a basic understanding of statistics. Therefore, these patterns should use standard representations that communicate complex associations. Logical representations are more natural than statistical representations for computation and can be used in natural language forms. Common logic representations include production rules (mentioned earlier), relational patterns (X > Y), and decision trees (equivalent to ordered lists of rules).

Because a combination of logical formats and natural language is easier to interpret than complex equations, this research focuses primarily on the inductive style of clustering patterns and presenting them in terms of production rules. The use of the production system formalism is an important advantage since it is apparently consistent with human reasoning and therefore more easily understood. [Ref. 3:p. 218]
2. Certainty 

Rarely is knowledge absolute, particularly when dealing with data containing inexact and missing elements. To overcome noisy and inexact data, knowledge researchers need quantitative measures indicating the level of trust they can place in the knowledge developed by database discovery programs. Without the ability to attach a subjective level of faith or a quantitative measure of confidence, patterns discovered from databases remain merely suppositions. Therefore, they never achieve the status of knowledge.






Computing a measure of certainty involves not only the integrity of the data analyzed but also the data sample size. Representing uncertainty requires developing probabilistic information about the generated rules. One common technique is to augment logical expressions with probabilistic weights indicating probabilities of success. Simple contingency tables can be used to test the predictive validity, or the lack of statistical association, between categorical attributes when a rule agrees or disagrees with the data set. These contingency tables represent the ability of production rules to fit the data set using two components, dependent and independent variables. In terms of a production rule, the independent variables are the "input" attributes and values that produce a rule associated with an "output" category of the dependent variable. Expressed in IF X THEN Y format, the dependent variable represents the right hand side (RHS) of the rule while the independent variables represent the left hand side (LHS). For example:






Basic Contingency Table
("Hit" implies the rule conditions were met in the database.)

                                      DEPENDENT VARIABLE (RHS)
                                      HIT          NOT HIT
INDEPENDENT VARIABLES     HIT          a               b
(LHS)                     NOT HIT      c               d

(Where a counts agreements between the rule category and the data set, b and c count errors of omission of either the rule category or the data set, and d counts records that miss both.)



Such contingency tables can measure trust in the rules by several methods. As an example, simple confidence factors, such as a / (a + b), can be used to indicate the accuracy of the rules in the database. Statistical confidence can be estimated for the parameters of the contingency table using binomial distributions. Contingency tables can also be used to develop frequencies of cross classification, joint probabilities, and marginal probabilities [Ref. 4:p. 175]. Binomial standard deviations can also be developed for the conditions in the contingency table (HIT-HIT, NOT HIT-HIT, etc.).
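As a concrete illustration of these measures, the following C sketch computes the simple confidence factor a / (a + b) and a binomial standard deviation from a filled-in contingency table. It is a minimal sketch, not code from any of the systems discussed in this thesis; the struct layout and the sample counts in main are hypothetical.

#include <math.h>
#include <stdio.h>

struct contingency {
    double a;   /* rule LHS hit,  data category hit  */
    double b;   /* rule LHS hit,  data category miss */
    double c;   /* rule LHS miss, data category hit  */
    double d;   /* rule LHS miss, data category miss */
};

/* Simple confidence factor: the fraction of LHS matches that also
 * match the RHS category, a / (a + b). */
double confidence(const struct contingency *t)
{
    return t->a / (t->a + t->b);
}

/* Binomial standard deviation of that confidence estimate, treating
 * each LHS match as an independent Bernoulli trial. */
double confidence_stddev(const struct contingency *t)
{
    double n = t->a + t->b;
    double p = t->a / n;
    return sqrt(p * (1.0 - p) / n);
}

int main(void)
{
    struct contingency t = { 30.0, 8.0, 3.0, 22.0 };   /* hypothetical counts */
    printf("confidence = %.3f +/- %.3f\n",
           confidence(&t), confidence_stddev(&t));
    return 0;
}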

3. Interesting 

Discovered patterns are meaningless if the information contained within them lacks relevance to the user. Patterns must be interesting, that is, useful. "Knowledge is useful when it can achieve a goal of the system or the user." [Ref. 1:p. 4]






To achieve usefulness, knowledge discovery methods must perform functions that filter trivial from non-trivial information based on the user's interest. But such filtering can be a double-edged sword. A knowledge discovery system that incorporates functions for predefining attributes of interest can also, de facto, limit the scope of knowledge discovery. For example, a researcher may be interested in relationships between certain dependent and independent variables, to the exclusion of other variables in the database. Even though a researcher may be searching for patterns about variable X that involve variables Y and Z, the researcher may not know what those patterns look like; they may involve other variables such as A and B. The key to this dilemma is to give the discovery system the ability to define "areas of interest" while still allowing enough autonomy in mining the data to discover unanticipated patterns.

4. Efficiency 

There are several efficiency measures for a knowledge discovery system. A major factor in efficiency is the processing time required per unit of data analyzed. Searching for knowledge is usually accomplished in large databases, requiring significant memory and CPU time. Mitigating this concern is the fact that large knowledge discovery programs are not routine jobs and are usually run in the background of other computer transactions.






III EVOLUTION OF KNOWLEDGE DISCOVERY SYSTEMS

A. STATISTICAL APPROACHES 

Traditionally, relationships in data have been discovered through the statistical methods of deduction and inference. In statistical methods, the quantitative properties of information are categorized into either descriptive or inferential statistics [Ref. 5:p. 2].

Descriptive statistics is a method for organizing and summarizing information. For example, breaking data down into categories and developing percentages represents a descriptive approach (e.g., the 1992 election was between three parties, Republicans, Democrats, and independents, who received 46, 38, and 12 per cent of the popular vote). Inferential statistics is a method for developing conclusions about the data by analyzing a sample of it. Inferential statistics relies on mathematical models of distributions; e.g., based upon a random sampling of 500 people, CNN newswire stated that 43 per cent of Americans approved of President Clinton's handling of events in Somalia. To effectively classify and generate interesting statements regarding analyzed data, knowledge discovery systems must employ both descriptive and inferential statistics.
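To make the inferential step concrete, the short C sketch below attaches a 95 per cent confidence interval to the poll figure in the example above, using the normal approximation to the binomial distribution. The sketch is illustrative only; the 1.96 quantile and the interval construction are standard statistics, not part of any system discussed in this chapter.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = 500.0;          /* sample size from the example            */
    double p = 0.43;           /* 43 per cent sample approval             */
    double se = sqrt(p * (1.0 - p) / n);   /* standard error of p         */
    double z = 1.96;           /* 95 per cent two-sided normal quantile   */

    printf("approval: %.1f%%  95%% CI: [%.1f%%, %.1f%%]\n",
           100.0 * p, 100.0 * (p - z * se), 100.0 * (p + z * se));
    return 0;
}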






1. Statistical Association Methods 

There are numerous approaches for developing 
associations through descriptive and inferential statistics. 
One is the multivariate linear regression models, as 
discussed in Chapter II B.l. Another is statistical cluster 
analysis, a form of conceptual clustering, places objects 
into groups (clusters) suggested by similarities in the 
data. Data is summarized in these clusters by similarities 
in the data characteristics. In cluster analysis, a priori 
knowledge of the data set is not required. 

Discriminate analysis is another approach that 
selects a subset of the quantitative independent variables 
that reveal differences among the dependent variable 
categories [Ref . 6:pp. 39-40] . The selection process is 
accomplished through statistical association tests, such as 
the F test, R 2 partial correlation and Wilkes Lambda tests. 
Discriminate functions are evaluated by estimating the 
probabilities of misclassifying the data [Ref. 6:pp. 909- 
910] . However, discriminate analysis, requires a prior 
knowledge of the categories. 

Another statistical analysis method is the process of reification [Ref. 2:p. 407]. Reification takes a set of database fields (objects) and collapses them into a subset of smaller object fields (called descriptors) that best describe the object's key attributes. Reification techniques include factor analysis, principal component analysis, and cluster analysis. As we shall see later in this chapter, developing the statistical attributes of database objects is a key component of Parsaye's knowledge discovery engines IXL (Induction with extremely Large databases) and IDIS (Information Discovery System).

As we have seen, statistical analysis has traditionally laid the foundation for knowledge discovery. The ability to classify (descriptive statistics) and to formulate conclusions based upon sampling the data (inferential statistics) has been an important first step in knowledge discovery techniques. Further, statistical analysis is also crucial in measuring the quality of discovered knowledge. Parsaye elaborates on the need to extend the statistical approach in two basic ways. First, because statistics is a quantitative approach, output results come in the form of mathematical functions or equations. Since the average user of a database mining system is not a statistician, these complex representations must be converted into formats the user understands. Second, the statistical approach should be transparently built into the discovery system so that the user need not be a statistical expert.

B. INDUCTIVE LEARNING 

Inductive learning is defined as the ability to describe a class from a review or analysis of the individual objects in that class [Ref. 2:p. 404]. Inductive inference is the basis of inductive learning and is the process of generalizing assertions from specific observations about objects in classes of data. The inductive process takes an initial inductive hypothesis and develops assertions (rules) that can account for the observations of objects within the classes.

There are two basic approaches to inductive rule formulation: data-driven and model-driven. Model-driven methods express an a priori model of the data, in terms of production rules, and then test these models against an actual data set. In data-driven methods, the data is first analyzed and then inductive models are developed to define the data set. The latter approach, data-driven modeling, is used for knowledge discovery in this research. This approach is seen as a more flexible means of developing rule structures, particularly since little a priori knowledge may be available about the data being analyzed. Additionally, true knowledge discovery of unanticipated results is more likely if the knowledge search is not hampered by constrictive biases introduced by a priori models.

Using a data-driven modeling approach, William Messier and James Hansen [Ref. 7:p. 1414] found considerable success in developing quality production rules for expert systems. They point out, however, that there are a few limitations to the inductive approach. First, the data-driven method is difficult to apply to very large databases; rule development is prohibitive when the availability and range of the variables are large and diverse. Second, inductive approaches can develop significant errors if the data set contains missing or erroneous values. Missing important instances or attributes may lead to rules that are misleading.

The ability to generalize inductively is an important part of knowledge discovery. The goal of inductive generalization methods in production systems is to find classes or ranges of objects in the variable set that make the rules more widely applicable. More than one dependent variable category may be used to broaden the independent variables. The intent is to maximize the applicability of the attribute values. In rule induction, RHS dependent variables are broadened to include a wider range of the independent values. For example, a rule using an RHS dependent variable containing three categories, K_1...K_3, could match a larger number of LHS independent variables (attributes) if the algorithm allowed more than one RHS category to be included as an observation in the rule.

Generalization can also be accomplished by backing off from the exact values of the independent variables in order to gain a larger sampling of the data set; e.g., variable X may apply to only a few instances in (1 < X < 3), whereas expanding the range of X to (-100 < X < 500) may broaden the applicability of X to a larger portion of the data set. Exact rules, on the other hand, limit the data set to matching one dependent variable and include only HIT-HIT situations.

The following section discusses a few of the early inductive rule production systems based on data-driven approaches. As these systems developed, they incorporated techniques for dealing with the above problems and evolved into more powerful knowledge discovery systems.
1. CLS and ID3 

An early inductive learning technique is the Concept Learning System (CLS) algorithm developed by Hunt et al. (1966). The object of the algorithm is to take objects of a known class (categories of the dependent variable), described in terms of their attributes (independent variables), and generate a production system that classifies these given objects. In CLS a decision tree is developed by repeatedly segregating the data into smaller and smaller subsets. Each subset holds certain characteristics of the attributes that classify it into a separate category.

Quinlan (1979) eventually developed CLS into a more sophisticated program. This program, ID3, uses induction to develop a decision tree process that classifies and breaks the data down into a set of rules. ID3's inputs are a known class of data described in terms of a fixed set of attributes. ID3's output is a decision tree that classifies the given cases of information. ID3 incorporates a top-down induction approach using divide-and-conquer techniques to split the data on interesting attributes. The inductive algorithm constructs a rule by incrementally building a classification tree, repeatedly adding attributes to underspecified branches based on the estimated discriminatory power of each branch. This process continues iteratively until the data is fully partitioned.
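The thesis does not spell out how that discriminatory power is estimated; ID3 is usually described as using information gain, the reduction in class entropy obtained by splitting on an attribute. The C sketch below shows that standard calculation under assumed data structures (a two-class problem with hypothetical counts); it is an illustration of the measure, not code from ID3 itself.

#include <math.h>
#include <stdio.h>

#define NCLASSES 2   /* hypothetical two-class problem for illustration */

/* Entropy (in bits) of a vector of class counts. */
static double entropy(const int counts[NCLASSES])
{
    double total = 0.0, h = 0.0;
    for (int i = 0; i < NCLASSES; i++) total += counts[i];
    for (int i = 0; i < NCLASSES; i++) {
        if (counts[i] > 0) {
            double p = counts[i] / total;
            h -= p * log2(p);
        }
    }
    return h;
}

/* Information gain of splitting on an attribute with nvalues values:
 * parent entropy minus the size-weighted entropy of the child subsets. */
double information_gain(const int parent[NCLASSES],
                        const int children[][NCLASSES], int nvalues)
{
    double total = 0.0, remainder = 0.0;
    for (int i = 0; i < NCLASSES; i++) total += parent[i];
    for (int v = 0; v < nvalues; v++) {
        double size = 0.0;
        for (int i = 0; i < NCLASSES; i++) size += children[v][i];
        remainder += (size / total) * entropy(children[v]);
    }
    return entropy(parent) - remainder;
}

int main(void)
{
    int parent[NCLASSES] = { 9, 5 };                      /* class counts before the split */
    int children[2][NCLASSES] = { { 6, 1 }, { 3, 4 } };   /* counts after a binary split   */
    printf("gain = %.3f bits\n", information_gain(parent, children, 2));
    return 0;
}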

ID3 is perhaps the most popular method of decision tree analysis because it is easy to implement and produces simple decision trees effectively [Ref. 8:p. 172]. However, as Parsaye and Hansson [Ref. 9:p. 141] point out, ID3 can easily generate too many decision trees with meaningless or uninteresting rules. Additionally, they find several other faults in ID3. First, ID3 is very sensitive to changes in the database; small changes to the data can produce large variations in the decision trees. Second, ID3 cannot generalize about the database; it produces exact decision trees, not general rules. Finally, and similarly, ID3 cannot deal with inexact data and therefore cannot produce inexact recommendations.
2. AQ Algorithms 

Parsaye's analysis of CLS and ID3 found that the decision tree approach to rule generation lacks robustness. Producing exact rules may help segregate the database into unique subsets, but it severely limits the ability to open the data analysis up to more useful general interpretations.

The AQ family of algorithms is another approach to knowledge discovery that sought to overcome this lack of generalization. These algorithms use both type and structural information from the database to generalize from given data samples. Constraints are applied to limit and focus the search patterns. The basic idea is to select one example, generalize on that example, determine how much of the generalization applies to the database, and ensure the generalization does not violate the constraints. The Star system (Michalski 1983) is a "knowledge generalization" system because of its ability to combine an inductive generalization process with user-applied constraints. Parsaye discusses a definite problem with the AQ-based algorithms when they are applied to large databases: too many generalizations may be produced, causing the conclusions to be too "liberal." Additionally, the AQ algorithms could not deal with inexact data and tended to produce nonsensical rules.

The previous sections have discussed statistical and inductive approaches to knowledge discovery. As pointed out, each approach has its own strengths and weaknesses. Statistical methods can be overly complex and difficult to use. On the other hand, inductive methods such as CLS, ID3, and AQ may have a tendency to produce rules that are either too exact or too liberal. A new approach is required that combines the strengths of both statistical and inductive methods while overcoming their inherent limitations. This new approach was introduced by Parsaye with IXL.
3. IXL/IDIS

Parsaye combined the statistical and inductive (machine learning) approaches in his knowledge discovery program, IXL (Induction with extremely Large databases). IXL "... form(s) and test(s) various hypotheses about relationships in the database and uses machine learning algorithms to generate rules." [Ref. 10:p. 2]. At the heart of IXL, and of the later version of the program, IDIS, are two core modules. The statistical module analyzes the data for statistical relationships, and the artificial intelligence (AI) module interprets and tests these relationships, transposing them into easily understood rules [Ref. 10:p. 3]. The sophistication of IXL/IDIS comes into play when the program goes beyond sorting simple database descriptive statistics. By forming hypotheses on the statistical relationships, testing the hypotheses, and repeating the process until patterns emerge, IXL/IDIS offers a clear alternative to the generalization problems of the CLS, ID3, and AQ algorithms. IXL/IDIS is able to circumvent the generalization problems that have plagued earlier knowledge discovery programs by allowing user-defined constraints on the AI module's pattern analysis. In addition, IXL has the ability to search large databases (in the hundreds of millions of records). For these reasons IDIS, the updated version of IXL, was chosen to benchmark the genetic programming approach developed at the Naval Postgraduate School.

IXL/IDIS has seen some success in commercial applications. The U.S. Army and Air Force Exchange Service is an example where it has been applied to determine sales patterns based on customer demographics. IXL helped them target sales and advertising toward the appropriate customer base. [Ref. 9:p. 157]

C. SUMMARY 

This chapter has reviewed the technologies that make it possible to capture large-scale databases: mass storage, on-line databases connected by increasingly faster networks, and the transfer of textual data into binary formats. Coupled with a vast increase in computer capacity, these innovations help deal with complex data and the search for meaningful relationships among elements in a database. Recent attempts at knowledge discovery have shifted from traditional statistical techniques involving deductive analysis to more declarative techniques involving inductive analysis. Inductive algorithms, such as CLS and ID3, have shown great promise but are too restrictive to produce general rules. On the other hand, AQ algorithms create too many generalizations, producing many nonsensical rules. A new breed of systems (such as IDIS) that combine both statistical and inductive approaches is beginning to exploit the strengths of both.






IV GENETIC ALGORITHMS / GENETIC PROGRAMMING 

A. GENETIC ALGORITHMS 

Genetic algorithms (GAs) emulate natural evolution, the process that Darwin interpreted as adapting species to their environmental conditions. A genetic algorithm can be viewed as an adaptive search procedure based on a model of population genetics and natural selection. Koza defines the genetic algorithm as follows:

The genetic algorithm is a highly parallel mathematical
algorithm that transforms a set (population) of
individual mathematical objects (typically fixed-length
character strings patterned after chromosome strings),
each with an associated fitness value, into a new
population (i.e., the next generation) using operations
patterned after the Darwinian principle. [Ref. 11:p. 18]

Genetic algorithms solve problems by evolving potential solutions through a process of randomly recombining the critical aspects of the problem. Problem conditions, actions, or characteristics are typically represented as binary strings of data, which are combined through genetic operations such as mutation or crossover with other possible conditions, actions, or characteristics of the problem. Eventually, an optimal condition is achieved after numerous generations of "breeding" possible solutions. Some of the basic terminology used in genetic algorithms follows [Ref. 11:pp. 18-26]:

1. Population: the set of candidate solutions, or individuals.

2. Generation: one iteration of the genetic algorithm.

3. Crossover: the process by which two individuals in the population exchange parts of their internal representation to create new individuals.

4. Mutation: the random change of part of the internal representation of an individual.

5. Fitness: the quantitative measure used to select individuals from the population. The chosen individuals represent those that survive in the population and may successfully reproduce.

6. Reproduction: the process by which a parent, or part of one, is copied into the new generation without any alteration.

Genetic algorithms operate on populations of data (individuals), where each individual represents a potential problem solution. Individuals are selected during each generation according to their "fitness" and then recombined through crossover. Offspring may undergo mutation before they are inserted into the new population. This process continues until a specified termination criterion is met. Figure 1 shows a general flow chart of the genetic algorithm process operating on strings of individuals, as depicted by John Koza [Ref. 11:p. 29].
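To make the loop concrete, the following C program is a minimal, self-contained sketch of this generational cycle applied to a toy problem (maximizing the number of 1 bits in a fixed-length string). The problem, the population size, the crossover and mutation rates, and the use of a small tournament as the fitness-based selection step are all illustrative assumptions, not settings from Koza's system or from the NPS program described later.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define POP_SIZE 100
#define GENES 32
#define GENERATIONS 50
#define CROSSOVER_RATE 0.75

typedef struct { int genes[GENES]; int fitness; } Individual;

static int evaluate(const Individual *ind)            /* fitness = number of 1 bits */
{
    int f = 0;
    for (int i = 0; i < GENES; i++) f += ind->genes[i];
    return f;
}

static int tournament(const Individual pop[], int k)   /* best of k random picks */
{
    int best = rand() % POP_SIZE;
    for (int i = 1; i < k; i++) {
        int j = rand() % POP_SIZE;
        if (pop[j].fitness > pop[best].fitness) best = j;
    }
    return best;
}

int main(void)
{
    Individual pop[POP_SIZE], next[POP_SIZE];
    srand((unsigned)time(NULL));

    for (int i = 0; i < POP_SIZE; i++) {                /* random initial population */
        for (int g = 0; g < GENES; g++) pop[i].genes[g] = rand() % 2;
        pop[i].fitness = evaluate(&pop[i]);
    }

    for (int gen = 0; gen < GENERATIONS; gen++) {
        for (int i = 0; i < POP_SIZE; i += 2) {
            Individual a = pop[tournament(pop, 2)];     /* fitness-based selection */
            Individual b = pop[tournament(pop, 2)];
            if ((double)rand() / RAND_MAX < CROSSOVER_RATE) {
                int cut = rand() % GENES;               /* one-point crossover     */
                for (int g = cut; g < GENES; g++) {
                    int tmp = a.genes[g]; a.genes[g] = b.genes[g]; b.genes[g] = tmp;
                }
            }
            if (rand() % 100 == 0) a.genes[rand() % GENES] ^= 1;   /* rare mutation */
            if (rand() % 100 == 0) b.genes[rand() % GENES] ^= 1;
            a.fitness = evaluate(&a);
            b.fitness = evaluate(&b);
            next[i] = a;
            next[i + 1] = b;
        }
        for (int i = 0; i < POP_SIZE; i++) pop[i] = next[i];   /* next generation */
    }

    int best = 0;
    for (int i = 1; i < POP_SIZE; i++)
        if (pop[i].fitness > pop[best].fitness) best = i;
    printf("best fitness after %d generations: %d / %d\n",
           GENERATIONS, pop[best].fitness, GENES);
    return 0;
}

In the rule-induction setting of later chapters, the individuals would be candidate rules rather than bit strings, but the generate, evaluate, select, and recombine cycle is the same.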



Figure 1. Flow Chart of the Genetic Algorithm Process (create an initial random population; evaluate the fitness of each individual; if the termination criterion is satisfied, designate the result; otherwise select a genetic operation probabilistically, select individuals based on fitness, perform reproduction, crossover, or mutation, insert the results into the new population, and increment the generation counter)



John Holland developed the genetic algorithm and provided its theoretical foundation in his book Adaptation in Natural and Artificial Systems [Ref. 12:pp. 31-58]. Although the basics of genetic algorithms have been around since the mid-1960s [Ref. 13:p. 9], GAs have primarily been utilized in optimization and classification problems. Rule induction with genetic algorithms is a relatively new approach. Kenneth A. De Jong suggests using GA approaches for rule production systems: by "...maintain[ing] a population of candidate rule sets," a GA can utilize the selection process to produce optimal rules [Ref. 14:p. 625].

B. SAMUEL: A RULE GENERATING GENETIC ALGORITHM 

SAMUEL, standing for Strategy Acquisition Method Using Empirical Learning, is a genetic algorithm program developed at the Naval Research Laboratory. SAMUEL is designed to investigate behavior in simulation models; in essence, it learns rules for decision-making agents. Coupling feedback mechanisms with performance evaluation techniques, SAMUEL seeks to learn optimal decision strategies through a set of rules that evolve over time. The goal of the program is to use a genetic algorithm to refine and improve upon an initial set of knowledge strategies, represented by a population of rules.

Knowledge is represented in SAMUEL through three levels 
of knowledge structures: populations, plans and rules. 






Populations consist of sets of plans to deal with environmental simulations in the program. Plans, in turn, are composed of specific condition-action rules intended to represent performance strategies for specific environmental conditions. Each rule consists of IF-THEN conditional statements representing a set of conditions and the appropriate responses to those conditions (actions). An example rule might be:

Rule 10

IF range = [250, 1000] and speed = [500, 1200]
THEN SET turn = [0, 90]

Performance evaluation of the competing plans is accomplished through the GA. Each generation contains a population of plans that compete and are measured by a fitness function. Based on the relative grades assigned by the fitness evaluation, the GA selects the best of the high performing plans for additional reproduction, crossover, and mutation. The reproduction and testing cycles repeat until user-specified criteria are met.

The advantage of SAMUEL is that it provides an important framework for developing a GP that analyzes data to generate and evaluate production rules. SAMUEL is limited, however, by its restrictive IF-THEN rule structure. SAMUEL is also limited to rule structures of fixed length and does not incorporate combinations of logical operators, such as AND, OR, and NOT. These limitations on rule structure are removed by genetic programming.

C. GENETIC PROGRAMMING 

For many problems in machine learning and artificial
intelligence, the most natural known representation
for a solution is a hierarchical computer program of
indeterminate size and shape. [Ref. 11:p. 210]

Genetic programming is a variant of genetic algorithms with a different problem representation. The basic approach to genetic programming is summarized by John Koza in three steps as follows [Ref. 11:p. 213]:

1. Generate an initial population of random compositions of the functions and terminals of the problem (computer programs).

2. Iteratively perform the following substeps until the termination criterion has been satisfied:

   a. Execute each program in the population and assign it a fitness value according to how well it solves the problem.

   b. Create a new population of computer programs by applying the following two primary operations:

      (i) Copy existing computer programs to the new population.

      (ii) Create new computer programs by genetically recombining randomly chosen parts of two existing programs.

   These operations are applied to computer programs in the population chosen with a probability based on Darwinian fitness.

3. The single best computer program in the population is designated as the result of the genetic program. The result may be a solution (or an approximate solution) to the problem.
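Step 2.b.(ii) is the distinctive genetic programming operation: instead of splicing fixed-length strings, it swaps randomly chosen subtrees between two program trees. The C sketch below illustrates that idea on tiny arithmetic expression trees; the node layout, helper names, and example parents are assumptions for illustration, not Koza's implementation or the NPS program.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct Node {
    char label;                 /* function or terminal symbol     */
    struct Node *left, *right;  /* both NULL for a terminal (leaf) */
} Node;

/* Record the address of every link in the tree (including the root link)
 * so a crossover point can be chosen uniformly at random. */
static void collect_links(Node **link, Node ***out, int *n)
{
    if (*link == NULL) return;
    out[(*n)++] = link;
    collect_links(&(*link)->left, out, n);
    collect_links(&(*link)->right, out, n);
}

/* Swap a randomly chosen subtree of parent a with one of parent b. */
static void crossover(Node **a, Node **b)
{
    Node **links_a[256], **links_b[256];
    int na = 0, nb = 0;
    collect_links(a, links_a, &na);
    collect_links(b, links_b, &nb);

    Node **cut_a = links_a[rand() % na];
    Node **cut_b = links_b[rand() % nb];
    Node *tmp = *cut_a;
    *cut_a = *cut_b;            /* graft b's subtree into a ... */
    *cut_b = tmp;               /* ... and a's subtree into b   */
}

static Node *leaf(char label)
{
    Node *n = malloc(sizeof(Node));
    n->label = label;
    n->left = n->right = NULL;
    return n;
}

static Node *branch(char label, Node *l, Node *r)
{
    Node *n = leaf(label);
    n->left = l;
    n->right = r;
    return n;
}

static void print_tree(const Node *n)       /* infix print of the program tree */
{
    if (n->left == NULL) { printf("%c", n->label); return; }
    printf("(");
    print_tree(n->left);
    printf(" %c ", n->label);
    print_tree(n->right);
    printf(")");
}

int main(void)
{
    /* Two tiny parent "programs": (x + y) * z  and  x - (y / z). */
    Node *p1 = branch('*', branch('+', leaf('x'), leaf('y')), leaf('z'));
    Node *p2 = branch('-', leaf('x'), branch('/', leaf('y'), leaf('z')));

    srand((unsigned)time(NULL));
    crossover(&p1, &p2);
    printf("offspring 1: "); print_tree(p1); printf("\n");
    printf("offspring 2: "); print_tree(p2); printf("\n");
    return 0;
}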






The application of genetic programming is relatively 
new. For the most part these applications have focused on 
optimization and classification problems unrelated to 
generating production rules. [Ref. 15:p. 12]. 






V NPS GENETIC PROGRAM (NPSGP) 

A. BACKGROUND 

The Naval Postgraduate School Genetic Program (NPSGP) adapts Walter Tackett's [1993] implementation of John Koza's genetic program developed at Stanford University. NPSGP supports rule generation. Constraints can be added to the structures evolved by the GP; these constraints define the formats for the rules generated by the system. A format specified in NPSGP dictates that rules follow the (IF X THEN Y) pattern. In these rules, the dependent variable is on the right hand side (RHS) and the independent variables are on the left hand side (LHS). In genetic programming these independent variables are called terminal sets. Before going further into the details of the NPSGP system, the data set used to evaluate the program is described; this will assist in explaining the functions of NPSGP.

1. Data Description 

The data set used to demonstrate NPSGP contains quantitative data, specifically, the basic components of U.S. Gross Domestic Product (GDP) for the years 1929 to 1991 [Ref. 16:pp. 341-474]. The components include Personal Savings, Disposable Personal Income, and Net Exports/Imports. Other relevant variables were also used, such as civilian unemployment rates and 10-year Government Bond rates. Finally, some of this data was converted into ratios, e.g., Personal Consumption Expenditures divided by GDP.

GDP Growth, the dependent variable, was defined as the change in GDP and was categorized into four discrete classes. The first category was assigned a value of high positive (HPOS) if indexed GDP grew by more than 2.5 per cent from the previous year. Indexed GDP growth from 0 to 2.5 per cent was categorized as low positive (LPOS). GDP growth of 0 to -2.5 per cent was categorized as low negative (LNEG). Any negative growth greater than -2.5 per cent was classified as high negative (HNEG).
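A minimal C sketch of this four-way categorization is shown below. The thresholds come from the text; the function name, the handling of values falling exactly on a boundary, and the enum ordering (which happens to match the numeric category codes 0 through 3 seen in the GP output later in this chapter) are assumptions for illustration.

#include <stdio.h>

/* Category codes 0..3; the ordering is an assumption. */
typedef enum { HPOS = 0, LPOS = 1, LNEG = 2, HNEG = 3 } GrowthCategory;

GrowthCategory categorize_gdp_growth(double pct_change)
{
    if (pct_change > 2.5)   return HPOS;  /* grew more than 2.5% over the previous year */
    if (pct_change >= 0.0)  return LPOS;  /* growth from 0 to 2.5%                      */
    if (pct_change >= -2.5) return LNEG;  /* decline from 0 to -2.5%                    */
    return HNEG;                          /* decline of more than 2.5%                  */
}

int main(void)
{
    printf("%d\n", categorize_gdp_growth(1.7));   /* prints 1 (LPOS) */
    return 0;
}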

The function set consisted of the operators in the rule structures (i.e., IF, OR, AND, NOT). The rule is represented by a collection of "terms," where each term specifies an attribute-value range, called a selector. As an example, a selector including the attribute UNEMPLOYMENT may have an applicable range of 0 to 25 per cent in the database: 0 < UNEMPLOYMENT < 25%.

Such a strategy offers robust rules through compensatory selection; the terminal set selected by one variable can be compensated for by the terminal set selected by another variable. For example, the following rule offers a compensatory strategy for either the change in the 10-year bond rate or the change in personal savings and gross public debt as a percentage of GDP:

GDP GROWTH = LPOS
IF
    2.2 < CHANGE IN THE 10 YEAR BOND RATE < 9.20
OR
    0.95 < GROSS PUBLIC DEBT AS A % OF GDP < 2.16
AND
    1.35 < CHANGE IN PERSONAL SAVINGS < 2.61
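To make the idea of selectors concrete, the C sketch below shows one way such a term could be represented and tested against a data record (the range test that appears as TWIXT in the GP output later in this chapter). The struct layout, field names, and sample record are illustrative assumptions, not NPSGP's internal representation; in NPSGP the function set operators (IF, AND, OR, NOT) combine many such tests into a complete rule.

#include <stdio.h>

/* One "term" of a rule: an attribute-value range, low < value < high. */
struct selector {
    int attribute;        /* column index into the data record */
    double low, high;
};

/* Does the selector cover a given record? */
static int selector_matches(const struct selector *s, const double record[])
{
    return record[s->attribute] > s->low && record[s->attribute] < s->high;
}

int main(void)
{
    /* Hypothetical record layout: column 0 holds the unemployment rate (%). */
    struct selector unemployment = { 0, 0.0, 25.0 };   /* 0 < UNEMPLOYMENT < 25 */
    double record[1] = { 6.8 };

    printf("selector %s the record\n",
           selector_matches(&unemployment, record) ? "covers" : "does not cover");
    return 0;
}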

B. CONTROL PARAMETERS 

NPSGP offers a number of control parameters that the 
user can adjust to tailor the program to the type of data 
and analysis. 

1. Fitness Measure 

The fitness measure is the quantitative evaluation of how well a rule matches the data. NPSGP uses a contingency table approach to determine whether the LHS of a rule classifies an RHS category (HPOS, LPOS, LNEG, or HNEG). First, an observation either belongs to the RHS category (C+) or does not belong to that particular category (C-). Second, the LHS of the rule may be true for the observation (M+) or false (M-). The following contingency table shows the four possible conditions:








                          RIGHT HAND SIDE
                          C+          C-

LEFT HAND       M+        A           B
SIDE
                M-        C           D



A, B, C, and D represent the total number of data points in each subset. For example, "A" represents the number of cases where the LHS and the RHS are both true for an observation; this is the HIT-HIT condition. "B" represents the number of cases where the LHS is true but the RHS is false; this is the MISS-HIT condition. "C" represents the HIT-MISS condition, where the RHS is true but the LHS is false, and "D" is the MISS-MISS condition, where both the LHS and RHS are false.

A fitness function may be defined in terms of the various cases in the contingency table. For example, the simplest method for computing fitness is to divide the MISS-HIT value B by the HIT-HIT value A. To eliminate the possibility of division by zero, a small constant is added to each of the parameters, i.e., (B+1)/(A+1). NPSGP minimizes the value of the fitness function.

The following two examples illustrate how fitness functions are used to evaluate rules. In both cases there are 63 data points for each attribute. The first case assumes that a rule produces 38 instances of HIT-HIT, category A, no instances of category B, MISS-HIT, and no instances of category C, HIT-MISS. Adding the constant value of 1 to each of the parameters, the following fitness measure is calculated:

Fitness = (B + C + 1)/(A + C + 1)
        = (0 + 0 + 1)/(38 + 0 + 1) = 0.025

In the second example, a rule produces 30 instances of HIT-HIT, category A, 8 instances of MISS-HIT, category B, and 3 instances of HIT-MISS, category C. The fitness calculation then yields:

Fitness = (B + C + 1)/(A + C + 1)
        = (8 + 3 + 1)/(30 + 3 + 1) = 0.3529

The second rule did not fit the data set as well as the first rule. It had 8 cases where the LHS of the rule fit the data set (interest rates, savings rates, etc.) but not the RHS dependent variable (GDP growth = HPOS, LPOS, LNEG, or HNEG) and 3 cases where the RHS category fit the dependent variable but the LHS did not fit the data set. The higher fitness measure indicates the fit was not as good.
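The same computation is easy to express in code. The C sketch below reproduces the two worked examples above; the formula and the counts are from the text, while the struct layout and the D values (filled in so each example totals the 63 data points mentioned) are assumptions.

#include <stdio.h>

struct contingency {
    double A;   /* HIT-HIT:   LHS true,  RHS category true  */
    double B;   /* MISS-HIT:  LHS true,  RHS category false */
    double C;   /* HIT-MISS:  LHS false, RHS category true  */
    double D;   /* MISS-MISS: LHS false, RHS category false */
};

/* NPSGP minimizes this value: a smaller result means a better-fitting rule. */
double fitness(const struct contingency *t)
{
    return (t->B + t->C + 1.0) / (t->A + t->C + 1.0);
}

int main(void)
{
    /* D is filled in so each example totals the 63 data points in the text. */
    struct contingency first  = { 38.0, 0.0, 0.0, 25.0 };
    struct contingency second = { 30.0, 8.0, 3.0, 22.0 };

    printf("fitness 1 = %.4f\n", fitness(&first));    /* 1/39  = 0.0256 */
    printf("fitness 2 = %.4f\n", fitness(&second));   /* 12/34 = 0.3529 */
    return 0;
}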

2. Selection Criteria 

Tournament selection is a method used to select the best-fitting rules from the total generated population. It is similar to the way a winner is chosen in a tournament among competing teams. In NPSGP the tournament size was set at six, resulting in a grouping of six rules from which the best is then chosen. Another method is probabilistic selection, where the probability of selecting an individual depends on its fitness. For example, the best individual in the population may have a high probability (close to 1.0), while individuals in the mid-range may have a probability of approximately 0.5. The worst individual of the group will have a probability of 0. [Ref. 11:pp. 604-607]
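A minimal C sketch of the tournament step, with the tournament size of six used in NPSGP, is shown below. The function and the sample fitness values are illustrative assumptions rather than NPSGP code; because NPSGP minimizes its fitness measure, the winner here is the individual with the lowest value.

#include <stdio.h>
#include <stdlib.h>

#define TOURNAMENT_SIZE 6

/* Draw six individuals at random from the population and return the index
 * of the one with the best (lowest) fitness value. */
int tournament_select(const double fitness[], int pop_size)
{
    int best = rand() % pop_size;
    for (int i = 1; i < TOURNAMENT_SIZE; i++) {
        int challenger = rand() % pop_size;
        if (fitness[challenger] < fitness[best])
            best = challenger;
    }
    return best;
}

int main(void)
{
    /* Hypothetical fitness values for eight candidate rules. */
    double fitness[8] = { 0.33, 0.26, 0.41, 0.29, 0.35, 0.31, 0.27, 0.38 };
    int winner = tournament_select(fitness, 8);
    printf("selected rule %d with fitness %.2f\n", winner, fitness[winner]);
    return 0;
}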

3. Population and Generation

NPSGP also allows the population size to be specified by the user. A population of 1,000, for example, means that 1,000 rules are initially randomly generated and then competitively adapted through crossover. Users can also select the total number of generations to be run, which serves as the termination point of the program. The total number of generations and the population size required to produce an optimal solution may be determined by trying several combinations of these parameters.

C. EVOLUTION OF RULES 

The program starts with an initial population of randomly generated rules composed from the function and terminal sets. This initial population will usually contain a high percentage of poorly fitting rules that are later refined by the crossover process. Reproduction and crossover operations are then applied to breed a new population of offspring rules. This process continues until the termination criterion is satisfied. Figure 2 outlines the major processes that the program follows in generating rules.





Figure 2. GP Program Flow (generate population; evaluate and sort; breed new population; check termination; stop)



D. AN EXAMPLE OF THE GP PROCESS 

The following illustrates the NPS genetic program output using the econometric data outlined in Appendix A. Program control parameters were set with a population size of 3000, a crossover rate of 75 per cent, and a termination parameter of 50 generations. The fitness measure (B + C + 1)/(A + C + 1) provides an initial means to minimize the number of HIT-MISSES and MISS-HITS. Although programmed in a "C" language shell, NPSGP produces rules using the LISP list format. Using the above fitness measure, NPSGP reported the best rule, i.e., the rule with the lowest fitness value, in generation 0 as follows:


Best of Generation 0:
(IF
    (AND
        (TWIXT
            4.760321
            UNEMPCIV
            -3.134517)
        (TWIXT
            1.770329
            BOND10YR
            -15.607920))
    3.000000)

Number of records matched by LHS: 2.000000
Number of misclassified records: 0.000000
Confidence: 1.000000
Validation Fitness = 0.333333






The rule can be easily translated to a more 
understandable format. The above rule translates to: 

IF 

-3.134517 < UNEMPCIV < 4.760321 
AND 

-15.607920 < BOND10YR < 1.770329 
THEN 

GDP GROWTH = HNEG 

As indicated above, the LHS of the rule matched two observations in the data set. The confidence factor, used as a measure of the exactness of a rule, is calculated by the formula A / (A + B). This formula calculates the percentage of correct RHS classifications out of the total RHS classifications when all LHS instances match the data set. In this example there were no misclassifications, so the formula computed to 2 / (2 + 0) = 1.0, representing 100 per cent confidence. The rule was exact in that it correctly classified all the records pertaining to that rule into a HIT-HIT situation. The fitness was computed as [B(0) + C(0) + 1] / [A(2) + C(0) + 1] = 1/3 = 0.33333. Constant values of 1 were added to the fitness numerator and denominator to avoid division by zero. Finally, the numeric code 3 at the end of the rule represents the RHS category, i.e., GROWTH = HNEG.

Better rules were evolved to fit the data set through 
generations of reproduction. As an example, the best rule 
of generation 3 follows: 






Best Rule of Generation 3:
(IF
    (AND
        (TWIXT
            0.341340
            GDBTPERGDP
            -2.211879)
        (TWIXT
            -4.860248
            GDPIPERCHG
            0.170821))
    0.000000)

Number of records matched by LHS: 32.000000
Number of misclassified records: 3.000000
Confidence: 0.906250
Validation Fitness = 0.315789



Translated:

IF
    -2.211879 < GDBTPERGDP < 0.341340
AND
    -4.860248 < GDPIPERCHG < 0.170821
THEN
    GROWTH = HPOS

The improvement in fitness, the NPSGP method of gauging rule improvement, is indicated by the decrease from 0.333333 to 0.315789. This rule applies to different LHS attributes and a different RHS category. Finally, the rule has 32 HIT-HIT classifications and 3 MISS-HITs. Here the fitness measure used moved the GP toward more generalized rules.

By generation 7 the fitness measure had "optimized": the validation fitness value of the best rule reached its minimum of 0.263158. As shown in the example below, this rule has a net decrease of one MISS-HIT observation from the above example.






Best Rule of Generation 7:
(IF
    (AND
        (TWIXT
            -6.015694
            PFDBTGDP
            0.895278)
        (TWIXT
            0.239258
            GDPIPERCHG
            -24.963970))
    0.000000)

Number of records matched by LHS: 32.000000
Number of misclassified records: 2.000000
Confidence: 0.937500
Validation Fitness = 0.263158



Translated:

IF
    -6.015694 < PFDBTGDP < 0.895278
AND
    -24.963970 < GDPIPERCHG < 0.239258
THEN
    GROWTH = HPOS

Through crossover, NPSGP was searching for the best rule that could fit both the RHS categories and the LHS terminal set. Although the best-of-breed fitness measure did not improve after this generation, changes continued in the rule attributes of subsequent generations. For example, the top rule in generation 50 included the attributes GDBTPERGDP and GDPIPERCHG. It had the same number of classifications/misclassifications and the same confidence as the best rule in generation 7. However, the rule in generation 7 contained the attributes PFDBTGDP and GDPIPERCHG.






This improvement of the fitness measure is graphically 
illustrated in Figure 3. With the relatively small data set 
used in this application, the fitness measure very quickly 
converged to an optimal value. 



(Figure 3 plots Fitness on the y-axis, from approximately 0.22 to 0.38, against Generation on the x-axis, from 0 to 10, showing the fitness measure declining as the generations progress.)

Figure 3. Fitness versus Generation



E. GP EVALUATION RUNS 

To test how sensitive the fitness measure is to changes in parametric inputs, the GP was run with different population sizes and crossover rates using the same fitness measure, (B + C + 1) / (A + C + 1). First, the crossover rate was varied while holding the population constant at 3,000. Crossover rates of 60, 65, 70, 75, 80, 85, 90, 95 and 100 per cent were tried. The results indicate that changes in the crossover rate had a minimal effect on fitness. In all cases, the fitness of the "best of generation" rule optimized in generation 7 with a value of 0.263158. This result may be due to the relatively small size of the database. A larger data set using real and discrete terminal values might have produced quite different results.

Another test was performed changing the population size to determine its effect on fitness convergence.



POPULATION     OPTIMAL GENERATION     OPTIMAL FITNESS

   500                 12                0.263158
 1,000                  8                0.263158
 2,000                 15                0.236842
 3,000                  7                0.263158
 5,000                 11                0.184211
10,000                 16                0.038462

Table 1. Fitness vs Population



As Table 1 indicates, the size of the population did affect the optimal fitness and the number of generations required to reach it. Increasing the population from 3,000 to 5,000 and then to 10,000 lowered the fitness value. This indicates there is a minimum population size required to produce the best rule results. The larger the selection pool, the better the statistical odds of producing rules that best model the data. The fitness measure (0.08333) of generation zero (without crossover) using a population of 10,000 was better than the optimal fitness of the smaller populations after generation 7.

F. SUMMARY 

Applying NPSGP to the data set revealed some important characteristics of GP rule generation. First, the composition of observations used in the fitness function is critical to the type and depth of rules produced by the GP. For example, the fitness function can be modeled either to produce generalized rules that include both HIT-HIT and MISS-HIT observations, or to produce exact rules using only HIT-HIT observations. This point is elaborated further in the next chapter.

Second, it is important to review all rules produced by the GP that meet a desired minimum fitness measure, due to the differences in the attributes and data range variations of those rules. Interesting rules and valuable data associations can be overlooked if users only focus on those rules where the program has minimized the fitness value. Third, population size has a significant effect on rule generation, including both fitness values and LHS attribute ranges.






VI COMPARISON OF KNOWLEDGE DISCOVERY RESULTS 

This chapter uses the data outlined in Appendix A to compare two known knowledge discovery approaches, i.e., statistical and inductive, against the NPSGP. The comparison is based on producing data element associations between the RHS categories and LHS attributes. As mentioned in Chapter III, IDIS was selected as representative of the inductive knowledge discovery approach because of its ability to overcome the limitations of other inductive systems. While there are numerous statistical methods for discovering data element associations, stepwise discriminant analysis (SDA) was used to select which LHS attributes most significantly discriminated between the RHS categories. The SDA was done using the SAS statistical package and is described in greater detail in the following sections.

The benchmark for comparing all three approaches was the ability to identify associations among LHS data elements. SDA accomplishes this by identifying statistically significant independent variables that discriminate between dependent variable categories. This SDA list of significant variables is then compared to the LHS attributes produced in the NPSGP rules. The rationale behind this comparison is to determine if NPSGP could develop data element associations that include variables found to be significant by SDA.

Benchmarking IDIS to NPSGP meant reviewing the generated 
rules to compare LHS attributes. Any attribute listed in a 
rule was assumed to be significant. The LHS attribute ranges 
were not reviewed in detail nor was there an attempt to 
determine the usefulness of the rules. 

A. STATISTICAL ASSOCIATIONS 

SAS discriminant analysis procedures analyze data sets containing one dependent variable and several independent variables. The purpose of this discriminant analysis is to find the subset of those quantitative independent variables that are statistically "important" and can reveal differences among the dependent variable categories. However, SDA is not necessarily the best approach for all purposes. It has to be used carefully and must reflect the user's knowledge of the data. [Ref. 6:p. 911].

Within the SAS discriminant analysis, the STEPDISC procedure was chosen to produce a discrimination model by selecting a subset of the quantitative variables based on one of the two following criteria:

1. Analysis of covariance using the F-test to determine significance. The variables chosen act as covariates to the dependent variable (GDP Growth).






2. Analysis of the squared partial correlation (partial R²) value between the dependent variable (GDP Growth) and the independent variables, controlling for the effects of other variables selected for the model. [Ref. 6:p. 909].



Stepwise selection begins the process without any variables in the model. Variables are included in (or later excluded from) the model if they contribute to the discriminatory power of the model as measured by Wilks' Lambda. This process is repeated in steps until every variable either meets the criterion to stay in the model or is dropped from it. There is one potential weakness of this approach: relationships among the unselected variables are not analyzed. This potentially excludes statistically significant variables. [Ref. 6:p. 910].
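A simplified sketch of this stepwise loop is given below. The partial_p_value() routine stands in for the Wilks' Lambda / F-test computation that SAS performs, and the SLENTRY/SLSTAY thresholds are assumed values; this is an illustration of the selection logic, not the SAS algorithm:

    #include <stdio.h>

    #define NUM_VARS 16
    #define SLENTRY  0.15   /* assumed significance level required to enter the model */
    #define SLSTAY   0.15   /* assumed significance level required to stay in it      */

    /* Placeholder for the F-test significance of a variable given the current model. */
    static double partial_p_value(int var, const int *in_model)
    {
        (void)var; (void)in_model;
        return 1.0;          /* a real implementation would compute this from the data */
    }

    int main(void)
    {
        int in_model[NUM_VARS] = {0};
        int changed = 1;

        while (changed) {    /* repeat until no variable is added or removed */
            changed = 0;
            for (int v = 0; v < NUM_VARS; v++) {
                if (!in_model[v] && partial_p_value(v, in_model) < SLENTRY) {
                    in_model[v] = 1;   /* forward step: include a contributing variable           */
                    changed = 1;
                } else if (in_model[v] && partial_p_value(v, in_model) > SLSTAY) {
                    in_model[v] = 0;   /* backward step: drop a variable that no longer qualifies */
                    changed = 1;
                }
            }
        }
        printf("stepwise selection finished\n");
        return 0;
    }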

1. Results: 

The results of using SDA on the 16 independent variables listed in Appendix A are illustrated in Table 2. Table 2 first segregates the significant and non-significant independent variables in the model, then lists these variables in decreasing order of partial R² value. The R² values represent the one-way analysis of covariance between the selected variable and the remaining variables not chosen for the model.






VARIABLE       VARIABLE NOT     PARTIAL    F           PROB > F
SELECTED       SELECTED         R²         STATISTIC   STATISTIC

UNEMPCHG                        0.6217     32.316      0.0001
INDXGPDIPER                     0.2276      5.697      0.0017
EXPPERGDP                       0.2193      5.5151     0.0033
PCEPERCHG                       0.2079      4.986      0.0039
POPCHGPER                       0.1379      2.880      0.0443
GDPIPERCHG                      0.1092      2.125      0.1082
BOND10YR                        0.1066      2.228      0.0950
SAVPERDPI                       0.1054      2.0817     0.1138

               UNEMPCIV         0.0935      1.753      0.1679
               GDBTPERGDP       0.0837      1.553      0.2122
               IDXPCEPER        0.0377      0.665      0.5773
               DEFPEROUT        0.0312      0.548      0.6517
               PFDBTGDP         0.0252      0.439      0.7263
               SAVPERCHG        0.0190      0.329      0.8046
               CHGBDRATE        0.0172      0.298      0.8264
               GDEBTGDP         0.0070      0.120      0.9480

TABLE 2. Discriminant Analysis Results of the Data Set






The discriminant analysis shown in Table 2 indicates that eight of the sixteen attributes were significant and can be used to discriminate between the four categories of the dependent variable (GDP Growth = HPOS, LPOS, LNEG, HNEG).

Error rate estimation (probabilities of misclassification) measures the discriminant model's performance. As shown in Table 2, five of the eight variables included in the model had error estimates (Probability > F Statistic) of less than 5 per cent, while six of the eight nonselected variables had error estimates of greater than 50 per cent. Overall, the analysis shows a strong discriminative association in eight of the sixteen independent variables. These results will be used to test the robustness of the rules produced using NPSGP and the same LHS attributes.

B. IDIS 

As mentioned in Chapter III, IDIS uses a search algorithm blending statistical analysis with inductive heuristic learning. This algorithm is fixed within the program, and search flexibility comes from setting the search control parameters shown in Appendix B. These parameters allow IDIS to focus on generating rules from either narrow or broad search patterns. For example, liberalizing the program control parameters (confidence factor, error margin, data range generality and minimum required records per rule) will produce rules with multiple LHS attributes and broad data value ranges. These rules may be too broad to be "interesting" and require further refinement. To focus these rules, control parameters are narrowed to reduce the selected LHS attribute range values, that is, to reduce attribute and range generalization. On the other hand, IDIS can also search for specific dependent variable categories. However, this may come at the expense of rule generalization and error estimation, particularly if the category represents only a small portion of the data set.

Initially, a broad approach was used by selecting control parameters so that IDIS would search for a wide variety of rules linking LHS attributes to RHS categories. For example, control parameters were set at a certainty factor of 80 per cent, a 25 per cent error margin, a minimum of 6 records per rule, and 100 per cent range generalization, and the dependent variable was set to produce rules for any of the RHS dependent categories (HPOS, LPOS, LNEG, HNEG). However, these control parameters were too broad and only produced rules for HPOS. To search for rules in the other RHS categories, the control parameters were subsequently relaxed to include additional data elements. In addition, the dependent variable categories were focused on LPOS, LNEG and HNEG. The results of these searches are discussed below.






1. IDIS Results 

Initially IDIS produced rules including every LHS attribute, either singularly or in combination with other LHS attributes, but for only one dependent variable category, GDP GROWTH = HPOS. IDIS produced rules for GDP GROWTH = LPOS and LNEG only after restricting the RHS search to only those categories. The inductive rules for these categories are not as "good" as those for HPOS. Further, it was necessary to reduce the minimum record constraint to 3 and increase the margin of error to 50 per cent before the system produced any rules at all. The only LPOS rule contained three attributes; this indicates that IDIS could not find a single- or dual-attribute rule for LPOS. This rule is shown below:



GDP GROWTH = LPOS
IF
    0.3 <= GDPIPERCHG <= 7.6
AND
    0.011 <= POPCHGPER <= 0.0208
AND
    -0.1843 <= GDBTPERGDP <= -0.0039

CONFIDENCE FACTOR = 66.67 %
MARGIN OF ERROR = 36.4 %
NUMBER OF APPLICABLE RECORDS = 9






The only rule produced where GDP GROWTH was LNEG follows:

GDP GROWTH = LNEG
IF
    1.1 <= UNEMPCHG <= 2.9

CONFIDENCE FACTOR = 63.64 %
MARGIN OF ERROR = 33.0 %
NUMBER OF APPLICABLE RECORDS = 11



These results imply that a fixed search algorithm 
may not identify all potential rules in the RHS categories. 
The IDIS user may need to force the search for different RHS 
categories. As a tradeoff to finding these rules the user 
may be required to accept lower confidence factors and 
higher error estimates. 

C. NPSGP 

NPSGP uses a random and evolutionary approach (through adaptive competition). There is no fixed search pattern built into the initial rule generation process. Unlike traditional methods, which sometimes presume a specific structure (i.e., linear compensatory or correlated attributes), NPSGP starts off with a set of randomly selected rules [Ref. 3:p. 219]. Structural bias may be introduced as part of the fitness function that discriminates between these rules in later generations. NPSGP offers flexibility to focus the data mining because it is easy to change the fitness function. The fitness function's effect on rule selection, along with the NPSGP program parameters used in this research, is discussed further in the following sections.

1. NPSGP Parameters 

For the examples discussed in this chapter, NPSGP control parameters were set to the following values:

1. Population size: 5,000.

2. Crossover rate: 75 per cent.

3. Random generator seed value: 4.

4. Function Set: originally limited to one conjunction (AND) and later modified to include an arbitrary number of logical operators (AND, OR and NOT).

Population was set at 5,000 to limit the CPU time and memory used in the analysis. As indicated in Chapter V, crossover had a small effect on GP rule generation; an arbitrary crossover value of 75 per cent was assigned. The random seed value initializes the random number generator used in selecting the initial rules. Changing the random generator seed number has a small effect on the types and ranges of the rules generated; however, these differences are expected to be normalized over a number of runs. In addition to the above parameters, NPSGP also allows the user to select the number of rules, N, it shows at the end of each generation. These "top N" rules are listed in descending order of computed fitness value. To limit the rule review process, we set the list to show only the top 30 rules for each generation.
2. Fitness Functions 

This section illustrates how changing the fitness function can refocus rule discovery. Ultimately, this leads to different knowledge associations (rules) than either the statistical discriminant approach or the single-search-function deductive/inductive IDIS approach produces.

A variety of fitness functions were tested in NPSGP, most yielding different sets of rules. To illustrate the impact that changing the fitness function has on the rules generated, the following two fitness functions were selected:



Fitness = 0.8 - A / (A + B + 1)                         (1)

Fitness = (0.25B + 0.5C + 1) / (2A + 0.5B + C + 1)      (2)



The intent of fitness function (1) was to force the system to generate rules that are correct approximately 80 per cent of the time. Since NPSGP minimizes the fitness value, subtracting the calculated value of A / (A + B) from 0.8 (80 per cent) will drive the value of the confidence equation, A / (A + B), toward 80 per cent. Therefore, fitness function (1) will be optimized when A / (A + B) approaches 0.8 (assuming it is <= 0.8). When A / (A + B) is greater than 0.8, the fitness is set to an arbitrarily high number so that the system will avoid such solutions.

Fitness function (2) uses a weighted approach that assigns relative importance to the different observations, i.e., A, B, C and D. This approach demonstrates that the rule search can be focused for a specific mix of observations, increasing rule exactness or generality as desired by the user. For example, when the aim is to produce exact rules, observation A (HIT-HIT) receives a greater weight than the other observations. To broaden the rule applicability to the RHS, observation B (MISS-HIT) is assigned greater numeric weight. To produce LHS rule generality, observation C (HIT-MISS) is given more weight.

Fitness function (2) uses this weighted approach to introduce more LHS and RHS generality into the rules. This is accomplished by emphasizing A (HIT-HIT), limiting the emphasis on C (HIT-MISS), de-emphasizing B (MISS-HIT) and ignoring D (MISS-MISS). The following sections apply these two fitness functions to the data set.
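The C sketch below shows one way the two fitness functions could be computed from the observation counts. Because the printed formulas above are partly garbled in the original, the exact forms (and the arbitrarily high penalty value) should be read as reconstructions rather than as the NPSGP code:

    #include <stdio.h>

    #define PENALTY 1000.0   /* assumed "arbitrarily high" fitness for over-confident rules */

    /* Fitness function (1): drive rule confidence, A / (A + B), toward 80 per cent. Assumes A + B > 0. */
    static double fitness1(double A, double B)
    {
        if (A / (A + B) > 0.8)
            return PENALTY;              /* avoid rules that are more than 80 per cent correct */
        return 0.8 - A / (A + B + 1.0);  /* minimized as confidence approaches 0.8             */
    }

    /* Fitness function (2): weighted mix of HIT-HIT (A), MISS-HIT (B) and HIT-MISS (C). */
    static double fitness2(double A, double B, double C)
    {
        return (0.25 * B + 0.5 * C + 1.0) / (2.0 * A + 0.5 * B + C + 1.0);
    }

    int main(void)
    {
        printf("fitness1 = %f\n", fitness1(4.0, 1.0));   /* confidence 0.8 -> 0.8 - 4/6 = 0.1333 */
        printf("fitness2 = %f\n", fitness2(7.0, 0.0, 3.0));
        return 0;
    }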






c. Example C:

The following rule is representative of the rules produced with fitness function (2) without the singular conjunction restraint. Again, rules including the other categories (HPOS, LNEG, HNEG) were not among the top 30 rules.

1. Generation 30, Rule 1:

GDP GROWTH = LPOS
IF
    3.99 < UNEMPCIV < 5.84
AND
    -1.55 < SAVPERCHG < 0.77
AND
    -2.10 < PCEPERCHG < 2.36

FITNESS = 0.106061
CONFIDENCE = 100.00 %
APPLICABLE RECORDS = 7

4. Function Set Results

In the above examples, rule search patterns have focused on simple rule associations using one conjunction (AND), both for IDIS and NPSGP. Knowledge about complex data associations is not adequately represented using only singular (AND) conjunctions. Rules representing complex associations between data elements can be improved by widening the choice of logical operators to include OR and NOT. For example, the logical operator "OR" provides rules where there is a choice between equivalent attributes (IF X or Y THEN Z). The operator "NOT" is used to depict exceptions in the rule (IF X and NOT Z, THEN Y).
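The short C sketch below illustrates how widening the function set changes rule evaluation; the twixt() helper follows the same assumed semantics as earlier, and the bounds and attribute values are invented for illustration:

    #include <stdio.h>

    static int twixt(double a, double x, double b)   /* x strictly between the two bounds */
    {
        double lo = (a < b) ? a : b;
        double hi = (a < b) ? b : a;
        return (lo < x) && (x < hi);
    }

    int main(void)
    {
        double x = 1.0, y = 4.5;   /* hypothetical attribute values */

        /* AND only: both range tests must hold.              */
        int and_rule = twixt(2.0, x, -2.0) && twixt(6.0, y, 3.0);

        /* OR admits equivalent attributes: IF X or Y THEN Z. */
        int or_rule  = twixt(2.0, x, -2.0) || twixt(6.0, y, 3.0);

        /* NOT expresses an exception: IF X and NOT Z, THEN Y.*/
        int not_rule = twixt(2.0, x, -2.0) && !twixt(10.0, y, 5.0);

        printf("AND=%d OR=%d NOT=%d\n", and_rule, or_rule, not_rule);
        return 0;
    }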






To illustrate, the program using function set (1) was modified to permit unlimited combinations of logical operators (AND, OR, NOT). NPSGP was able to produce rules using all three logical operators. Example D demonstrates this additional flexibility; it provides a sample of the rules NPSGP produced using multiple logical operators. IDIS does not provide this ability.

a. Example D:

1. Generation 4, Rule 1:

GDP GROWTH = LPOS
IF
    1.84 < CHGBDRATE < 5.69
AND
    -1.56 < GDPIPERCHG < 2.23
OR
    3.98 < UNEMPCIV < 5.41
OR NOT
    0.49 < UNEMPCIV < 15.99
AND
    -0.40 < GDPIPERCHG < 4.116

FITNESS = 0.02222
CONFIDENCE = 75.00 %
APPLICABLE RECORDS = 8

D. COMPARISON OF RESULTS

Comparing the data element associations developed by the three different approaches brought out two significant differences. First, the ability to focus the search algorithm (i.e., the "fitness function") to force the search pattern toward exact or general patterns is the most important component of any knowledge discovery program. If the algorithm cannot be focused, the knowledge discovery program is locked into the fixed search pattern defined by that algorithm.

With IDIS, control parameters used in conjunction with the search algorithm can be changed. However, this does not change the fixed search algorithm itself. This limitation made it difficult for IDIS to find rules for the RHS categories LPOS, LNEG and HNEG. NPSGP was able to discover some rules in these categories using fitness function (1).

Additionally, the discriminant analysis function was limited to statistical association tests, such as the F test, partial R² correlation and Wilks' Lambda tests, to find data element associations between dependent and independent variables. This stepwise discriminant analysis eliminated some interesting attributes from further analysis. These attributes were used to produce rules in both IDIS and NPSGP. For example, NPSGP developed rules using the eight significant variables in the discriminant analysis. It also developed rules with a few of the variables classified as non-significant (e.g., Example C included SAVPERCHG, Example B included GDEBTGDP and Example D included CHGBDRATE).

The second difference is that including additional logical operators (NOT and OR) increases robustness in rule generation. It provides both exceptions to the rule and/or equivalent attributes within the same rule. Due to its limited rule structure, IDIS does not include additional logical operators; it is limited to the single conjunctive "AND." NPSGP's flexibility to include "OR" and "NOT" operations leads to more complex rules and associations that allow deeper data mining with complex data element associations.






VII CONCLUSIONS AND RECOMMENDATIONS 

A. CONCLUSIONS 

The focus of this research is to compare several methods of knowledge discovery according to their ability to identify associations in data elements. One of the methods chosen for this purpose was a statistical approach, stepwise discriminant analysis, which identifies independent variables that statistically discriminate between the dependent variable categories. The second method, IDIS, combines statistical and inductive methods and represents associations between data elements by production rules. The last method is a genetic program that evolves rules through competitive adaptation.

This analysis compared the structural characteristics of the rules generated, rather than their usefulness. It should be noted that the definition of usefulness is very subjective and depends heavily on the purpose for which the rules were generated. The test data set used in this study combined Gross Domestic Product information and economic indicators.

Several findings resulted from the analysis. First, the NPS GP model was able to develop rules involving more RHS categories than IDIS. In one GP program run (Example A in Chapter VI), rules were produced for all four dependent variable categories. IDIS produced an abundance of rules using all attributes for HPOS. However, IDIS was unable to produce any rules for HNEG. Relaxing IDIS's constraints (control parameters) to very low confidence levels and high error estimates produced only one rule, for LPOS. NPSGP also found rules with more LHS attributes. Furthermore, while the stepwise discriminant analysis found eight statistically significant LHS attributes, NPSGP found rules involving three additional LHS attributes: SAVPERCHG, GDEBTGDP and CHGBDRATE.

Second, NPSGP demonstrated that crossover had a limited effect on the GP's rule generating capabilities for small quantitative data sets (16 attributes by 63 instances). Population size was more important in GP generated rules, as measured by the fitness function. Population size primarily affected the selection and range of the attributes chosen. Third, the fitness function (search algorithm) is the most important parameter used in knowledge discovery programs. It sets the search focus for selecting independent and dependent variables. The fitness function is used as a filter to select rules that fit into a particular schema, e.g., exact or general rules. IDIS and stepwise discriminant analysis are locked into a fixed search pattern. The flexibility to change the fitness function (search algorithm) gives the NPSGP program a significant advantage.

Finally, the ability to expand the function set by adding logical operators such as "OR" and "NOT" gives NPSGP the potential to handle complex rule associations not covered by IDIS or statistical methods.

B. FUTURE RESEARCH 

Recent developments in genetic programming offer great potential for knowledge discovery. GP is not restricted to the a priori deductive and inductive paradigms that limit the scope of knowledge discovery, and it offers a way to cut through the limitations of traditional approaches. NPSGP has shown that genetic programming is a viable approach to knowledge discovery, but the program is still in its infancy. The following recommendations are offered to help develop NPSGP into a mature and robust system with greater applicability to a diverse range of knowledge discovery applications.

1. Fitness Function Applications: 

This study demonstrated that the fitness function plays a critical role in determining the scope and depth of the rules generated. Comparative research is needed to determine the optimal type(s) of fitness measures to use in GP. Fitness functions should be tailored to the different data schema, e.g., quantitative, scalar, combinations of scalar and non-scalar, etc. This would enhance GP capabilities. For example, a weighted fitness measure ((Bi + Wi) / Ai, where Wi represents the number of records in each category i divided by the total number of records) was successful in focusing the rules toward the smallest category in the data set, GDP GROWTH = HNEG.

Additionally, this study used a relatively simple database of quantitative data to develop and test NPSGP. Other, more complex data sets may require analysis based on fitness functions using non-linear logarithmic or geometric relationships between data elements, attributes or dependent variables.

Finally, using parallel GP programs with multiple 
fitness measures for the same data search may expedite the 
knowledge discovery process. 

2. Different Control Parameters 

Because NPSGP is in the initial testing stage, a complete diagnosis of its capabilities was not feasible. Running the GP with multiple data sets and experimenting with the GP control parameters would be very productive. At a minimum, the following control parameters should be studied to determine their potential impact on rule generation:

a. Mutation vice crossover, or both.

b. Tournament size.

c. Minimum population size as a factor of data size, and the use of an extremely large population size (100,000 and above).

3. Develop Measurements of Rule Fit 

Another potential topic of research is to develop a fitness measure that indicates how well the LHS values "fit" the range of data element values. For example, computing standard deviations for quantitative data and then comparing these against the range of LHS values in the rule may indicate how tightly the rule value ranges fit the data set. This measure may indicate how narrow or broad the rule fits the range of attribute values, allowing another quantitative measure of the quality of the rule.
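A minimal C sketch of such a measure is shown below, assuming a simple ratio of the rule's LHS range width to the attribute's standard deviation; the attribute values are invented and the ratio itself is only one possible formulation of the proposed idea:

    #include <stdio.h>
    #include <math.h>

    static double std_dev(const double *x, int n)
    {
        double mean = 0.0, ss = 0.0;
        for (int i = 0; i < n; i++) mean += x[i];
        mean /= n;
        for (int i = 0; i < n; i++) ss += (x[i] - mean) * (x[i] - mean);
        return sqrt(ss / (n - 1));               /* sample standard deviation */
    }

    int main(void)
    {
        double unempciv[] = { 3.9, 4.2, 5.8, 6.7, 5.3, 4.8 };   /* illustrative data only */
        int n = sizeof unempciv / sizeof unempciv[0];

        double rule_lo = 3.99, rule_hi = 5.84;   /* LHS bounds from Example C */

        /* Small ratios suggest a tightly fitted range; large ratios a broad one. */
        double fit_ratio = (rule_hi - rule_lo) / std_dev(unempciv, n);
        printf("range width / standard deviation = %f\n", fit_ratio);
        return 0;
    }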

4. Database Objects 

The ability to define virtual data attributes offers great potential for knowledge discovery systems. Defining new objects by combining attributes and/or specifying ranges for attribute values focuses the knowledge discovery system on multiple dependent variables packaged into a "virtual" attribute. For example, a virtual attribute EQUIPMENT CASUALTY may be used to define an equipment failure as a combination of the attributes OPERATIONAL, INSTALLATION CODE, CRITICAL MISSION CODE, etc. If the equipment is non-operational, installed on the aircraft and critical to a mission area, there is an instance of EQUIPMENT CASUALTY.
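A minimal C sketch of such a virtual attribute follows; the field names and codes are hypothetical illustrations of the EQUIPMENT CASUALTY example, not an existing DOD data schema:

    #include <stdio.h>

    typedef struct {
        int operational;        /* 0 = non-operational, 1 = operational */
        int installed;          /* 1 = installed on the aircraft        */
        int mission_critical;   /* 1 = critical to a mission area       */
    } MaintenanceRecord;

    /* The virtual attribute packages several existing attributes into one flag. */
    static int equipment_casualty(const MaintenanceRecord *r)
    {
        return !r->operational && r->installed && r->mission_critical;
    }

    int main(void)
    {
        MaintenanceRecord r = { 0, 1, 1 };
        printf("EQUIPMENT CASUALTY = %d\n", equipment_casualty(&r));
        return 0;
    }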






This "virtual" attribute would help focus the knowledge 
discovery search on a predefined set of attribute values. 

C. FUTURE DOD APPLICATIONS 

The future of knowledge discovery systems lies in their ability to produce meaningful associations from data elements in existing databases. This research has demonstrated that new knowledge discovery systems involving genetic programs, such as NPSGP, offer significant potential for knowledge discovery.

As the largest single organization in the world, the Department of Defense (DOD) has an abundance of databases that could be analyzed using the knowledge discovery techniques outlined in this thesis. The range and use of these databases are as varied as in the commercial sector. IDIS, for example, is already being used by the U.S. Army and Air Force Exchange systems. Logistic and maintenance system applications are but two additional future targets for knowledge discovery systems. The potential benefit of applying GP knowledge discovery systems to DOD databases is conceptually large, offering insights into undiscovered data relationships that may have operational and financial implications.






APPENDIX A TEST DATA 

Test data used in evaluating the statistical, IDIS and 
NPSGP programs consisted of the following econometric data: 

1. YEAR: The corresponding year for the data. 

2. GDP GROWTH: As explained in Chapter IV, GDP Growth was assigned a discrete value in one of four categories: high positive (HPOS or 0), low positive (LPOS or 1), low negative (LNEG or 2), and high negative (HNEG or 3). These categories were determined using the growth (decline) in Indexed GDP from the previous year. Indexed GDP was taken from [Ref. 16:pp. 341-474] and represents an implicit price deflator (1987 = 100). HPOS corresponds to growth greater than 2.5 per cent, LPOS represents growth from 0 to 2.5 per cent, LNEG represents growth from -0.001 to -2.5 per cent, and HNEG represents any decline in GDP greater than -2.5 per cent. (A sketch of this categorization appears after this list.)

3. IDXPCEPER: Indexed Personal Consumption (1987 = 100) as 
a percentage of GDP. 

4. PCEPERCHG: The change in indexed Personal Consumption 
Expenditures from the previous year. 

5. INDXGPDIPER: Indexed Disposable Income as a percentage of GDP.

6. GDPIPERCHG: The change in Personal Income as a per cent 
of GDP from the previous year. 

7. EXPPERGDP: Exports as a per cent of GDP. 

8. SAVPERDPI: Personal Savings as a per cent of Disposable Personal Income.

9. SAVPERCHG: The change in Personal Savings from the 
previous year. 

10. POPCHGPER: The percentage change in U.S. population 
from the previous year. 






11. BOND10YR: The rate on 10 year U.S. Treasury Bonds, or 
equivalent. Years 1930 - 1932 and 1934 - 1938 were 
estimates. 

12. CHGBDRATE: The change in the 10 year U.S. Treasury 
bond rate, or equivalent, from the previous year. 

13. UNEMPCHG: The percentage change in civilian U.S. 
unemployment from the previous year. 

14. UNEMPCIV: U.S. civilian unemployment. 

15. DEFPEROUT: The U.S. deficit as a percentage of U.S. government outlays.

16. GDEBTGDP: U.S. government gross debt as a per cent of 
GDP. 

17. PFDBTGDP: U.S. government public debt as a per cent of 
GDP. 

18. GDBTPERGDP: The percentage change in U.S. 
government gross debt as a per cent of GDP from the 
previous year. 
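The sketch below, in C, shows the GDP GROWTH categorization described in item 2; the handling of values falling exactly on the 2.5 and -2.5 per cent boundaries is assumed, since the text does not specify it:

    #include <stdio.h>

    /* Returns the numeric category code used in the rules: 0=HPOS, 1=LPOS, 2=LNEG, 3=HNEG. */
    static int gdp_growth_category(double growth_per_cent)
    {
        if (growth_per_cent > 2.5)    return 0;   /* HPOS: growth greater than 2.5 per cent */
        if (growth_per_cent >= 0.0)   return 1;   /* LPOS: growth from 0 to 2.5 per cent    */
        if (growth_per_cent >= -2.5)  return 2;   /* LNEG: decline from -0.001 to -2.5      */
        return 3;                                 /* HNEG: decline greater than -2.5        */
    }

    int main(void)
    {
        printf("3.1 -> %d, 1.0 -> %d, -1.2 -> %d, -4.0 -> %d\n",
               gdp_growth_category(3.1), gdp_growth_category(1.0),
               gdp_growth_category(-1.2), gdp_growth_category(-4.0));
        return 0;
    }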









APPENDIX B IDIS PARAMETERS 

IDIS provides adjustable control parameters permitting 
the user to focus rule search toward specified goals. These 
parameters are listed below with brief explanations. 

1. SAMPLING PERCENTAGE: This parameter is used to set the 
proportion of the database to be used in knowledge 
discovery. 

2. INTEREST LEVEL: This parameter is used to set the interest level of the independent variables. INTEREST LEVEL sets priorities on the LHS attributes selected in the rule search; the higher the setting, the higher the user's interest in that attribute.

3. MAXIMUM LENGTH OF RULE: This parameter sets the maximum number of LHS attributes that appear in the rule (i.e., the number of combined LHS attributes the generated rule should not exceed).

4. MAXIMUM MARGIN OF ERROR: The MARGIN OF ERROR calculated in rule production gives the range of estimated error of the rule's computed confidence level. As a control parameter setting, MARGIN OF ERROR expresses the user's tolerance for error in the estimate of the confidence level. Calculation of this parameter is proprietary and has not been disclosed by IntelligenceWare, Inc., the developer of IDIS.

5. MINIMUM RULE CONFIDENCE: RULE CONFIDENCE is calculated in IDIS by dividing HIT-HIT observations by HIT-HIT plus MISS-HIT observations. As a parameter, RULE CONFIDENCE sets the desired rule generality by specifying the minimum acceptable ratio of observations that fall into this calculated range.

6. MINIMUM PER CENT OF DATABASE FOR RULE FORMATION: This 
parameter specifies the minimum percentage of LHS 
instances, as a per cent of total instances, that must 
be contained in the rule. It is the number of records 
in the data set, as a percentage of total records, that 
are included in the rule. 






7. MINIMUM NUMBER OF RECORDS USED FOR RULE FORMATION: This 
parameter is the integer equivalent of the MINIMUM PER 
CENT OF DATABASE FOR RULE FORMATION. 

8. MAXIMUM GENERALITY: This parameter specifies the largest scalar range of the LHS attributes permissible for rule generation.






APPENDIX C NPSGP USERS GUIDE 

The following lists a set of procedures that can be used 
to execute NPSGP. This listing contains a minimum set of 
user guidelines required to run the program using a UNIX 
workstation. 

1. Load the data set as an ASCII file. Delete all irrelevant data and ensure the dependent variable is defined in the first column. Save the file as a text file with the name econ.tab.

2. Convert the ASCII file to a "C" program structured file using the command:

    perl define.pl -15 econ.tab

NOTE: "-15" sets the lag time for the data; the default value if none is entered is 5.

3. Recompile the program by typing the "make" command.

4. The NPSGP program is executed using the following commands:

   a. To start a new program:
      gpc 1 50 none 4

   b. To restart a program from a failure checkpoint:
      gpc -r 1 5 none 4

NOTE: "-r" specifies the restart from a checkpoint;
      "50" specifies the total number of generations to run;
      "4" represents the seed number.

5. Changing the default settings of population size, crossover rates, random seed, tournament selection, etc. can be done through the program module "Default.in". Changing these parameters does not require recompilation of NPSGP.






6. Changing the fitness function can be done in the program module "Fitness.C". NOTE: Changing this module REQUIRES recompilation of NPSGP.






LIST OF REFERENCES 



(1) Frawley, William J., Piatetsky-Shapiro, Gregory, and Matheus, Christopher J., "Knowledge Discovery in Databases: An Overview," Knowledge Discovery in Databases, 1991.

(2) Parsaye, Kamran, Chignell, Mark, Khoshafian, Setrag, and Wong, Harry, Intelligent Databases: Object-Oriented, Deductive Hypermedia Technologies, 1989.

(3) Green, D. P. and Smith, S. F., "A Genetic System for Learning Models of Consumer Choice," Proceedings of the Second International Conference on Genetic Algorithms, MIT, Cambridge; Hillsdale, NJ: Lawrence Erlbaum, 1987.

(4) Winkler, R. L. and Hays, W. L., Statistics: Probability, Inference, and Decision, Second Edition, Holt, Rinehart and Winston, 1975.

(5) Weiss, Neil A. and Hassett, Matthew J., Introductory Statistics, Third Edition, 1991.

(6) SAS Institute Inc., SAS/STAT User's Guide, Release 6.03 Edition, Cary, NC: SAS Institute Inc., 1988.

(7) Messier, William F., Jr. and Hansen, James V., "Inducing Rules for Expert System Development: An Example Using Default and Bankruptcy Data," Management Science.

(8) Bundy, Alan, Silver, Bernard, and Plummer, Dave, "An Analytical Comparison of Some Rule-Learning Programs," Artificial Intelligence, April 1985.

(9) Parsaye, Kamran and Chignell, Mark, Intelligent Database Tools & Applications: Hyperinformation Access, Data Quality, Visualization, Automatic Discovery, John Wiley and Sons, Inc., 1993.

(10) Parsaye, Kamran, What can IXL do that Statistics cannot?, IntelligenceWare, 1990.

(11) Koza, John R., Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, 1992.

(12) Liepins, G. E. and Hilliard, M. R., "Genetic Algorithms: Foundations and Applications," Annals of Operations Research, 1989.

(13) Burtka, Michael, "Genetic Algorithms," The Stern Information Systems Review, Stern School of Business, New York University, Spring 1993.

(14) DeJong, Kenneth A., "Genetic-Algorithm-Based Learning," Machine Learning, Volume III, Navy Research Laboratory, 1990.

(15) Nissen, Volker, Evolutionary Algorithms in Management Science, Papers on Economics and Evolution No. 9303, July 1993.

(16) Council of Economic Advisers, Economic Report of the President, January 1993, United States Government Printing Office, Washington, 1993.






INITIAL DISTRIBUTION LIST 

1. Defense Technical Information Center 
Cameron Station 

Alexandria, Virginia 22304-6145 

2. Library, Code 52 

Naval Postgraduate School 
Monterey, California 93943-5002 

3. Mohammed A. Al-Mahmood 
P.O.Box 5998 

Manama, State of Bahrain 

4. Steven L. Smith 

12097 Winona Drive

Lake Ridge, Virginia 22192 

5. Balasubramaniam Ramesh, Code RA 
Naval Postgraduate School 
Monterey, California 93943-5002 

6. William Gates, Code GT 
Naval Postgraduate School 
Monterey, California 93943-5002 





