DUDLEY KNOX LIBRARY NAVAL POSTGRADUATE SCHOOL MONTEREY CA 93943-5101 Approved tor public release; distribution is unlimited. KNOWLEDGE DISCOVERY USING GENETIC PROGRAMMING by Steven L. Smith Lieutenant, United States Navy MB. A., University of California , Los Angeles. 147b and Mohammed A. Al-Mahmood Captain, Bahrain Defense Force B.S.E., University of Petroleum And Minerals, Saudi Arabia, 1985 Submitted in partial fulfillment oi the requirements for the degree of MASTER OF SCIENCE IN INFORMATION TECHNOLOGY MANAGEMENT from the NAVAL POSTGRADUATE SCHOOL December 1993 / REPORT DOCUMENTATION PAGE Form Approved OMB No. 0704 Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188) Washington DC 20503. 1. AGENCY USE ONLY (Leave blank) 2. REPORT DATE 16 December 1993 3. REPORT TYPE AND DATES COVERED Master's Thesis TITLE AND SUBTITLE KNOWLEDGE DISCOVERY USING GENETIC PROGRAMMING 6. AUTHOR(S) AJ-Mahmood, Mohammed A. and Smith, Steven L. 5. FUNDING NUMBERS 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) Naval Postgraduate School, Monterey CA 93943-5000 PERFORMING ORGANIZATION REPORT NUMBER 9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSORING/MONITORING AGENCY REPORT NUMBER li. SUPPLEMENTARY NOTES The views expressed in this thesis are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government. 12a. DISTRIBUTION/AVAILABILITY STATEMENT Approved for public release; distribution is unlimited. 12b. DISTRIBUTION CODE *A 13. ABSTRACT Dramatic growth in database technology has outpaced the ability to analyze the information stored in databases for new knowledge and has created an increasing potential for the loss of undiscovered knowledge. This potential gains for such knowledge discovery are particularly large in the Department of Defense where millions of transactions, from maintenance to medical information, are recorded yearly. Due to the limitations of traditional knowledge discovery methods in analyzing this data, there is a growing need to utilize new knowledge discovery methods to glean knowledge from vast databases. This research compares a new knowledge discovery approach using a genetic program (GP) developed at the Naval Postgraduate School that produces data associations expressed as IF X THEN Y rules. In determining validity of this GP approach, the program is compared to traditional statistical and inductive methods of knowledge discovery. Results of this comparison indicate the viability of using a GP approach in knowledge discovery by three findings. First, the GP discovered interesting patterns from the data set. Second, the GP discovered new relationships not uncovered by the traditional methods. Third, the GP demonstrated a greater ability to focus the knowledge discovery search towards particular relationships, such as producing exact or general rules. 14. SUBJECT TERMS Genetic Programming. Knowledge Discovery, Datamining, IDIS, Genetic Algorithms, Inductive Learning, Inductive Rules. 15. NUMBER OF PAGES 85 16. PRICE CODE 17. SECURITY CLASSIFI- CATION OF REPORT Unclassified SECURITY CLASSIFI- CATION OF THIS PAGE Unclassified 19. 20. SECURITY CLASSIFI- CATION OF ABSTRACT Unclassified LIMITATION OF ABSTRACT UL NSN 7540-01-280-5500 Standard Form 298 (Rev. 2-89) Prescribed by ANSI Std. 239-18 ABSTRACT Dramatic growth in database technology has outpaced the ability to analyze the information stored in databases for new knowledge and has created an increasing potential for the loss of undiscovered knowledge. This potential gains for such knowledge discovery are particularly large in the Department of Defense where millions of transactions, from maintenance to medical information, are recorded yearly. Due to the limitations of traditional knowledge discovery methods in analyzing this data, there is a growing need to utilize new knowledge discovery methods to glean knowledge from vast databases. This research compares a new knowledge discovery approach using a genetic program (GP) developed at the Naval Postgraduate School that produces data associations expressed as IF X THEN Y rules. In determining validity of this GP approach, the program is compared to traditional statistical and inductive methods of knowledge discovery. Results of this comparison indicate the viability of using a GP approach in knowledge discovery by three findings. First, the GP discovered interesting patterns from the data set. Second, the GP discovered new relationships not uncovered by the traditional methods. Third, the GP demonstrated a greater ability to focus the knowledge discover)' search towards particular relationships, such as producing exact or general rules. 111 6 J TABLE OF CONTENTS I INTRODUCTION 1 A. BACKGROUND 1 B. OBJECTIVES 3 C. RESEARCH METHODS 3 D. THESIS ORGANIZATION 4 II KNOWLEDGE DISCOVERY 5 A. BACKGROUND 5 B. WHAT IS KNOWLEDGE DISCOVERY? 6 1. Patterns 7 a. Pattern Representation 9 2. Certainty 10 3. Interesting 12 4. Efficiency 13 III EVOLUTION OF KNOWLEDGE DISCOVERY SYSTEMS .... 15 A. STATISTICAL APPROACHES 15 1. Statistical Association Methods 16 B. INDUCTIVE LEARNING 17 1. CLS and ID3 20 2. AQ Algorithms 21 3. IXL/IDIS 23 C. SUMMARY 24 iv "•* >->t^i\r\r\ i NAVAL POSTGRADUATE SCHOni MONTEREY CA 93943 5 fo, L IV GENETIC ALGORITHMS / GENETIC PROGRAMMING 2 6 A. GENETIC ALGORITHMS 2 6 B. SAMUEL: A RULE GENERATING GENETIC ALGORITHM . . 29 C. GENETIC PROGRAMMING 31 V NPS GENETIC PROGRAM (NPSGP) 33 A. BACKGROUND 33 1. Data Description 33 B. CONTROL PARAMETERS 35 1. Fitness Measure 35 2. Selection Criteria 37 3. Population and Generation 38 C. EVOLUTION OF RULES 3 8 D. AN EXAMPLE OF THE GP PROCESS 40 E. GP EVALUATION RUNS 44 F. SUMMARY 46 VI COMPARISON OF KNOWLEDGE DISCOVERY RESULTS .... 47 A. STATISTICAL ASSOCIATIONS 48 1. Results: 49 B. IDIS 51 1. IDIS Results 53 C. NPSGP 54 1. NPSGP Parameters 55 2. Fitness Functions 56 3. Fitness Function Results 58 v a. Example A: 58 b. Example B: 59 c. Example C: 60 4. Function Set Results 60 a. Example D: 61 D. COMPARISON OF RESULTS 61 VII CONCLUSIONS AND RECOMMENDATIONS 64 A. CONCLUSIONS 64 B. FUTURE RESEARCH 66 1. Fitness Function Applications: 66 2. Different Control Parameters 67 3. Develop Measurements of Rule Fit 68 4. Database Objects 68 C. FUTURE DOD APPLICATIONS 69 APPENDIX A TEST DATA 70 APPENDIX B IDIS PARAMETERS . . . / 72 APPENDIX C NPSGP USERS GUIDE 74 LIST OF REFERENCES 76 INITIAL DISTRIBUTION LIST 78 VI I INTRODUCTION A. BACKGROUND In his book POWERSHIFT, Alvin Toffler sees the acquisition of "knowledge" as the leverage behind the changing political and social structures in today's technologically oriented world. Industrial knowledge, for example, has been used to gain competitive advantage, allowing increased efficiency and the ability to underprice competitors. Similarly, the cornerstone of the United Nations Coalition success in Desert Storm was the absolute knowledge advantage demonstrated by a technological, tactical and training military superiority over Iraqi military forces. Historically the acquisition of knowledge has been a time and labor intensive undertaking. But leveraged by computer technology, we now live in an era where the velocity of social, economic and political change has accelerated dramatically. To keep pace and understand these changes, there is a real need to develop new and automated methods of discovering knowledge that circumvent the traditional time consuming methods. In the Department of Defense, for example, there are vast databases composed of millions of computerized files recording logistic and maintenance actions. More specifically, at Naval Sea Logistic Center (NAVSEALOGCEN) in Mechanicsburg Pa, the Navy's Material Maintenance Management (3M) system has recorded millions of shipboard maintenance actions in a vast database. These recorded transactions give detailed information regarding maintenance actions; such as the types of equipment and system failures; required parts and actions codes representing steps to correct the problem. Across the street on the same Navy base in Mechanicsburg, the Navy's Ships Parts Control Center (SPCC) has created a vast database of repair parts usage data. This database contains information on shipboard parts usage, including historical demand, manufacturer, quality assurance codes, etc. These databases offer a potential harvest of new knowledge for future applications of knowledge discovery systems. For example, by using both the 3M and SPCC databases, researchers could develop associations between the types of equipment failures, repair parts usage and mean times between failure to develop predicative associations for the next failure. Instead of waiting for equipment casualties, maintenance personnel could use this information to replace parts in a high risk category of failure. This thesis reviews different computer aided methods in the acquisition of knowledge. First, it discusses the more traditional methods of developing deductive and inductive associations between data leading to the discovery of knowledge. After discussing the limitations of these approaches, the thesis demonstrates that newer nontraditional methods of knowledge discovery, that is, genetic programs, can be used successfully as an alternative approach to knowledge discovery. B. OBJECTIVES The objective of this research is to determine whether Genetic Programming offers insight into relations among data elements not readily discovered by traditional methods. The ability to develop knowledge is represented by the capability to generate meaningful associations among elements of a database and express these relationships in terms of patterns expressed as rules. This thesis focuses on inductive methods of learning relationships among attributes from a data set and formalizes these relationships into heuristic rules. The objective is to compare the ability to produce rules from a database using genetic programing with the capability of with traditional statistical and inductive methods. C. RESEARCH METHODS The focus of the research is to produce rules from quantitative data using a genetic program developed at the Naval Postgraduate School (NPS) . The GP system initially \/ produces random data associations (rules) and then evolves these rules through an evolutionary adaption and selection process. The same data set was analyzed using traditional statistical methods and a commercial software program using deductive/inductiv- methods to develop rule associations. The three techniques focuses are compared according to the types of data associations they produce, rather than assessing the value of the information produced. D. THESIS ORGANIZATION This document is structured to cover a variety of key concepts prior to comparing the rules- generated from the NPS genetic program and other statistical/induction/deduction based methods. First, the general concepts behind knowledge discovery is discussed in Chapter II. The evolution of knowledge discovery systems is covered in Chapter III with a review of the apparent strengths and weaknesses of these traditional systems. Chapter IV discusses the field of genetic algorithms and programs. Chapter V reviews the GP developed at the Naval Postgraduate School for this research and Chapter VI lays out the results of the data analysis of the different approaches. Chapter VII concludes the research by comparing the rules generated and .offers suggestions for future applicability involving rule generating genetic programs . II KNOWLEDGE DISCOVERY A. BACKGROUND The recent proliferation of inexpensive computer memory has caused extensive growth in the size and amount of existing data storage capacity. One estimate states the amount of stored information in the world doubles about every two years [Ref . l:p. 2] . This vast storehouse of unanalyzed data in turn has created a surge in demand for automated methods of shifting through and compressing this raw data into meaningful information, i.e., useful "knowledge" . In addition to the dramatic increases in mass storage capacity of digitized information, there are three other major technologies in the field affecting the direction of information technology and the information infrastructure that supports the foundation for knowledge discovery: on line databases, networking and digitization of information [Ref. 2:p. 18] . On line databases are a major technological advancement playing a role in knowledge discovery. Historically, man's knowledge was archived in the form of books and drawings. Digitizing this information for accessibility offers tremendous advantages over textual information. On line databases represent a vast and vital storehouse of information that facilitates both the necessary functions of data access and retrieval. Networking, the connection of computer systems through local and wide area networks, also plays a vital role in developing future knowledge discovery systems. Openness, connectivity, data sharing and interoperability all share the same ability to transfer data to remote sites almost instantly. Finally, most of the current world's data is not in a digitized medium. The challenge is to translate vast amounts of accumulated knowledge into binary formats for future knowledge discovery. The current method for translating this historical information is through Optical Scanning. Although a relatively new technology, optical scanning represents a potential bridge between the written past and current data storage techniques. Before going further into the methods of knowledge discovery, the salient characteristics of knowledge will be reviewed to gain insight into the nature of knowledge vice unsubstantiated or meaningless associations. B. WHAT IS KNOWLEDGE DISCOVERY? Knowledge discovery is the nontrivial extraction of implicit, previous unknown, and potentially useful information from data. [Ref. l:p. 3] If this definition seems somewhat vague it hits at the core definition of knowledge. What is nontrivial to one is vital to another. What is unknown to one may be common knowledge to others. The problem is a lack of a singular defining concept of knowledge to determine if knowledge has actually been discovered. The approach used in this thesis is to envelope the concept of "knowledge" by defining it's attributes. The attributes that are used to surround the concept of knowledge are discussed in detail in the following sections. [Ref. l:p. 3]. 1. Patterns One way to look at patterns is as a collection or class of records sharing something in common . . . Discovering pattern classes is a problem of pattern identification or clustering. [Ref. l:p. 15] The first attribute of knowledge and knowledge discovery, is the concept of pattern recognition. Frawley, Piatetsky- Shapiro and Matheus point out two basic approaches to knowledge discovery through pattern recognition: traditional numeric methods and conceptual clustering [Ref. l:p. 15]. These methods of discovering "patterns" in databases use either statistical correlations or develop heuristic rules, depending on the type of knowledge the researcher is attempting to find. An example of statistical correlations is regression models, where the dependent variable (output) is calculated from a combination of independent (input) variable weights. Most regression techniques are parametric; they require the user to specify the functional form of the solution. If the underlying form of the function is unknown, parametric methods tend to break down. As an illustration, a sample database of financial information can be analyzed to determine the correct combination of attributes to maximize income, I a given constraints, C x .. .C n , this problem lends itself to a statistical optimizing approach. The results of this analysis are often in the following form: INCOME - a*Var x + b*Var 2 ... c*Var n - d*^...- d*C n (where INCOME is the dependent variable, a...d are constants, Va^ n are income independent variables and C-l n are constraint (cost) independent variables . ) Such approaches are termed compensatory. They are based upon the assumption that trade-offs between relevant attributes will maximize (or minimize) the overall evaluation [Ref . 3:p. 217]. There are, however, two basic concerns with this approach, despite evidence of strong predictive performance. First is the concern that compensatory methods are good models for developing data associations, particularly when dealing with non- compensatory associations. The second concern is that numerical coefficients provide limited insight into relationships among the other attributes [Ref. 3:p. 218]. Conceptual clustering, the alternative approach, works with both nominal and structured data. This is a definite advantage over the liner models when analyzing databases. Coupled with logical representations, such as production systems ( for example, if X, then Y) clustering offers representations that overcome the restrictive nature of compensatory statistics. For example, using the same financial database to identify inductive relationships results in clustering the data into groups and measuring the validity of these associations by hypothesis testing. From these methods heuristic rules are produced to explain data associations in production contexts: IF a l < Var 1 < SL : AND *>1 < c 2 < b 2 THEN INCOME = HIGH (INCOME is the dependent variable, Var 1 n and constraints C 1 n are the independent variables and a i 2 ' ^l 2 are' constants . ) a. Pattern Representation Once patterns have been recognized, they must be represented. The intent is to convey statistical associations to a variety of users who may or may not have a basic understanding of statistics. Therefore, these patterns should use standard representations, communicating complex associations. Logical representations are more natural than statistical representations for computation and can be used in natural language forms. Common logic representations include the production rules, mentioned earlier, relational patterns ( X > Y) and decision trees (equivalent to ordered lists of rules) . Because a combination of logical formats and natural language is easier to interpret than complex equations, this research focuses primarily on the inductive style of clustering patterns and presenting them in terms of production rules. The use of production system formalism is an important advantage since it is apparently consistent with human reasoning and therefore more easily understood. [Ref . 3:p. 218] . 2. Certainty Rarely is knowledge absolute, particularly when dealing with data containing inexact and missing elements. To overcome noisy and inexact data, knowledge researchers need quantitative measures indicating the level of trust they can place on the knowledge developed by the database discovery programs. Without the ability to attach a subjective level of faith or quantitative measures of confidence, patterns discovered from databases become merely suppositions. Therefore, they never achieve the status of knowledge . 10 Computing a measure of certainty involves not only the integrity of the data analyzed but also the data sample size. As a means of representing uncertainty, this requires the developing probabilistic information regarding generated rules. One common technique is to augment logical expressions with probabilistic weights indicating probabilities of success. Simple contingency tables can be used either to test predictive validity or lack of statistical association between categorical attributes when a rule agrees or disagrees with the data set. These contingency tables represent the ability of production rules to fit the data set using two components, dependent and independent variables. In terms of a production rules , independent variables are the "input" attributes and values that produce a rule associated with an "output" category of the dependent variable. Expressed in IF X THEN Y format, the dependent variables represent the right hand side (RHS) of the rule while the independent variables represent the left hand side (LHS) . For example: 11 Basic Contingency Table ("Hit" implies the rule conditions were meet in the database. ) (RHS) DEPENDENT VARIABLES HIT NOT HIT (LHS) INDEPENDENT HIT a b VARIABLES NOT HIT c d (Where a are data set and rule category agreements, b and c are errors of omission of either the rule category or data set, and d represents misses to both arguments) . Such contingency tables can measure trust in the rules by several methods. As an example, simple confidence factors, such as (a / (a + b) ) , can be used to indicate the accuracy of the rules in the database. Statistical confidence can be estimated for the parameters of the contingency table using binomial distributions. Contingency tables can also be used to develop frequencies of cross classification, joint and marginal probabilities. [Ref. 4:p. 175] . Binomial standard deviations can also be developed for the conditions in the contingency tables (HIT-HIT, NOT HIT-HIT, etc. ) . 3. Interesting Discovered patterns are meaningless if the information contained within these patterns is without relevancy to the user. Patterns must be interesting, that is, useful. "Knowledge is useful when it can achieve a goal of the system or the user." [Ref. l:p*. 4]. 12 To achieve usefulness, knowledge discovery methods must perform functions that filter trivial from non- trivial information based on the user's interest. But such a filtering in itself can be a double edged sword. A knowledge discovery system that incorporates functions for predefining attributes of interest can also de facto limit the scope of knowledge discovery. For example, a researcher may be interested in relationships between certain dependent and independent variables, to the exclusion of other variables in the database. Even though a researcher may be searching for patterns about variable X that involve variables Y and 'Z, the researcher may not know what those patterns look like. They may involve other variables such as A and B. The key to this dilemma is to provide the discovery system the ability to define "areas of interest" and still allow enough autonomy in mining to data for discover unanticipated patterns. 4. Efficiency There are several efficiency measures for a knowledge discovery system. A major -factor in efficiency is the degree of processing time required per unit of data analysis. Searching for knowledge is usually accomplished in large databases -requiring significant memory and CPU time. Mitigating this concern are the fact that large knowledge discovery programs, however, are not routine jobs and are 13 usually done in the background of other computer transactions . 14 Ill EVOLUTION OF KNOWLEDGE DISCOVERY SYSTEMS A. STATISTICAL APPROACHES Traditionally, relationships in data have been discovered through the statistical methods of deduction and inference. In statistical methods, the quantitative properties of information are categorized into either descriptive or inferential statistics [Ref. 5:p. 2]. Descriptive statistics is a method for organizing and summarizing information. For example, breaking down data into categories and developing percentages represents a descriptive approach (e.g., 1992 election was between three parties; republicans, democrats and independents who received 46, 38 and 12 per cent of the popular vote) . Inferential statistics is a method for developing conclusions regarding the data by analyzing a sample of the data. Inferential statistics relies on mathematical models of distribution, e.g., based upon a random sampling of 500 people, CNN newswire stated that 43 per cent of Americans approved of President Clinton's handling of events in Somalia. To effectively classify and generate interesting statements regarding analyzed data, knowledge discovery systems must employ both descriptive and inferential statistics . 15 1. Statistical Association Methods There are numerous approaches for developing associations through descriptive and inferential statistics. One is the multivariate linear regression models, as discussed in Chapter II B.l. Another is statistical cluster analysis, a form of conceptual clustering, places objects into groups (clusters) suggested by similarities in the data. Data is summarized in these clusters by similarities in the data characteristics. In cluster analysis, a priori knowledge of the data set is not required. Discriminate analysis is another approach that selects a subset of the quantitative independent variables that reveal differences among the dependent variable categories [Ref . 6:pp. 39-40] . The selection process is accomplished through statistical association tests, such as the F test, R 2 partial correlation and Wilkes Lambda tests. Discriminate functions are evaluated by estimating the probabilities of misclassifying the data [Ref. 6:pp. 909- 910] . However, discriminate analysis, requires a prior knowledge of the categories. Another statistical analysis method is the process of Reification [Ref. 2:p. 407]. Reification takes a set of database fields (objects) and collapses them into a subset of smaller object fields (called descriptors) that best describe the object's key attributes. Reification techniques include factor analysis, principle component analysis and 16 cluster analysis. As we shall see later in this chapter, developing the statistical attributes of database objects is a key component to Parsaye's knowledge discovery engines IXL (Induction with extremely Large databases) and IDIS (Information Discovery System) . As we have seen, statistical analysis has traditionally laid the foundation for Knowledge Discovery. The ability to classify (descriptive statistics) and formulate conclusions based upon sampling the data (inferential statistics) has been an important first step in knowledge discovery techniques. Further, statistical analysis is also crucial in measuring the quality of discovered knowledge. Parsaye elaborates on the need to extend the statistical approach in two basic ways. First, because statistics is a quantitative approach, output results come in the form of mathematical functions or equations. Since the average user of a database mining system is not a statistician, these complex representations must be converted into formats the user understands. Second, the statistical approach should be transparently built into the discovery system so that the user need not be a statistical expert. B. INDUCTIVE LEARNING Inductive learning is defined as the ability to describe a class from a review or analysis of the individual objects 17 in that class [Ref. 2:p. 404]. Inductive inference is the basis of inductive learning and is the process of generalizing assertions from specific observations about objects in classes of data. The inductive process takes an initial inductive hypothesis and develops assertions (rules) that can account for the observations of objects within the classes of objects. There are two basic approaches to the inductive rule formulation, data-driven and model -driven. Model -driven methods expresses an a priori model of the data, in terms of production rules, and then tests these models against an actual data set. In data-driven methods, the data is first analyzed and then inductive models are developed to define the data set. The latter approach, data-driven models, is used for knowledge discovery in this research. This approach is seen as a more flexible means of developing rule structures, particularly since little' a priori knowledge may be available about the data being analyzed. Additionally, true knowledge discovery of unanticipated results is more likely if the knowledge search is not hampered by constrictive biases introduced by a priori models. Using a data-driven modeling approach, William Messier and James Hansen [Ref. 7:p. 1414] found considerable success in developing quality production rules for expert systems. They point out, however, there are a few limitations when using an inductive approach. First, the 18 data-driven method is difficult to apply to very large databases. Rule development is prohibitive when the availability and range of the variables are large and diverse. Second, inductive approaches can develop significant errors if the data set contains missing or erroneous values. Missing important instances or attributes may lead to rules that are misleading. The ability to inductively generalize is an important process of knowledge discovery. The goal of inductive generalization methods in production systems is to find classes or a range of objects in the variable set that make the rules more applicable. More than one dependent variable category may be used to broaden the independent variables. The intent is to maximize the applicability of the attribute values. In rule induction, RHS dependent variables are broadened to include a wider range of the independent values. For example, a rule using a RHS dependent variable containing three categories, K x K 3 , could match a larger number of LHS independent variables (attributes) if the algorithm allowed for more than one RHS category to be included as an observation in the rule. Generalization can also be accomplished by backing off from the exact values of the independent variables in order to gain a larger sampling of the data set, e.g., variable X may only apply to few instances in ( 1 < X < 3) , where expanding the range of X ( - 100 < X < 500) may broaden the 19 applicability of X to a larger portion of the data set. Exact rules, on the other hand, limit the data sets to match one dependent variable and include only HIT-HIT situations. The following section discusses a few of the early inductive rule production systems based on the data-driven approaches. As these systems developed they found techniques for dealing with the above problems and evolved into more powerful knowledge discovery systems. 1. CLS and ID3 An early inductive learning technique is the Concept Learning System (CLS) algorithm developed by Hunt et al (1966). The object of the algorithm is to take objects of a known class (categories of the dependent variable) described in terms of the attributes (independent variables) to generate a production system which classifies these given objects. In CLS a decision tree is developed by repeatedly segregating the data into smaller and smaller subsets. These subsets each held certain characteristics of the attributes that classified them into separate categories. Quinlan (19 79) eventually developed CLS into a more sophisticated program. This program, ID3 , uses induction to develop a decision tree processes to classify and break the data down into a set of rules. ID3 ' s inputs are a known class of data described in terms of a fixed set of attributes. ID3 ' s output is a decision tree that classifies 20 the given cases of information. ID3 incorporates a top down induction approach using divide -and -conquer techniques to split the data into interesting attributes. The inductive algorithm constructs a rule by incrementally building a classification tree and repeatedly adding additional attributes to underspecif ied branches based on the estimated discriminatory power of that branch. This process continues iteratively until the data is fully partitioned. ID3 is perhaps the most popular method of decision tree analysis because it is easy to implement and produces simple decision trees effectively [Ref. 8:p. 172]. However, as Parsaye and Hansson [Ref. 9:p. 141] point out, ID3 can easily generate too many decision trees with meaningless or uninteresting rules. Additionally, they find several other faults in ID3 . First, ID3 is very sensitive to changes in the database. Small changes to the data can potentially produce large variations in the decision trees. Second, ID3 cannot generalize about the database. It produces exact decision trees, not general rules. Finally, similar to the second fault, ID3 cannot deal with inexact data and therefore cannot produce inexact recommendations. 2. AQ Algorithms Parsaye 's analysis of CLS and ID3 found that the decision tree approach to rule generation lacks robustness. Producing exact rules may help segregate the database into 21 unique subsets but severely cuts down on the ability to open up the data analysis to more useful general interpretations. The AQ family of algorithms is another approach to knowledge discovery that sought to overcome the lack of generalization. These algorithms use both type and structural information of the database to generalize based on a given data samples. Constraints are applied to limit the focus search patterns. The basic idea is to select one example, generalize on the example to determine how much of the generalization applies to the database, ensuring the generalization doesn't violate the constraints. The Star system (Michalshi 1983) is a "knowledge generalization" system because of its ability to use an inductive and generalization process with applied user constraints. Parsaye discusses a definite problem with the AQ based algorithms programs when applied to large databases. Too many generalizations may be produced causing the value of the conclusions to be too "liberal." Additionally, the AQ algorithms could not deal with inexact data and tended to produce nonsensical rules. The previous sections have discussed statistical inductive approaches to knowledge discovery. As pointed out, each approach has its own strengths and weaknesses. Statistical methods can be overly complex and difficult to use. On the other hand, inductive methods , such as CLS, ID3 and AQ may have a tendency to produce either too exact 22 or too liberal rules. A new approach is required that combines the strengths of both statistical and inductive methods while overcoming their inherent limitations. This new approach was introduced by Parsaye with IXL. 3 . IXL/IDIS Parsaye combined the statistical and inductive approach (machine learning) in his knowledge discovery program, IXL (Induction with extremely Large databases) . IXL "... form (s) and test(s) various hypotheses about relationships in the database and uses machine learning algorithms to generate rules." [Ref. 10 :p. 2] . At the heart of IXL, and the later version of the program IDIS, are two core modules. The statistical module analyses the data for statistical relationships and the artificial intelligence (AI) module interprets and tests these relationships then transposing them into easily understood rules [Ref. I0:p. 3]. The sophistication of IXL/IDIS comes into play when the program goes beyond sorting simple database descriptive statistics. By forming hypotheses on the statistical relationships, testing the hypotheses and repeating the process until patterns emerge, IXL/IDIS offers a clear alternative to the over generalization problems of CLS, ID3 and AQ algorithms. IXL/IDIS is able to circumvent the generalization problems that have plagued earlier knowledge discovery programs by allowing user defined 23 constraints on the AI module of pattern analysis. In addition, IXL has the ability to search large databases ( in the hundreds of millions of records) . For these reasons IDIS, the updated version of IXL, was chosen to benchmark the genetic programming approach developed at the Naval Postgraduate School . IXL/IDIS has seen some success in commercial applications. The U.S. Army and Air Force exchange is an example were it has been applied to determine sales patterns based on the customer demographics. IXL helped them target sales and advertising towards the appropriate customer base. [Ref . 9:p. 157] . C. SUMMARY This chapter we has reviewed the ability to capture large scale databases through mass storage and on-line databases connected by increasingly faster networks and the transfer of textual data to binary formats. Coupled with a vast increase in computer capacity, these innovations help deal with complex data and seek meaningful relationships among elements in the database. Recent attempts at knowledge discovery have shifted from traditional statistical techniques involving deductive analysis to more declarative techniques involving inductive analysis. Inductive algorithms, such as CLS and ID3 , have shown great promise but are too restrictive to produce general rules. On the 24 other hand, AQ algorithms create too many generalizations producing many nonsensical rules. A new breed of systems (such as IDIS) that combine both statistical and inductive approaches are beginning to exploit the strengths of both approaches . 25 IV GENETIC ALGORITHMS / GENETIC PROGRAMMING A. GENETIC ALGORITHMS Genetic Algorithms (GA) emulate the natural science of evolution, the science that Darwin interpreted as the evolutionary process that adapted species to their environmental conditions. Genetic algorithms can be viewed as an adaptive search procedure which is based on the model of population genetics and natural selection. Koza defines Genetic Algorithms as follows: Genetic algorithm is a highly parallel mathematical algorithm that transforms a set (population) of individual mathematical objects (typically fixed length character strings patterned after chromosome strings) , each with an associated fitness value, into a new population (i.e. the next generation) using operations patterned after the Darwinian principle. [Ref. 11: p. 18] Genetic algorithms solve problems by evolving potential solutions through a process of randomly recombining the critical aspects of the problem. Problem conditions, actions or characteristics are typically represented as binary strings of data which are combined through genetic operations such as mutation or crossover with other possible conditions, actions or characteristics of the problem. Eventually, an optimal condition is achieved after numerous generations of "breeding" possible solutions. Some of the 26 basic terminology used in genetic algorithms follows [Ref . ll:pp. 18-26] : 1. Population: This represents the set of candidate solutions, or individuals. 2. Generation: Is one iteration of the genetic algorithm. 3. Crossover: Is the process where two individuals in the population exchange parts of their internal representation to create new individuals. 4. Mutation: Is the random change of part of the internal representation of an individual. 5. Fitness: Is the method used to select individuals from the populations through using quantitative measurement. The chosen individuals represent those that exist and survive in the population that may successfully reproduce. 6. Reproduction: Is the process by which part of the parent is taken, and put in the new generation without any alternation. Genetic Algorithms operate on populations of data (individuals) where each represents a potential set of problem solutions. Individuals are selected during each generation by their "fitness" and then recombined through crossover. Offsprings may undergo mutation before they are inserted in the new population. This process continues until some specific termination criteria is met. Figure 1 shows a general flow chart of the genetic algorithm process operating on strings of individuals as depicted by John Koza [Ref. ll:p. 29] . 27 Gen := I Create Initial Random Population < ~ *— : s Ycs Termination \ ^. rit crion Satisfied?^ / | No Designate Result Evaluate Fitness of Each Individual in Population Gen := Gen + 1 i iuid I I No P f / Select Genetic OpcrationN 1^ " y Probabalistically J U Select One Individual liasaLoa EUntaa Select Two Individuals Based o n Fit ness 1 Perform Reproduction Copy into New Population i := i ♦ 1 ze: Select One Individual QaaaJ °p laincss Perform Crossover I Perform Mutation Insert Mutant into New Population Insert Two Offspring into New Population | t:=i + 1 Figure 1 Flow Chart of the Genetic Algorithm Process 28 John Holland developed the genetic algorithm and provided its theoretical foundation in his book Adaption in Natural and Artificial Systems [Ref . I2:pp. 31-58]. Although the basics of Genetic Algorithms have been around since the mid 1960s [Ref. 13:p. 9], GAs have primarily been utilized in optimization and classification problems. Rule induction with genetic algorithms is a relatively new approach. Kenneth A. Dejong suggests using GA approaches to solve rule production systems. By ".. .maintain (ing) a population of candidate rule sets," GA can utilize the selection process to produce optimal rules [Ref. 14:p. 625]. B. SAMUEL: A RULE GENERATING GENETIC ALGORITHM SAMUEL, standing for Strategy Acquisition Method Using Empirical Learning, is a genetic algorithm program developed at the Naval Research Laboratory. SAMUEL is designed to investigate behavior in simulation models. SAMUEL in essence is designed to learn rules for decision making agents. Coupling feedback mechanisms and performance evaluation techniques, SAMUEL seeks to learn optimal decision strategies through a set of rules that evolve over time. The goal of the program is to use a genetic algorithm to refine and improve upon an initial set of knowledge strategies, represented by a population of rules. Knowledge is represented in SAMUEL through three levels of knowledge structures: populations, plans and rules. 29 Populations consists of a set of plans to deal with environmental simulations in the program. Plans, in turn, are composed of specific condition-action rules that are intended to represent performance strategies to deal with specific environmental conditions. Each rule consists of IF- THEN conditional statements representing a set of conditions and appropriate responses to those conditions (actions) . An example rule might be: Rule 10 IF range = [250, 1000] and speed = [500, 1200] THEN SET turn = [0, 90] Performance evaluation of the competing plans is accomplished through the GA. Each generation contains a population of plans that compete and are measured by a fitness function. Based on the relative grades assigned by the fitness evaluation, the GA selects the best of the high performing plans for additional reproduction, crossover and mutation. The reproduction and testing cycles repeat until user specified criteria are meet. The advantage of SAMUEL is that it provides an important framework for developing a GP that analyses data to generate and evaluate production rules. SAMUEL is limited by the restrictive rule structure of IF-THEN statements. SAMUEL is also limited to rule structures of fixed length and does not incorporate combinations of logical operators, such as AND, 30 OR and NOT. These limitations on rule structure are removed by genetic programming. C. GENETIC PROGRAMMING For many problems in machine learning and artificial intelligence, the most natural known representation for a solution is a hierarchial computer program of indeterminate size and shape. [Ref . 11 :p. 210] Genetic programming is a variant of genetic algorithms with a different problem representation. The basic approach to genetic programming is summarized by John Koza in three steps as follows [Ref. ll:p. 213]. 1. Generate an initial population of random compositions on the functions and terminals of the problem (computer programs) . 2. Iteratively perform the following substeps until the termination criterion has been satisfied: a. Execute each program in the population and assign it a fitness value according to how well it solves the problem. b. Create a new population of computer programs by applying the following two primary operations. (i) Copy existing computer programs to the new population. (ii) Create new computer programs by genetically recombining randomly chosen parts of the two existing programs. (iii) These operations are applied to computer program (s) in the population chosen with a probability based on Darwinian fitness. 3. The single best computer solution in the population is designated as the result of the genetic program. The result may be a solution (or an approximate solution) to the problem. 31 The application of genetic programming is relatively new. For the most part these applications have focused on optimization and classification problems unrelated to generating production rules. [Ref. 15:p. 12]. 32 V NPS GENETIC PROGRAM (NPSGP) A. BACKGROUND The Naval Postgraduate School Genetic Program (NPSGP) adapts Walter Tackett's [1993] implementation of John Koza's genetic program developed at Stanford University. NPSGP supports rule generation. Constraints can be added to the structures evolved by the GP. Constraints are used to define the formats for -the rules to be generated by the system. A format specified in NPSGP dictates that rules follow the ( if X then Y ) patterns. In these rules, the dependent variable is on the right hand side (RHS) and independent variables are on the left hand side (LHS) . In genetic programming these independent variables are called terminal sets. Before going further into the details of the NPSGP system, the data set used to evaluate this program is described. This will assist in explaining the functions of NPSGP . 1. Data Description The data set used to demonstrate NPSGP contains quantitative data, specifically, the basic components of U.S. Gross Domestic Product (GDP) for the years 1929 to 1991 [Ref. 16:p. 341-474]. The components include Personal Savings, Disposal Personal Income, and Net Export s/ Import s . 33 Other relevant variables were also used, such as civilian unemployment rates and 10 -year Government Bond rates. Finally, some of this data was converted into ratio's , e.g., Personal Consumption Expenditures were divided by GDP. GDP Growth, the dependent variable, was defined as a change in GDP and was categorized into four discrete classes. The first category was assigned a value of high positive (HPOS) if Indexed GDP grew by more than 2.5 per cent from the previous year. Indexed GDP growth from to 2.5 per cent was categorized as low positive (LPOS) . GDP growth of to -2.5 per cent was categorized as low negative (LNEG) . Any negative growth greater than -2.5 per cent was classified as high negative .(HNEG) . The function set consisted of the operators in the rule structures (i.e., IF, OR, AND, NOT). The rule is represented by a collection of "terms" where each term specifies attribute-value ranges , called selectors. As an example, a selector including the attribute UNEMPLOYMENT may have an applicable range of to 25 per cent in the database; < UNEMPLOYMENT < 25%. Such a strategy offers robust rules through compensatory selection; the terminal set selected by one variable can compensate for by the terminal set selected by another variable. For example, the following rule offers a compensatory strategy for either the change in 10 Bond rates 34 or the change in personal savings and gross public debt as a percentage of GDP: GDP GROWTH = LPOS IF 2.2 < CHANGE IN THE 10 YEAR BOND RATE < 9.20 OR 0.9 5 < GROSS PUBLIC DEBT AS A % OF GDP < 2.16 AND 1.35 < CHANGE IN PERSONAL SAVINGS < 2.61 B. CONTROL PARAMETERS NPSGP offers a number of control parameters that the user can adjust to tailor the program to the type of data and analysis. 1. Fitness Measure The fitness measure is the quantitative evaluation of how well the rule matches the data. NPSGP uses a contingency table approach to establish a method of determine if the LHS of the rule classifies a RHS category, e.g., HPOS, LPOS, LNEG,or HNEG. First, an observation of the independent variable either belongs to a category (C+) or doesn't belong to that particular category (C-). Second, the LHS of the rule may be true for an observation (M+) or false (M- ) . The following contingency table shows the four possible conditions: 35 RIGHT HAND SIDE C+ C- LEFT HAND M+ A B SIDE M- C D A, B, C, D represents the total number of data points in each subset. For example, "A" represents the number of cases where the LHS and the RHS are true for an observation; this is a HIT-HIT condition. "B" represents number of cases where the LHS is true but the RHS is false; this is the MISS-HIT. "C" represents a HIT-MISS condition where the RHS is true but the LHS is false and "D" is the MISS -MISS condition where both the LHS and RHS are false. A fitness function may be defined in terms of the various cases in the contingency table. For example, the simplest method for computing fitness is to divide the MISS HIT value B by the HIT-HIT value A. To eliminate the possibility of division by or into zero, a small number is added to each of the parameters, i.e., (B+1)/(A+1). NPSGP minimizes the value of the fitness function. The following two examples illustrate how fitness function are used to evaluate rules. In both cases there are 63 data points for each attribute. The first case assumes that a rule produces 38 instances of HIT-HIT, category A, no instances of category B, MISS-HIT, and no instances of category C, HIT-MISS. By adding the constant 36 value of 1 to each of the parameters the following fitness measure is calculated. Fitness = (B+C+1)/(A+C+1) = (0 + + 1)/ (38 + + 1) = 0.025 In the second example, a rule produces 30 instances of HIT-HIT, category A, and 8 instances of MISS-HIT, category B, and 3 instances of MISS-HJ.T, category C. The fitness calculation then will yield: Fitness .= (B + C + 1)/(A + C + 1) = (8 + 3 + 1)/ (30 + 3 + 1) = 0.3529 The second rule did not fit the data set as well as the first rule. It had 8 cases where LHS rule fit the data set (interest rates, savings rates, etc.) but not the RHS dependent variable (GDP growth = HPOS ; LPOS ; LNEG; HNEG) and 3 cases where the RHS rule fit the dependent variables but not the data set. The higher fitness measure indicates the fit was not as good. 2. Selection Criteria Tournament selection is a method used to select the best fitting rules from the total generated population. It is similar to the way a winner is chosen from a tournament 37 among competitive teams. In NPSGP tournament selection was set at a value of six, resulting in a grouping of six rules from which the best is then chosen. Another method is a probabilistic selection method where the probability of selecting an individual depends on its fitness. For example, the best individual in the population may have a high probability ( close to 1.0), while individuals in the mid -range may have a probability of approximately 0.5. The worst individual of the group will have a probability of 0. [Ref . ll:pp. 604-607] . 3 . Population and Generation NPSGP also allows the user to select the population size to be specified. A population of 1,000 in this case means 1,000 rules will be initially randomly generated and then competitively adapted through crossover. Users can also select the total number of generations to be run, which serves as the termination point of the program. The total number of generations and population size required to produce an optimal solution may be determined by trying several combinations of these parameters. C. EVOLUTION OF RULES The program starts with an initial population of a number of randomly generated rules composed from the function and terminal variables sets. This initial population will usually contain a high percentage of poorly 38 fitting rules that are later refined by the crossover process. Reproduction and crossover operations are then applied to breed a new population of offspring rules. This process continues until the termination criteria is satisfied. Figure 2 outlines the major processes that the program follows in generating rules. GP Program Flow Generate population I Evaluate & Sort 1 l Breed New Population hu /Check ^* Termination J. Yes CStop^) * Figure 2 - GP Program Flow 39 D. AN EXAMPLE OF THE GP PROCESS The following illustrates the NPS genetic program output using the econometric data outlined in Appendix A. Program control parameters were set with a population size of 3000, crossover rate of 75 per cent, and termination parameter of 50 generations. The fitness measure (B + C + 1) / (A + C + 1) provides an initial means to minimize the number of HIT- MISSES and MISS -HITS. Although programmed in a "C" language shell, NPSGP produces rules using the LISP list format. Using the above fitness measure, NPSGP reported the best rule, e.g., the rule with the lowest fitness value, in generation as follows: Best of Generation 0: (IF (AND (TWIXT 4.760321 UNEMPCIV -3.134517) (TWIXT 1.770329 BOND10YR -15.607920) ) 3.000000.) Number of records matched by LHS : 2.000000 Number of misclassif ied records: 0.000000 Confidence: 1.000000 Validation Fitness= 0.333333 40 The rule can be easily translated to a more understandable format. The above rule translates to: IF -3.134517 < UNEMPCIV < 4.760321 AND -15.607920 < BOND10YR < 1.770329 THEN GDP GROWTH = HNEG As indicated above the LHS of the rule matched two observations in the data set. The confidence factor, used as a measure of exactness of a rule, is calculated by the formula A / (A + B) . This formula calculates the percentage of correct RHS classifications to total RHS classifications, when all LHS instances match the data* set. In this example there were no miss- classifications the formula computed to 2 /( 2 + ) or 1.0, representing 100 per cent confidence. The rule was exact in that it correctly classified all the records pertaining to that rule into a HIT-HIT situation. The fitness was computed as [B (0) + C (0) + 1] /[ A(2)+ C(0)+ 1] = 1/3 = 0.33333. Constant values of 1 were added to the fitness denominator and numerator to avoid division by zero. Finally, the numeric code 3 at the end of the rule represents the RHS category, e.g., GROWTH = HNEG. Better rules were evolved to fit the data set through generations of reproduction. As an example, the best rule of generation 3 follows: 41 Best Rule of Generation 3: (IF (AND (TWIXT 0.341340 GDBTPERGDP -2.211879) (TWIXT -4.860248 GDPIPERCHG 0.170821) ) 0.000000) Number of records matched by LHS : 32.000000 Number of misclassif ied records: 3.000000 Confidence: 0.906250 Validation Fitness= 0.315789 Translated: IF -2.211879 < GDBTPERGDP < 0.341340 AND -4.860248 < GDPIPERCHG < 0.170821 THEN GROWTH = HPOS The improvement in fitness, the NPSGP method of gauging rule improvement, is noted by the decrease from 0.333333 to 0.315789. This rule applies to different LHS attributes and RHS category. Finally, the rule has 32 HIT-HIT classifications and 3 MISS-HIT. Here the fitness formula used, A/ (A+B) , moved the GP toward more generalized rules. By generation 7 the fitness measure "optimized", the validation fitness value of the best rule minimized at .263158. As shown in the example below, this rule has a net decrease of one MISS -HIT observation from the above example. 42 Best Rule of Generation 7: (IF (AND (TWIXT -6.015694 PFDBTGDP 0.895278) (TWIXT 0.239258 GDPIPERCHG -24.963970) ) 0.000000) Number of records matched by LHS : 32.000000 Number of misclassif ied records: 2.000000 Confidence: 0.937500 Validation Fitness= 0.263158 Translated: IF -6.015694 < PFDBTGDP < 0.895278 AND -24.963970 < GDPIPERCHG < 0.239258 THEN GROWTH = HPOS Through crossover NPSGP was searching for the best rule that could fit both the RHS categories and LHS terminal set Although the best of breed fitness measure did not improve after this generation, changes continued in rule attributes of subsequent generations. For example, the top rule in generation 50 included the attributes GDBTPERGDP and GDPIPERCHG. It had the same number of classifications/ misclassif ications and confidence as the best rule in generation 7. However, the rule in generation 7 contained the attributes PFDBTGDP and GDPIPERCHG. 43 This improvement of the fitness measure is graphically illustrated in Figure 3. With the relatively small data set used in this application, the fitness measure very quickly converged to an optimal value. Fitness Fitness vs Generation 0.38 0.36 • 0.34 0.32- \ 0.3- \ 0.28 0.26 \ 0.24 0.22 01 23456789 Generation 10 Figure 3. Fitness verses Generation E. GP EVALUATION RUNS To test how sensitive the fitness measure is to changes in parametric inputs, the GP was run with different population sizes and crossover rates using the same fitness measure (B + C) / (A + C) . First, the crossover rate was varied holding the population constant at 3,000. Crossover rates of 60, 65, 70, 75, 80, 85, 90, 95 and 100 per cent were tried. The results indicate that changes in the 44 crossover rate had a minimal effect on fitness. In all cases, the fitness of the "best of generation" optimized in generation 7 with a value of 0.263158. This result may be due to the relatively small size of the database. A larger data set using real and discrete terminal values may have produced quite different results. Another test was performed changing the population size to determine its affects on fitness convergence. POPULATION 500 1,000 2,000 3,000 5,000 10,000 OPTIMAL GENERATION 12 8 15 7 11 16 OPTIMAL FITNESS 0.263158 0.263158 0.236842 0.263158 0.184211 0.038462 Table 1. Fitness vs Population As Table 1 indicates, the size of the population did affect the optimal fitness and the number of generations required to reach it. Increasing the population from 3000, to 5000, and then 10,000, lowered the fitness value. This indicates there is a minimum population size to produce the best rule results. The larger the selection pool the better statistical odds of producing rules that best model the data. The fitness measure (0.08333) of generation zero 45 (without crossover) using a population of 10,000 was better than the optimal fit of the smaller population after generation 7. F. SUMMARY Applying NPSGP to the data set revealed some important characteristics of GP rule generation. First, the composition of observations used in the fitness function is critical to the type and depth of rules produced by the GP. For example, the fitness function can be modeled to either produce generalized rules that include both HIT-HIT and MISS -HIT observations, or, it can produce exact rules using only HIT-HIT observations. This point is elaborated further in the next chapter. Second, it is important to review all rules produced by the GP that meet a desired minimum fitness measure due to the differences in the attributes and data range variations of those rules. Interesting rules and valuable data associations can be overlooked if users only focus on those rules where the program has minimized the fitness value. Third, population size has a significant affect on rule generation, including both fitness values and LHS attribute ranges . •lo VI COMPARISON OF KNOWLEDGE DISCOVERY RESULTS This chapter uses the data outlined in Appendix B to compare two known knowledge discovery approaches, i.e., statistical and inductive, against the NPSGP. The comparison is based on producing data element associations between the RHS categories and LHS attributes. As mentioned in Chapter III, IDIS was selected as representative of the inductive knowledge discovery approach because of it's ability to overcome the limitations of other inductive systems. While there are numerous statistical methods for discovering data element associations, stepwise discriminate analysis (SDA) was used to select which LHS attributes most significantly discriminated between the RHS categories. The SDA was done using the SAS statistical package and is described in greater detail in the following sections. The benchmark for comparing all three approaches, was the ability to identify associations among LHS data elements. SDA accomplishes this by identifying statistically significant independent variables that discriminate between dependent variable categories. This SDA list of significant variables is then compared to the LHS attributes produced in the NPSGP rules. The rationale behind this comparison is to determine if NPSGP could 47 develop data element associations that include variables found to be significant by SDA. Benchmarking IDIS to NPSGP meant reviewing the generated rules to compare LHS attributes. Any attribute listed in a rule was assumed to be significant. The LHS attribute ranges were not reviewed in detail nor was there an attempt to determine the usefulness of the rules. A. STATISTICAL ASSOCIATIONS SAS discriminate analysis procedures analyze data sets containing one dependent variable and several independent variables. The purpose of this discriminate analysis is to find the subset of those quantitative independent variables that are statistically "important" and can reveal differences among the dependent variable categories. However, SDA is not necessarily the best approach for all purposes. It has to be used carefully and reflect the user's knowledge of the data. [Ref. 6:p. 911]. Within the SAS discriminate analysis, the STEPDISC function was chosen to produce a discrimination model by selecting a subset of the quantitative variables based on one of two following criteria: 1. Analysis of covariance using the F-test to determine significance. The variables chosen act as covariants to the dependent variable (GDP Growth) . 48 2. Analysis of the squared partial (R 2 ) correlation value between the dependent variable (GDP Growth) and the independent variables, controlling for the effects of other variables selected for the model. [Ref. 6:p. 909] . Stepwise selection begins the process without any variables in the model. Variables are included into (or later excluded from) the model if they contribute to the discriminatory power of the model as measured by Wilke's Lambda. This process is repeated in steps until all variables meet either the criteria to stay or be dropped from the model. There is one potential weakness of this model: relationships between the unselected variables are not analyzed. This potentially excludes statistically significant variables. [Ref. 6:p. 910]. 1. Results: The results of using SDA on the 16 independent variables listed in Appendix B is illustrated in the Table 2. Table 2 first segregates the significant and non- significant independent variables in the model then lists these variables by decreasing order of partial R 2 value. The R 2 values represent the one-way analysis of covariance between the selected variable and the remaining variables not chosen for the model . 49 VARIABLE SELECTED VARIABLE NOT SELECTED PARTIAL R 2 P STATISTIC PROB > F STATISTIC DNEMPCHG 0.6217 32.316 0.0001 INDXGPDIPER 0.2276 5.697 0.0017 EXPPERGDP 0.2193 5.5151 0.0033 PCEPERCHG 0.2079 4.986 0.0039 POPCHGPER 0.1379 2.880 0.0443 GDPIPERCHG 0.1092 2.125 0.1082 BOND10YR 0.1066 2.228 0.0950 SAVPERDPI 0.1054 2 .0817 0.1138 UNEMPCIV 0.0935 1.753 0.1679 GDBTPERGDP 0.0837 1.553 0.2122 IDXPCEPER 0.0377 0.665 0.5773 DEFPEROUT 0.0312 0.548 0.6517 PFDBTGDP 0.0252 0.439 0.7263 SAVPERCHG 0.0190 0.329 0.8046 CHGBDRATE 0.0172 0.298 0.8264 GDEBTGDP 0.0070 0.120 0.9480 TABLE 2. Discriminate Analysis Results of Data Set 50 The discriminate analysis shown in Table 2 indicates that eight of the sixteen attributes were significant and can be used to discriminate between the four categories of the dependent variable (GDP Growth = HPOS, LPOS, LNEG, HNEG) . Error rate estimation (probabilities of misclassif ication) measures the discriminate model's performance. As shown in Table 2, five of the eight variables included in the model had error estimates (Probability > F Statistic) of less than 5 per cent while six of the eight nonselected variables had error estimates of greater than 50 per cent. Overall the analysis shows a strong discriminative association in eight of the sixteen independent variables. These results will be used to test the robustness of the rules produced using NPSGP and the same LHS attributes. B. IDIS As mentioned in Chapter III, IDIS uses a search algorithm blending statistical analysis with inductive heuristic learning. This algorithm is fixed within the program and search flexibility comes in setting the search control parameters, shown in Appendix B. These parameters allow IDIS to focus on generating rules from either narrow or broad search patterns. For example, liberalizing the program control parameters (confidence factor, error margin, 51 data range generality and minimum required records per rule) will produce rules with multiple LHS attributes and broad data value ranges. These rules may be too broad to be "interesting" and require further refinement. To focus these rules, control parameters are narrowed to reduce the selected LHS attribute range values, that is reduced attribute and range generalization. On the other hand, IDIS can also search for specific dependent variable categories. However, this may come at the expense of rule generalization and error estimation, particularly if the category represents only a small portion of the data set. Initially, a broad approach was used by selecting control parameters so that IDIS would search for a wide variety of rules linking LHS attributes to RHS categories. For example, control parameters were set at a certainty factor of 80 per cent, 25 per cent error margin, a minimum of 6 records per rule, 100 per cent range generalization and the dependent variable was set to produce rules for any of the RHS dependent categories (HPOS, LPOS, LNEG, HNEG) . However, these control parameters were too broad and only produced rules for HPOS. To search for rules in the other RHS categories, the control parameters were subsequently relaxed to include additional data elements. In addition, the dependent variable categories were focused on LPOS, LNEG and HNEG. The results of these searches are discussed below. 52 1. IDIS Results Initially IDIS produced rules including every LHS attribute, either singularly or in combination with other LHS attributes, but for only one dependent variable category, GDP GROWTH = HPOS . IDIS produced rules for GDP GROWTH = LPOS and LNEG only after restricting the RHS search to only those categories. The inductive rules for these categories are not as "good" as those for HPOS. Further, it was necessary to reduce the minimum record constraint to 3 and increase the margin of error to 50 per cent before the system produced any rules at all. The only LPOS rule contained three attributes; this indicates that IDIS could not find a singular or dual attribute rule for LPOS. This singular rule is shown below: GDP GROWTH = LPOS IF 0.3 <= GDPIPERCHG <= 7.6 AND 0.011 <= POPCHGPER <=. 0.0208 AND -0.1843 <= GDBTPERGDP <= -0.0039 CONFIDENCE FACTOR = 66.67 % MARGIN OF ERROR = 3 6.4 % NUMBER OF APPLICABLE RECORDS = 9 53 The only rule produced where GDP GROWTH was LNEG follows: GDP GROWTH = LNEG IF l.-l <= UNEMPCHG <= 2.9 CONFIDENCE FACTOR = 63.64 % MARGIN OF ERROR = 33.0 % NUMBER OF APPLICABLE RECORDS = 11 These results imply that a fixed search algorithm may not identify all potential rules in the RHS categories. The IDIS user may need to force the search for different RHS categories. As a tradeoff to finding these rules the user may be required to accept lower confidence factors and higher error estimates. C. NPSGP NPSGP uses a random and evolutionary approach (through adaptive competition) . There is no fixed search pattern built into the initial rule generation process. Unlike traditional methods which sometimes presume a specific structure (i.e., liner compensatory or correlated attributes) , NPSGP starts off with a set of randomly selected rules [Ref 3:p. 219] . Structural bias may be introduced as part of the fitness function that discriminates between these rules in later generations. NPSGP offers flexibility to focus the data mining because it 54 is easy to change the fitness function. The fitness function's effect on rule selection and along the NPSGP program parameters used in this research are discussed further in the following sections. 1. NPSGP Parameters • For examples discussed in this chapter, NPSGP control parameters were set to the following values: 1. Population size: 5,000. 2. Crossover rate: 75 per cent. 3. Random generator seed value: 4. 4. Function Set was originally limited to one conjunction (AND) and later modified to include an arbitrary number of logical operators (AND, OR and NOT) . Population was set at 5,000 to limit CPU time and memory used in the analysis. As indicated in chapter V, crossover had a small effect on GP rule generation. An arbitrary crossover value of 75 per cent was assigned. Introducing of a random seed starts the random number generator used in selecting the initial rules. Changing the random generator seed number has a small effect on the types and ranges of the rules generated. However, these differences are expected to be normalized over a number of runs. In addition to the above parameters, NPSGP also allows the user to select the number of rules N, it shows at the end of each generation. These "top N" rules are listed in descending order of computed fitness value. To limit the 55 rule review process we set the list to show only the top 30 rules for each generation. 2. Fitness Functions This section illustrates how changing the fitness function can refocus rule discovery. Ultimately, this leads to different knowledge associations (rules) than either the statistical discriminate approach or the combined efforts of the singular search function deductive/inductive IDIS approach. A variety of fitness functions were tested in NPSGP, most yielding different sets of rules. To illustrate the impact that changing the fitness function has on the rules generated, the following two fitness functions were selected: Fitness=0.8- A+B+l (1) Fitness- 0-25S+0.5C+1 2A+0 .5B+C+1 (2) The intent of fitness function (1) was to force the system to generate rules that are correct approximately 80 per cent of the time. Since NPSGP minimizes the fitness 56 value, subtracting the calculated value of ( A / (A+B) ) from 0.8 (80 per cent) will drive the value of the confidence equation , ( A/ (A+B)), to 80 per cent. Therefore, the fitness function (1) will be optimized when (A / (A+B) ) approaches 0.8 (assuming it is <= 0.8). When (A / (A+B)) is greater than 0.8, the fitness is set at an arbitrary high number so that the system will avoid such solutions. Fitness function (2) uses a weighted approach that assigns relative importance to the different observations, e.g., A, B, C, D. This approach demonstrates that rule search can be focused for a specific mix of observations. This increases rule exactness or generality, as desired by the user. For example, when the aim is to produce exact rules, observation A (HIT-HIT) receives a greater weight than the other observations. To broaden the rule applicability to the RHS, observation B (MISS-HIT) is assigned greater numeric weight. To produce LHS rule generality, observation C (HIT-MISS) is given more weight. Fitness function (2) uses this weighted approach to introduce more LHS and RHS generality" into the rules. This is accomplished by emphasizing A (HIT-HIT) , limiting emphasis on C (HIT-MISS) , de- emphasizing B (MISS -HIT) and ignoring D (MISS -MISS) . The following sections apply these two fitness functions to the data set. 57 c. Example C: The following rule is representative of the rules produced with fitness function (2) without the singular conjunction restraint. Again, rules including other categories (HPOS, LNEG, HNEG) were not among the top 30 rules . 1. Generation 30, Rule 1: GDP GROWTH - LPOS IF 3.99 < UNEMPCIV < 5 . 84 AND -1.55 < SAVPERCHG < 0.77 AND -2.10 < PCEPERCHG < 2.36 FITNESS = 0.106061 CONFIDENCE = 100.00 % APPLICABLE RECORDS - 7 4. Function Set Results In the above examples, rule search patterns have focused on simple rule associations using one conjunction (AND) , both for IDIS and NPSGP. Knowledge about complex data associations is not adequately represented using only- singular (AND) conjunctions. Rules representing complex associations between data elements can be improved by widening the choice of logical operators to include (OR and NOT). For example, the logical operator "OR", provides rules where there is choice between equivalent attributes ( IF X or Y THEN Z) . The operator "NOT" is used to depict exceptions in the rule ( IF X and NOT Z, THEN Y) . 60 To illustrate, the program using function set (1) was modified to permit unlimited combinations of logical operators (AND, OR, NOT) . NPSGP was able to produce rules using all three logical operators. Example D, demonstrates this additional flexibility. It provides a sample of the rules NPSGP produced using multiple logical operators. IDIS dose not provide this ability. a. Example D: 1. Generation 4, Rule 1: GDP GROWTH = LPOS IF 1.84 < CHGBDRATE < 5.69 AND -1.56 < GDPIPERCHG < 2.23 OR 3.9 8 <UNEMPCIV < 5.41 OR NOT 0.49 < UNEMPCIV < 15.99 AND -0.40 < GDPIPERCHG < 4.116 FITNESS = 0.02222 CONFIDENCE = 75.00 % APPLICABLE RECORDS = 8 D. COMPARISON OF RESULTS Comparing the data element associations developed by the three different approaches brought out two significant differences. First, the ability to focus the search algorithms ( i.e., "fitness function"), to force the search pattern toward exact or general patterns is the most important component of any knowledge discovery program. If the algorithm can not be focused, the knowledge discovery 61 program is locked into the fixed search pattern defined by that algorithm. With IDIS, control parameters used in conjunction with the search algorithm can be changed. However, this dose not change the fixed search algorithm itself. This limitation made it difficult for IDIS to find rules for the RHS categories LPOS, LNEG and HNEG. NPSGP was able to discover some rules in these categories using fitness function (1) . Additionally, the discriminate analysis function was limited to statistical association tests, such as the F test, R 2 partial correlation and Wilke's Lambda tests, to find data element associations between dependent and independent variables. This stepwise discriminate data analysis eliminated some interesting attributes from further analysis. These attributes were used to produce rules in both IDIS and NPSGP. For example, NPSGP developed rules using the eight significant variables in the discriminate analysis. It also developed rules wi.th a few of the variables classified as non-significant ( e.g., Example C included SAVPERGHG, Example B included GDEBTGDP and Example D included CHGBDRATE) . For the second difference, including additional logical operators ( NOT and OR) increases robustness in rule generation. It provided both exceptions to the rule and/or equivalent attributes within the same rule. Due to it's limited rule structure, IDIS does not include additional 62 logical operators. It is limited to the single conjunctive "AND." NPSGP's flexibility to include "OR" and "NOT" operations leads to more complex rules and associations that allow deeper data mining with complex data element associations. 63 VII CONCLUSIONS AND RECOMMENDATIONS A. CONCLUSIONS The focus of this research is to compare several methods of knowledge discovery according to their ability to identify associations in data elements. One of the methods chosen for this purpose was a statistical approach, discriminate stepwise analysis. Stepwise discriminate analysis identifies independent variables that statistically discriminate against various dependent variables. The second method IDIS, combines statistical and inductive methods. IDIS represents associations between data elements by production rules. The last method is a genetic program that evolves rules through competitive adaption. This analysis compared the structural characteristics of the rules generated, rather than their usefulness. It should be noted that the definition of usefulness is very subjective and depends heavily on the purpose for which the rules were generated. The test data set used in this study combined Gross Domestic Product information and economic indicators . Several findings resulted from the analysis. First, the NPS GP model was able to develop rules involving more RHS categories than IDIS. In one GP program run (Example A in 64 Chapter VI) , rules were produced for all four dependent variable categories. IDIS produced an abundance of rules using all attributes for HPOS . However, IDIS was unable to produce any rules for HNEG. Relaxing IDIS's constraints (control parameters) to very low confidence levels and high error estimates only produced one rule LPOS . NPSGP also found rules with more LHS attributes. Further more, the stepwise discriminate analysis found eight statistically significant LHS attributes. NPSGP found rules involving three additional LHS attributes; SAVPERCHG, GDEBTGDP and CHGBDRATE . / Second, NPSGP demonstrated that crossover had a limited effect on the GP's rule generating capabilities for small quantitative data sets (16 attributes* by 63 instances) . Population size was more important in GP generated rules, as measured by the fitness function. Population size primarily affected the selection and range of the attributes chosen. Third, fitness function (search algorithm) is the most important parameter used in knowledge discovery programs. It sets the search focus for selecting independent and dependent variables. The fitness function is used as a filter to select rules that fit into a particular schema, e.g., exact or general rules. IDIS and stepwise discriminate analysis are locked into a fixed search pattern. The flexibility to change the fitness function 65 (search algorithm) gives the NPSGP program a significant advantage . Finally, the ability to expand the function set by adding of logical operators such as "OR" and "NOT" gives NPSGP the potential to handle complex rule associations not covered by IDIS or statistical methods. B. FUTURE RESEARCH Recent developments in genetic programming offer great potential for knowledge discovery. GP is not restricted to a priori deductive and inductive paradigms that limit the scope of knowledge discovery. It offers a radical breakthrough that cuts through limitations of traditional approaches. NPSGP has shown that genetic programming is a viable approach to knowledge discovery, but the program is still in its infancy. The following recommendations are offered to help develop NPSGP into a mature and robust system with greater applicability to a diverse range of knowledge discovery. 1. Fitness Function Applications: This study demonstrated that fitness function plays a critical role in determining the scope and depth of the rules generated. Comparative research is needed to determine the optimal type(s) of fitness measures to use in GP. Fitness functions should be tailored to the different data schema, e.g., quantitative, scaler, combinations of 66 scaler and nonscalar, etc. This would enhance GP capabilities. For example, a weighted fitness measure ( ( B± + V^) / A^; were W i represents the number of records in each category i divided by the total number of records) was successful in focusing the rules toward the smallest category in the data set, GDP GROWTH = HNEG. Additionally, this study used a relatively simple database of quantitative data to develop and test NPSGP. Other more complex data sets may require analysis based on fitness functions using non- linear logarithmic or geometric relationships between data elements, attributes or dependent variables . Finally, using parallel GP programs with multiple fitness measures for the same data search may expedite the knowledge discovery process. 2. Different Control Parameters Because NPSGP is in the initial testing stage, a complete diagnosis of its capabilities was not feasible. Running GP with multiple data sets and experimenting with the GP control parameters would be very productive. At a minimum, the following control parameters should be studied to determine their potential impact on rule generation: a. Mutation vice Crossover, or both. b. Tournament size. 67 c. Minimum population size as a factor of data size. The use of an extremely large population size; 100,000 and above . 3. Develop Measurements of Rule Fit Another potential topic of research is to develop a fitness measure that indicates how the LHS values "fit" the range of data element values. For example, computing standard deviations for quantitative data and then comparing these against the range of LHS values in the rule may- indicate how tight the rule value ranges fit the data set. This measure may indicate how narrow .or broad the rule fits the range of attribute values, allowing another quantitative measure of the quality of the rule. 4. Database Objects The ability to define virtual data attributes offers great potential for knowledge discovery systems. Defining new objects by combining attributes and/or specifying ranges for attribute values focuses the knowledge discovery system on multiple dependent variables packaged into a "virtual" attribute. For example, a virtual attribute "EQUIPMENT CASUALTY" may be used to define an equipment failure as a combination of attributes OPERATIONAL, INSTALLATION CODE, CRITICAL MISSION CODE, etc. If the equipment is non- operational, installed on the aircraft and critical to a mission area, you have an instance of EQUIPMENT CASUALTY. 68 This "virtual" attribute would help focus the knowledge discovery search on a predefined set of attribute values. C. FUTURE DOD APPLICATIONS The future of knowledge discovery systems lies in its ability to produce meaningful associations from data elements in existing databases. This research has demonstrated that new knowledge discovery systems involving genetic programs', such as NPSGP, offer significant potential for knowledge discovery. As the largest single organization in the world, the Department Of Defense (DOD) has an abundance of databases that could be analyzed using the knowledge discovery techniques outlined in this thesis. The range and use of these databases are as varied as in the commercial sector. IDIS, for example, is already being used by the U.S. Army and Air Force Exchange systems. Logistic and maintenance system applications are but two additional future targets for knowledge discovery systems. The potential benefit in applying GP knowledge discovery systems to DOD databases is conceptually large, offering insights to undiscovered data relationships that may have operational and financial implications. 69 APPENDIX A TEST DATA Test data used in evaluating the statistical, IDIS and NPSGP programs consisted of the following econometric data: 1. YEAR: The corresponding year for the data. 2. GDP GROWTH: As explained in chapter IV, GDP Growth was assigned a discrete value of four categories; high positive (HPOS or 0) , low positive (LPOS or 1) , low negative (LNEG or 2), and high negative (HNEG or 3) . These categories were determined by using the growth (decline) in Indexed GDP from the previous year. Indexed GDP was taken from [Ref. 16 :p. 341-474] and represents an implicit price def-lator (1987 = 100) . HPOS corresponds to a growth greater than 2.5 per cent. LPOS represents a growth from to 2.5 per cent and LNEG represents growth from -0.001 to -2.5 per cent. HNEG growth represents any decline in GDP greater than -2.5 per cent. 3. IDXPCEPER: Indexed Personal Consumption (1987 = 100) as a percentage of GDP. 4. PCEPERCHG: The change in indexed Personal Consumption Expenditures from the previous year. 5. INDXGPDIPER: Indexed Disposal Income as a percentage of GDP. 6. GDPIPERCHG: The change in Personal Income as a per cent of GDP from the previous year. 7. EXPPERGDP: Exports as a per cent of GDP. 8. SAVPERDPI: Personal Savings as a percent of Disposal Personal Income. 9. SAVPERCHG: The change in Personal Savings from the previous year. 10. POPCHGPER: The percentage change in U.S. population from the previous year. 70 11. BOND10YR: The rate on 10 year U.S. Treasury Bonds, or equivalent. Years 1930 - 1932 and 1934 - 1938 were estimates. 12. CHGBDRATE: The change in the 10 year U.S. Treasury bond rate, or equivalent, from the previous year. 13. UNEMPCHG: The percentage change in civilian U.S. unemployment from the previous year. 14. UNEMPCIV: U.S. civilian unemployment. 15. DEFPEROUT: The U.S. deficit as a percentage of U.S. government outlays . 16. GDEBTGDP: U.S. government gross debt as a per cent of GDP. 17. PFDBTGDP: U.S. government public debt as a per cent of GDP. 18. GDBTPERGDP: The percentage change in U.S. government gross debt as a per cent of GDP from the previous year. 71 APPENDIX B IDIS PARAMETERS IDIS provides adjustable control parameters permitting the user to focus rule search toward specified goals. These parameters are listed below with brief explanations. 1. SAMPLING PERCENTAGE: This parameter is used to set the proportion of the database to be used in knowledge discovery. 2. INTEREST LEVEL: This parameter is used to set the interest level of the independent variables. INTEREST LEVEL sets priorities on those LHS attributes selected in the rule search. The higher the setting the higher was the user's interest in that attribute. 3. MAXIMUM LENGTH OF RULE: This parameter sets the maximum number of LHS attributes that appear in the rule (i.e., the number of combine LHS attributes the generated rule should not exceed) . 4. MAXIMUM MARGIN OF ERROR: The MARGIN OF ERROR calculated in rule production gives the range of estimated error of the rules computed confidence level. As a control parameter setting, MARGIN OF ERROR expresses the user's tolerance for error in the estimate of the confidence level. Calculation of this parameter is proprietary and has not been disclosed by IntelligenceWare, Inc., the developer of IDIS. 5. MINIMUM RULE CONFIDENCE: RULE CONFIDENCE is calculated in IDIS by dividing HIT-HIT observations by HIT-HIT and MISS -HIT observations. As a parameter RULE CONFIDENCE sets desired rule generality by specifying the minimum acceptable ratio of observations that fall into this calculated range. 6. MINIMUM PER CENT OF DATABASE FOR RULE FORMATION: This parameter specifies the minimum percentage of LHS instances, as a per cent of total instances, that must be contained in the rule. It is the number of records in the data set, as a percentage of total records, that are included in the rule. 72 7. MINIMUM NUMBER OF RECORDS USED FOR RULE FORMATION: This parameter is the integer equivalent of the MINIMUM PER CENT OF DATABASE FOR RULE FORMATION. 8. MAXIMUM GENERALITY: This parameter specifies the largest scaler range of the LHS attributes permissible for rule generation. 73 APPENDIX C NPSGP USERS GUIDE The following lists a set of procedures that can be used to execute NPSGP. This listing contains a minimum set of user guidelines required to run the program using a UNIX workstation. 1. Load the data set either as an ASCII file. Delete all irrelevant data and insure the dependent variable is in defined in the first column. Saves the file as a text file with the name (econ.tab). 2. Convert the ASCII file to a "C" program structured file using the command: perl define.pl -15 econ.tab NOTE: "-15" sets the lag time for the data, the default value if none is entered is 5. 3. Recompile the program by typing the MAKE command. 4. NPSGP program is executed using the following commands: a. To start a new program: gpc 1 50 none 4 b. To restart a program from a failure checkpoint: gpc -r 1 5 none 4 NOTE: »-r" specifies the restart from a checkpoint; "50" specifies the total number of generations to run; "4" represents the seed number. Changing the default settings of population size, crossover rates, random seed, tournament selection, etc. can be done through the program module "Default . in." . Changing these parameters does not require recompilation of NPSGP. 74 6. Changing the fitness function can be done in the program module "Fitness. C" . NOTE: Changing this module REQUIRES recompilation of NPSGP. 75 LIST OF REFERENCES (1) Frawley, William J. , Piatetsky-Shapiro, Gregory, and Matheus, Christopher J. , Knowledge Discovery in Databases: An Overview, KNOWLEDGE DISCOVERY ON DATABASES, 1991. (2) Parsaye, Karman , Chignell Mark ,Knoshafian Setrag and Wong Harry, Intelligent Databases, Object-Oriented, and Deductive Hypermedia Technologies , 19 89. (3) Green, D. P.. and Smith, S. F., A Genetic System for Learning Models of Consumer Choice , Proceedings of the Second International Conference on Genetic Algorithms, MIT. Cambridge, Hillsdale, N. J. rLawrence Erlbaum, 1987. (4) Winkler, R. L. and Hays, W. L., Statistics: Probability, Inference, and Decision, Second Edition, Holt, Rinehart and Winston, 1975 (5) Weiss, Neil A., and Hassett, Matthew J., Introductory Statistics, Third Edition, 1991. (6) SAS Institute Inc. , SAS/STAT User's Guide, Release 6.03 Edition, Cary, NC: SAS Institute Inc., 1988. (7) Messier, William F., Jr, and Hansen, James V., Inducing Rules for Expert System Development: An Example Using Default and Bankruptcy Data, Management Science. (8) Bundy, Alan, Silver, Bernard and Plummer, Dave, An Analytical Comparison of Some Rule-Learning Programs , Artificial Intelligence, April 1985. (9) Parsaye, Kamran and Chignell, Marke, Intelligent Database Tolls & Applications: Hyperinformation Access, Data Quality, Visualization, Automatic Discovery, John Willey and Sons, Inc. 1993. (10) Parsaye, Karman, Wiat can IXL do that Statistics cannot? , IntelligenceWare, 1990. (11) Koza, John R., Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, 1992. 76 (12) Liepins, G.E. and Hillard, M.R., Genetic Algorithms: Foundations and Applications , Annals of Operations Research, 1989. (13) Burtka, Michael, Genetic Algorithms , The Stern Information Systems Review, published by the Stern School of Business, New York University, Spring 1993. (14) Dejong, Kenneth A., Genetic- Algorithm-Based Learning: Machine Learning Volume III , Navy Research Laboratory, 1990. (15) Nissen, Vollcer, Papers on Economic and Evolution ti 9303 ; Evolutionary Algorithms in Management Science, July 1993. (16) Council of Economic Advisers, Economic Report of the President , January 1993 , Published by the United States Printing Office, Washington 1993.' 77 INITIAL DISTRIBUTION LIST 1. Defense Technical Information Center Cameron Station Alexandria, Virginia 22304-6145 2. Library, Code 52 Naval Postgraduate School Monterey, California 93943-5002 3. Mohammed A. Al-Mahmood P.O.Box 5998 Manama, State of Bahrain 4. Steven L. Smith 12 09 7 Winona Drive Lake Ridge, Virginia 22192 5. Balasubramaniam Ramesh, Code RA Naval Postgraduate School Monterey, California 93943-5002 6. William Gates, Code GT Naval Postgraduate School Monterey, California 93943-5002 78 DUDLEY KNOX JBR NAVAL POSTGRADUATE SChO^. MONTEREY CA 93943-5101 GAYIORD S