Information Science G2.3 Genetic information learning
Abstract
Supervised learning in attribute-based spaces is one of the most widely studied machine learning problems and, consequently, has attracted considerable attention from the evolutionary computation community. The problem studied here is typical: determining optimal symbolic descriptions for a concept, for which positive and negative examples are provided along with an appropriate language. Key difficulties stem from such concept descriptions being sets of elementary descriptions. The approach presented here uses a variable-length representation in which each chromosome represents a complete set of these elementary descriptions. Another difficulty lies in the gap between the abstract variable-length phenotype and the often-used binary genotype. This problem is avoided by defining the evolutionary search at the phenotype level. Finally, most other evolutionary approaches suffer from high time complexity. The approach presented in this case study alleviates this problem by utilizing problem-specific search operators and heuristics and by precompiling data to facilitate faster evaluations.

G2.3.1 Project overview

G2.3.1.1 Problem description

Supervised concept learning is a fundamental cognitive process that involves learning descriptions of some categories of objects. Precategorized example objects constitute a priori knowledge. Acquired descriptions, often in the form of rules, can subsequently be used either to infer properties of the corresponding concepts (characteristic descriptions) or to decide which category new objects belong to (discriminant descriptions).

Table G2.3.1. Attributes and domains.

Attribute          Domain values
HeadShape (H)      Round, Square, Octagon
Body (B)           Round, Square, Octagon
Smiling (S)        Yes, No
Holding (Ho)       Sword, Balloon, Flag
JacketColor (J)    Red, Yellow, Green, Blue
Tie (T)            Yes, No

Consider the problem of learning discriminant robot descriptions in an environment using the attributes of table G2.3.1 (the attribute abbreviations used subsequently are given in parentheses; domain values are abbreviated by their initial letters).
The objective is to discover the following (unknown) robot description: HeadShape is Round and JacketColor is Red, or HeadShape is Square and Holding a Balloon. This problem is taken from the article by Wnek et al (1990). This robot world is well suited to this kind of experiment, since it is complex enough to allow comparative study, yet simple enough to be illustrated by the diagrammatic visualization method.

© 1997 IOP Publishing Ltd and Oxford University Press. Handbook of Evolutionary Computation, release 97/1.

The task is to learn the description while seeing only random samples of the category (positive examples) and random samples of the countercategory (negative examples). There are 3 × 3 × 2 × 3 × 4 × 2 = 432 different robots in this world, of which 84 belong to the category.

G2.3.1.2 Concept description language

The input language serves as an interface between the environment (teacher) and the system. It should minimize data inconsistency. The output language serves as an interface between the system and the application environment. It should maximize descriptive power. If both languages are the same, it is generally easier to describe, verify, and understand processing mechanisms.

One language suitable for both the input and the output is VL1 (for a brief description and further references see Michalski 1983). Variables (attributes) are the basic units and have multivalued domains. According to the relationship among the domain values, such domains may be of different types: nominal (unordered sets), linear (linearly ordered sets), or structured (partially ordered sets). Relations associate variables with their values by means of selectors (conditions) having the form [variable relation value], with the natural semantics. For the '=' relation, which we use here, the value may be a disjunction of domain values (internal disjunction). Conjunctions of selectors form complexes (rules).
A full description is given by a disjunction of complexes (a set of rules). Given our description language, the sought concept is [H=R][J=R] ∨ [H=S][Ho=B].

G2.3.1.3 System architecture

Our approach uses existing inductive operators as the means of search, is guided by Darwinian selective pressure and problem heuristics, and uses operators and evolutionary ideas to control the search (section G2.3.2). As a result, the overall system architecture closely resembles that of an AI production system. It is illustrated in figure G2.3.1. The population represents the current state of the search-space exploration. This state is modified by applications of Darwinian selection and the inductive operators, which are biased by problem heuristics (section G2.3.2.6).

Database: population
Production rules: inductive operators
Control: Darwinian selection, problem heuristics

Figure G2.3.1. System architecture.

G2.3.2 Design process

G2.3.2.1 Motivation

Supervised learning is characterized by relatively large search spaces. For our problem, there are 7 × 7 × 3 × 7 × 15 × 3 = 46 305 valid rules (there are $2^3 - 1 = 7$ valid selectors for HeadShape, one for each nonempty subset of its three values, and similarly for the other attributes). This gives about $10^{15\,000}$ different rule sets (i.e. the size of the power set of the set of rules; there are infinitely many if duplicate rules are not excluded). This means that no exhaustive method could be used. Whatever the specific objectives, supervised learning also exhibits multimodality. This means that no hill-climbing techniques could be used.

Two obvious approaches are to use heuristics to guide the search or to use evolutionary techniques to provide more robust search. Known procedural/heuristic methods include the rule-discovery AQ systems (Michalski et al 1986) and decision trees (Quinlan 1986). Evolutionary approaches include classifier systems (see e.g. Riolo 1988) and genetic algorithms (see e.g. Spears and De Jong 1990); see also sections B1.5.2 and B1.2 of this handbook. Our objective is to combine the two approaches in an evolutionary method guided by heuristics.
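The counts quoted above (432 robots, 84 of them in the category, and 46 305 valid rules) can be checked by brute-force enumeration. The following is a minimal sketch, not part of the original system; the names `DOMAINS` and `in_target_concept` are hypothetical, and the encoding of robots as dictionaries is an assumption made only for illustration.

```python
from itertools import product
from math import prod

# Attribute domains copied from table G2.3.1 (the dict layout is an assumption).
DOMAINS = {
    "HeadShape": ["Round", "Square", "Octagon"],
    "Body": ["Round", "Square", "Octagon"],
    "Smiling": ["Yes", "No"],
    "Holding": ["Sword", "Balloon", "Flag"],
    "JacketColor": ["Red", "Yellow", "Green", "Blue"],
    "Tie": ["Yes", "No"],
}


def in_target_concept(robot):
    """Sought concept: [H=R][J=R] v [H=S][Ho=B]."""
    return (
        robot["HeadShape"] == "Round" and robot["JacketColor"] == "Red"
    ) or (
        robot["HeadShape"] == "Square" and robot["Holding"] == "Balloon"
    )


# Every robot is one combination of attribute values.
robots = [dict(zip(DOMAINS, values)) for values in product(*DOMAINS.values())]
n_robots = len(robots)
n_positive = sum(in_target_concept(r) for r in robots)

# A selector for an attribute with k values is any nonempty subset of them,
# so there are 2**k - 1 selectors per attribute; a rule picks one per attribute.
n_rules = prod(2 ** len(values) - 1 for values in DOMAINS.values())

print(n_robots, n_positive, n_rules)  # 432 84 46305
```

The enumeration is feasible only because the robot world is tiny; the number of rule sets grows as the power set of the 46 305 rules, which is why exhaustive search over descriptions is ruled out.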
As heuristics, we use inductive operators and refinement bias. Michalski (1983) provides a detailed description of various inductive operators that constitute the process of inductive inference. In the language VL1, example operators include dropping a condition from a rule, adding an alternative rule, dropping a rule, extending a condition, and closing an interval in a linear condition. Each of these operators either generalizes or specializes existing knowledge. For example, dropping a condition generalizes a rule, while dropping a rule specializes a rule set. Moreover, the knowledge refinement that is needed depends heuristically on the current rules. For example, an overgeneralized rule (i.e. one covering some negative examples) should be further specialized (e.g. by adding conditions).

Given the operators, evolutionary control (a population and Darwinian selective pressure) is used to apply them. Application probabilities are further guided by the refinement bias, as described below.

G2.3.2.2 Requirements

Two requirements guide our approach: improving on the speed of existing genetic algorithm approaches and producing descriptions of low complexity.

Efficiency is addressed by data compilation (section G2.3.3.1), which speeds up evaluations, as well as by incorporating heuristics and by using the inductive operators to guide and conduct the search.

Description complexity affects human understanding of the generated knowledge. Following Wnek et al (1990), we define complexity as twice the number of rules plus the total number of conditions. Reducing complexity so defined also means that no redundant rules should be retained. However, such redundant rules are not removed by explicit mechanisms; Darwinian pressure coupled with a proper fitness function provides the implicit means.
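The complexity measure above is simple enough to state as code. This is a minimal sketch under an assumed encoding: a rule is a dictionary mapping an attribute to the set of values its condition admits, and a rule set is a list of such rules. The function name `complexity` and the example rule sets are hypothetical.

```python
# Sought concept [H=R][J=R] v [H=S][Ho=B] in the assumed encoding:
# one dict per rule, one entry per condition.
target = [
    {"HeadShape": {"Round"}, "JacketColor": {"Red"}},
    {"HeadShape": {"Square"}, "Holding": {"Balloon"}},
]


def complexity(rule_set):
    """Twice the number of rules plus the total number of conditions."""
    n_rules = len(rule_set)
    n_conditions = sum(len(rule) for rule in rule_set)
    return 2 * n_rules + n_conditions


# Adding a rule already implied by the first one changes nothing about
# which robots are covered, but it raises the complexity.
redundant = target + [
    {"HeadShape": {"Round"}, "JacketColor": {"Red"}, "Tie": {"Yes"}}
]

print(complexity(target))     # 8
print(complexity(redundant))  # 13
```

Because the redundant rule set covers exactly the same robots at a higher complexity, a fitness function that rewards low complexity favors the shorter description without any explicit rule-removal step.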
G2.3.2.3 Representation

Each chromosome is capable of representing an unrestricted number of rules. Each rule (complex) is represented by a number of conditions. Because the possible conditions are determined by the problem-specific language (such as the attributes of table G2.3.1), each rule can be restricted syntactically (section G2.3.3). With only two categories, a rule does not have to represent a category designation explicitly. Instead, it is assumed that the represented rule set describes the category (a concept), and that anything not covered by this description represents the negation of the concept. There are no preset restrictions on the number of rules per chromosome other than physical memory limitations.

G2.3.2.4 Fitness

Fitness must reflect the learning criteria. In supervised learning two criteria are typically used: description completeness (coverage of positive examples) and consistency (avoidance of covering negative examples). In an evolutionary algorithm, all objectives are usually combined to form a single fitness value. Accordingly, we define

$$\text{correctness} = \frac{w_1 \times \text{completeness} + w_2 \times \text{consistency}}{w_1 + w_2}$$

where the two coefficients $w_1$ and $w_2$ can be used to bias the search toward more complete or more consistent formulas. Completeness and consistency are defined in table G2.3.2, where $e^+$ ($e^-$) is the number of positive (negative) training examples currently covered by a rule, $\varepsilon^+$ ($\varepsilon^-$) is the number of such examples covered by a rule set, and $E^+$ ($E^-$) is the total number of such training examples. These two measures are meaningful only for rule sets and individual rules. For conditions, the measures of the parent rule are used. These definitions assume a full-memory model (all training examples are retained).

Table G2.3.2. Completeness and consistency measures.

Syntactic structure    Completeness           Consistency
A rule set             $\varepsilon^+/E^+$    $1 - \varepsilon^-/E^-$
A rule                 $e^+/\varepsilon^+$    $1 - e^-/\varepsilon^-$

One of our objectives is to produce descriptions of low complexity. This is necessary, for example, to promote the dropping of redundant rules. Suppose cost denotes a normalized measure of complexity. Then fitness is determined by

$$\text{fitness} = \text{correctness} \times \left(1 + w_3 (1 - \text{cost})\right)^{f} \qquad \text{(G2.3.1)}$$

where $w_3$ determines the influence of the cost, and the exponent $f$ grows very slowly on [0, 1] as the population ages. The effect of the very slowly rising $f$ is that the cost influence is initially very light, which promotes wider exploration of the space, and increases only at later stages in order to minimize complexity.

G2.3.2.5 Operators

There are three syntactic levels in the chromosomes: conditions, rules, and rule sets. Different operators apply to different levels. Moreover, some operators specialize the existing knowledge, while others generalize it.

Table G2.3.3. Genetic operators.

Knowledge refinement    Rule set level          Rule level                              Condition level
Generalization          Rules copy              Condition drop                          Reference extension
                        New event               Turning conjunction into disjunction
                        Rules generalization
Specialization          Rules drop              Condition introduce                     Reference restriction
                        Rules specialization    Rule-directed split
Independent             Rules exchange          Rule split

All the operators are listed in table G2.3.3. For a complete description, see the article by Janikow (1993). Here we illustrate two of them, using our descriptive language and the idea of diagrammatic visualization, which is an extension of the well-known Karnaugh maps (Wnek et al 1990). The illustrated operators are:
• rules generalization: this operator modifies one chromosome by selecting two random rules and replacing them with their most specific generalization (figure G2.3.2);
• reference extension: this operator modifies a single condition of a selected rule.
  If the condition is a restriction on a linear attribute, closing the reference interval has a higher probability (Michalski 1983). For nominal attributes, selected restrictions on the attribute are dropped. This operator is illustrated in figure G2.3.3.

[Diagram not reproduced. Recoverable axis labels: T (Y, N); J (R, Y, G, B); Ho (S, B, F); S (N, Y); B.]
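The fitness computation of section G2.3.2.4 and the two illustrated operators can be sketched on the robot world as follows. This is a hedged illustration, not the implementation described by Janikow (1993). Rules are assumed to be dictionaries from attribute to the set of admitted values; all function names and the default weights are hypothetical; `f` is taken to act as an exponent on the cost factor, which is an assumption of this sketch; and the most specific generalization is computed as the attribute-wise union of the two references, dropping any condition that then admits the whole domain.

```python
import random
from itertools import product

DOMAINS = {  # table G2.3.1
    "HeadShape": ["Round", "Square", "Octagon"],
    "Body": ["Round", "Square", "Octagon"],
    "Smiling": ["Yes", "No"],
    "Holding": ["Sword", "Balloon", "Flag"],
    "JacketColor": ["Red", "Yellow", "Green", "Blue"],
    "Tie": ["Yes", "No"],
}


def covers(rule, example):
    """A rule covers an example if every condition admits its value."""
    return all(example[attr] in allowed for attr, allowed in rule.items())


def rule_set_covers(rule_set, example):
    return any(covers(rule, example) for rule in rule_set)


def correctness(rule_set, positives, negatives, w1=1.0, w2=1.0):
    """Weighted mean of rule-set completeness and consistency."""
    eps_pos = sum(rule_set_covers(rule_set, e) for e in positives)
    eps_neg = sum(rule_set_covers(rule_set, e) for e in negatives)
    completeness = eps_pos / len(positives)
    consistency = 1 - eps_neg / len(negatives)
    return (w1 * completeness + w2 * consistency) / (w1 + w2)


def fitness(correct, cost, w3=0.1, f=1.0):
    """Equation (G2.3.1), reading f as an exponent (assumption)."""
    return correct * (1 + w3 * (1 - cost)) ** f


def rules_generalization(rule_a, rule_b, domains):
    """Most specific single rule covering everything both rules cover."""
    result = {}
    # An attribute constrained in only one rule is unconstrained in the result.
    for attr in rule_a.keys() & rule_b.keys():
        merged_reference = rule_a[attr] | rule_b[attr]
        if merged_reference != set(domains[attr]):
            result[attr] = merged_reference
    return result


def reference_extension(rule, attr, domains, rng):
    """Nominal case: admit one more randomly chosen value for attr."""
    missing = sorted(set(domains[attr]) - rule[attr])
    extended = {a: set(v) for a, v in rule.items()}
    if missing:
        extended[attr].add(rng.choice(missing))
    if extended[attr] == set(domains[attr]):
        del extended[attr]  # the condition no longer restricts anything
    return extended


robots = [dict(zip(DOMAINS, vals)) for vals in product(*DOMAINS.values())]
target = [
    {"HeadShape": {"Round"}, "JacketColor": {"Red"}},
    {"HeadShape": {"Square"}, "Holding": {"Balloon"}},
]
positives = [r for r in robots if rule_set_covers(target, r)]
negatives = [r for r in robots if not rule_set_covers(target, r)]

merged = rules_generalization(target[0], target[1], DOMAINS)
extended = reference_extension(target[0], "JacketColor", DOMAINS, random.Random(0))

print(correctness(target, positives, negatives))    # 1.0
print(merged)                                       # only HeadShape remains
print(correctness([merged], positives, negatives))  # lower: negatives covered
```

Here the whole robot world is used as training data, so the sought concept scores a correctness of 1. Generalizing its two rules into one keeps completeness at 1 but covers negative examples, which lowers consistency and illustrates why an overgeneralized rule needs subsequent specialization.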