Rough Set Theory for Discretization Based on Boolean Reasoning and Genetic Algorithm
Authors
Abstract
Real-world datasets often contain continuous attributes. Some data analysis algorithms work only on discrete data, and many others work more efficiently on it, so continuous datasets are discretized as a preprocessing step to knowledge acquisition. Attribute discretization is the process of reducing the domain of a continuous attribute with an irreducible and optimal set of cuts, while preserving the consistency of the dataset classification. In this paper, we use the discernibility relations of Rough Set Theory (RST) and propose a 2-step discretization process, in which the set of cuts returned by the MD-heuristics approach is further reduced using a Genetic Algorithm (GA). Experiments on datasets from the UCI Machine Learning Repository show that the proposed discretization process is efficient in finding a consistent and irreducible set of cuts.
Keywords: Discretization, Genetic Algorithm, Rough Set Theory
http://www.ijccr.com VOLUME 2 ISSUE 1 JANUARY 2012
1. Introduction
Any real-world data can be represented and stored in the form of an information table, also known as a decision table. The rows of the decision table, called objects or examples, make up the knowledge and are described by a set of properties called attributes. Analysis of a decision table, to extract patterns and to classify objects, is an important task in data mining, knowledge discovery, decision analysis, machine learning and pattern recognition. Data analysis algorithms perform better on decision tables with continuous attributes when the attribute domains are small, since some algorithms work only on discrete data and others work more efficiently on it. The process of reducing the domain of continuous attributes is called discretization, and is achieved by replacing the domain of each continuous attribute with a finite number of discrete intervals. Various discretization approaches have been described in the literature [2], [3], [8], [6].
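As a small illustration of what replacing a continuous domain with a finite number of intervals means in practice, the following Python sketch maps values to interval indices for a given list of cut points. The function name `discretize`, the sample values and the labelling of intervals by 0..k are assumptions made for this example only; they are not taken from the paper.

```python
from bisect import bisect_right

def discretize(values, cuts):
    """Map each continuous value to the index of the interval it falls in.

    With sorted cuts c_1 < ... < c_k the intervals are
    (-inf, c_1), [c_1, c_2), ..., [c_k, +inf), labelled 0..k
    (an illustrative labelling convention).
    """
    cuts = sorted(cuts)
    # bisect_right counts how many cuts are <= v, which is the interval index.
    return [bisect_right(cuts, v) for v in values]

# Hypothetical attribute values and cut points.
codes = discretize([0.8, 1.0, 1.3, 1.4, 1.6], cuts=[1.15, 1.5])
print(codes)
```

Two cuts reduce the five distinct values to three interval labels, which is the kind of domain reduction the discretization step aims for.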
Discretization of continuous attributes is shown to be an NP-hard problem in [9], [10], [7], by characterizing the computational complexity of the problem in terms of RST discernibility relations. Heuristic discretization methods based on RST discernibility relations and Boolean reasoning, which give a suboptimal solution, are well studied in [16], [12]. We propose a 2-step discretization approach using RST discernibility relations, the MD-heuristics approach and a Genetic Algorithm. Discretization can also be achieved using GA alone, but for large datasets the search space is huge and finding an optimal set of cuts takes more time [1]. We therefore reduce the GA search space by considering only the set of cuts returned by MD-heuristics. In this discretization process, the superfluous cuts from MD-heuristics are removed using GA, so that a consistent and irreducible set of cuts is identified in less time.
The rest of the paper is organized as follows: Section 2 introduces the basic concepts of RST; Section 3 describes the concepts of discretization, the discernibility matrix approach and the MD-heuristics approach; Section 4 describes the proposed 2-step discretization approach; Section 5 describes experimental results on different datasets; and Section 6 concludes the paper.
2. Rough Sets
RST, introduced by Pawlak in 1982 [17], [14], [13], is a mathematical methodology for data analysis that handles uncertain information in datasets. RST supports challenging tasks such as attribute reduction, attribute discretization, identifying patterns in data, computation of attribute relevance and dataset characterization. Applications of RST include machine learning, data mining, decision analysis, pattern recognition and knowledge discovery [11].
A decision table is defined as a 5-tuple $DT = (U, C, d, V, f)$, where
$U$ is the universe of all objects $\{x_1, x_2, x_3, \ldots\}$,
$C$ is the set of all conditional attributes $\{a_1, a_2, a_3, \ldots\}$,
$d$ is the decision attribute, $d \notin C$, which decides the class of an object; let $d_1, d_2, d_3, \ldots$ be the distinct decisions in $d$,
$V = \bigcup_{a \in C} V_a$, where $V_a$ is the domain of the conditional attribute $a$,
$f : U \times C \to V$ is a mapping function, where $f(x, a)$ is the value of object $x$ on attribute $a$ in the domain $V_a$.
The principal idea of RST is the indiscernibility relation $IND(B)$. Object $x_i$ is said to be indiscernible from object $x_j$ if they have the same values for all attributes in $B$. $IND(B)$ is an equivalence relation, as it is reflexive, symmetric and transitive.
$$IND(B) = \{(x_i, x_j) \in U \times U \mid f(x_i, a) = f(x_j, a),\ \forall a \in B\},\quad B \subseteq C$$
The equivalence class of an object $x_i$, $[x_i]_{IND(B)}$, is defined as the set of objects that are indiscernible from $x_i$. The decision class of a decision $d_i$ is defined as the set of all objects with $d_i$ as their decision, and is denoted by $X_i$:
$$X_i = \{x_j \in U \mid f(x_j, d) = d_i\}$$
The set of all decision classes, $\mathcal{X}$, partitions $U$:
$$\mathcal{X} = \{X_1, X_2, X_3, \ldots\},\quad U = \bigcup_i X_i$$
The rough set of a decision class $X_i$ on any subset of conditional attributes $B \subseteq C$ is defined by a lower approximation and an upper approximation. The lower approximation of $X_i$ on $B$ is the set of objects that certainly belong to decision class $X_i$. The upper approximation of $X_i$ on $B$ is the set of objects that may belong to decision class $X_i$.
$$\underline{apr}_B(X_i) = \bigcup \{[x]_{IND(B)} \mid [x]_{IND(B)} \subseteq X_i\}$$
$$\overline{apr}_B(X_i) = \bigcup \{[x]_{IND(B)} \mid [x]_{IND(B)} \cap X_i \neq \emptyset\}$$
Consistency of a decision table is defined in terms of a generalized decision function. The generalized decision function of an object $x$ on a set of conditional attributes $B \subseteq C$ is defined as the set of decisions of all the objects in the equivalence class of $x$:
$$\partial_B : U \to 2^{V_d} \quad\text{and}\quad \partial_B(x) = \{f(x', d) \mid x' \in [x]_{IND(B)}\},\quad\text{where } V_d = f(U, d) = \{d_1, d_2, d_3, \ldots\}$$
A decision table is said to be consistent if the cardinality of $\partial_C(x)$ is 1 for all the objects in $U$:
$$DT = \begin{cases} \text{consistent}, & |\partial_C(x)| = 1,\ \forall x \in U \\ \text{inconsistent}, & \text{otherwise} \end{cases}$$
3. Discretization of continuous attributes
Let $DT$ be a consistent decision table and let $V_a = [l_a, r_a) \subset \mathbb{R}$, where $a \in C$, $\mathbb{R}$ is the set of real numbers and $l_a < r_a$. Any pair $(a, v)$ is called a cut on $V_a$, where $v \in V_a$.
Definition 1 [12]. The set of basic cuts on an attribute $a \in C$, denoted by $BC_a$, is defined as
$$BC_a = \left\{ \left(a, \frac{v_1^a + v_2^a}{2}\right), \left(a, \frac{v_2^a + v_3^a}{2}\right), \ldots, \left(a, \frac{v_{n_a - 1}^a + v_{n_a}^a}{2}\right) \right\}$$
where $v_1^a < v_2^a < \cdots < v_{n_a}^a$ is the sequence of continuous values defined by $a$ and $f(U, a) = \{v_1^a, v_2^a, \ldots, v_{n_a}^a\}$. Let $BC$ be the set of all basic cuts defined on all conditional attributes:
$$BC = \bigcup_{a \in C} BC_a$$
Definition 2 [12]. A new decision system, the $P$-discretization of $DT$, is defined as a 6-tuple $DT^P = (U, C, d, P, V^P, f^P)$, where $P$ is the set of cuts
$$P = \{\hat{p}_a \mid \hat{p}_a = \{p_1^a, p_2^a, \ldots, p_k^a\},\ p_1^a < p_2^a < \cdots < p_k^a\},\quad a \in C$$
$$f^P(x, a) = \begin{cases} 0, & f(x, a) < p_1^a \\ i, & f(x, a) \in [p_i^a, p_{i+1}^a),\ 1 \le i \le k - 1 \\ k + 1, & f(x, a) > p_k^a \end{cases}$$
The quality of the $P$-discretization of $DT$ is defined as the ratio of the number of objects in the lower approximation to the total number of objects in $U$:
$$\gamma = \frac{|\underline{apr}^P(\mathcal{X})|}{|U|},\quad\text{where } \underline{apr}^P(\mathcal{X}) = \bigcup_{X_i \in \mathcal{X}} \underline{apr}_C(X_i)$$
Definition 3 [12]. A set of cuts $P$ is called $DT$-consistent if $\partial = \partial^P$, where $\partial$ and $\partial^P$ are the generalized decision functions of $DT$ and $DT^P$ respectively.
Definition 4 [12]. A set of cuts $P$ is called $DT$-irreducible if $P$ is $DT$-consistent and no $P' \subset P$ is $DT$-consistent.
Definition 5 [12]. A set of cuts $P$ is called $DT$-optimal if $P$ is $DT$-consistent and $|P| \le |P'|$ for any $DT$-consistent set of cuts $P'$.
Let $DT^* = (U^*, BC, f^*)$ be an information table, where $U^*$ is the set of pairs $(i, j)$ such that $i < j$ and $f(x_i, d) \neq f(x_j, d)$, $BC$ is the set of all basic cuts on $DT$, and $f^*$ is a mapping function $f^* : U^* \times BC \to \{0, 1\}$,
$$f^*((i, j), p_k^a) = \begin{cases} 1, & f(x_i, a) < p_k^a \le f(x_j, a) \ \text{or}\ f(x_j, a) < p_k^a \le f(x_i, a) \\ 0, & \text{otherwise} \end{cases}$$
where $a \in C$ and $p_k^a \in BC_a$. The sets of $DT$-consistent, $DT$-irreducible and $DT$-optimal cuts can be generated from $DT^*$.
3.1 Discernibility matrix approach to discretization
The discernibility matrix was introduced by Skowron and Rauszer [15]. Let $M = (U^*, I^*)$ be a discernibility matrix.
The rows of the matrix, $U^*$, are the pairs $(i, j)$ such that $i < j$ and $f(x_i, d) \neq f(x_j, d)$; the columns of the matrix, $I^*$, are the intervals $[v_k^a, v_{k+1}^a)$, for all $a \in C$ and $1 \le k < |f(U, a)|$. An element of the matrix is defined as
$$M\big((i, j), [v_k^a, v_{k+1}^a)\big) = \begin{cases} 1, & [v_k^a, v_{k+1}^a) \subseteq \big[\min(f(x_i, a), f(x_j, a)),\ \max(f(x_i, a), f(x_j, a))\big) \\ 0, & \text{otherwise} \end{cases}$$
The discernibility function $\Phi$ is defined as
$$\Phi(M) = \bigwedge_{(i, j) \in U^*} \bigvee \big\{ [v_k^a, v_{k+1}^a) \mid M\big((i, j), [v_k^a, v_{k+1}^a)\big) = 1 \big\}$$
A prime implicant of $\Phi(M)$ is of the form $\{[v_{k_1}^{a_1}, v_{k_1+1}^{a_1}), \ldots, [v_{k_r}^{a_r}, v_{k_r+1}^{a_r})\}$. Each prime implicant defines a set of cuts of the form
$$P = \left\{ \left(a_1, \frac{v_{k_1}^{a_1} + v_{k_1+1}^{a_1}}{2}\right), \ldots, \left(a_r, \frac{v_{k_r}^{a_r} + v_{k_r+1}^{a_r}}{2}\right) \right\}$$
Each such $P$ is a $DT$-consistent and $DT$-irreducible set of cuts, and the smallest of all such $P$'s is the $DT$-optimal set of cuts. Searching for a $DT$-optimal set of cuts is an NP-hard problem, so efficient heuristics are needed to identify a good set of cuts. The next section describes MD-heuristics, a heuristic approach to finding a reasonable set of cuts.
3.2 MD-heuristics approach to discretization
Using MD-heuristics, the best set of cuts can be found in $O(|\ldots||\ldots|)$ steps, with $O(|\ldots||\ldots|)$ memory usage [12]. Consider the information table $DT^*$. In this approach, the column of $DT^*$ with the maximum number of 1's is added to the set of cuts; that column is then deleted from $DT^*$, together with all rows that have a 1 in that column. This process is repeated until $DT^*$ is empty. The set of cuts obtained in this way might not be a $DT$-optimal set of cuts: it may include superfluous cuts, which have to be removed. The next section describes a 2-step discretization approach that generates a $DT$-consistent and $DT$-irreducible set of cuts. This approach is based on the set of cuts returned by MD-heuristics and on a Genetic Algorithm.
4. 2-step Discretization approach
In this approach, the superfluous cuts generated by MD-heuristics are removed using GA. The 2-step discretization approach is shown in Figure 1.
Fig. 1. 2-step discretization approach
4.1 Genetic Algorithm
GA provides a methodology to solve optimization problems.
GA is motivated by biological evolution [5] and searches a huge search space stochastically. The proposed GA starts with the best set of cuts $P$ generated by the MD-heuristics approach; superfluous cuts in $P$ are identified and discarded through the GA iterations, and finally a $DT$-consistent and $DT$-irreducible set of cuts is generated. If there are no superfluous cuts in $P$, GA returns the same set of cuts $P$.
4.1.1 Representation
Let $P$ be the set of cuts returned by the MD-heuristics approach. Each candidate set of cuts is represented as a chromosome by a bit string of length $|P|$. A 1 at position $i$ of the bit string means that the cut $p_i$ is present in the candidate set of cuts, and a 0 at position $i$ means that the cut $p_i$ is not present in the candidate set of cuts.
4.1.2 Initial Population
The initial population includes the chromosome representation of $P$, i.e., a bit string of all 1's of length $|P|$, along with randomly generated chromosomes. The population size is set to 20.
4.1.3 Fitness Function
The fitness function is designed to discard any superfluous cuts from $P$ while remaining $DT$-consistent. Consider the information system $DT^*$. The fitness of a candidate set of cuts $P'$ is denoted by $F(ch^{P'})$:
$$F(ch^{P'}) = \begin{cases} \dfrac{|BC| - |P'|}{|BC|}, & \bigwedge_{(i, j) \in U^*} \bigvee_{p \in P'} f^*((i, j), p) = 1 \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$
where $BC$ is the basic set of cuts and $|P'|$ is the total number of 1's in the chromosome $ch^{P'}$.
4.1.4 Proof of Fitness Function
The correctness of the fitness function for attribute discretization, defined in Equation 1, is proved in Theorem 1.
Theorem 1. The candidate set of cuts with maximum fitness value is $DT$-consistent and $DT$-irreducible.
Proof. A chromosome gets a positive fitness value only if the set of cuts it defines is $DT$-consistent; otherwise its fitness value is zero. Let $P'_1$ and $P'_2$ be two $DT$-consistent chromosomes with fitness values $F(ch^{P'_1})$ and $F(ch^{P'_2})$ respectively, and let $F(ch^{P'_1}) < F(ch^{P'_2})$. Then
$$\frac{|BC| - |P'_1|}{|BC|} < \frac{|BC| - |P'_2|}{|BC|} \ \Rightarrow\ |P'_1| > |P'_2|$$
so $P'_2$ has fewer cuts and a greater fitness value than $P'_1$. If the candidate with maximum fitness value contained a superfluous cut, removing that cut would give a $DT$-consistent candidate with fewer cuts and hence a strictly greater fitness value, which is a contradiction. Therefore the candidate with maximum fitness value is $DT$-consistent and has no superfluous cuts, i.e., it is $DT$-irreducible.
4.1.5 Algorithm
The GA with the proposed fitness function is shown in Algorithm 1. The algorithm is run with the GA parameter settings shown in Table 1.
Table 1. GA parameter settings
Parameter name        Parameter value
Population size       20
Population type       Bit string
Creation function     Uniform
Scaling function      Rank
Selection function    Stochastic uniform
Elite count           2
Crossover fraction    0.8
Crossover function    Scattered
Mutation rate         0.01
Stopping criteria     Avg. change in fitness value < 10^[…]
Algorithm 1: Algorithm to find a $DT$-consistent and $DT$-irreducible set of cuts using GA
Input: Population type, population size, creation function, scaling function, selection function, elite count, crossover rate, fitness function, crossover function, mutation function, iteration count
Output: $DT$-consistent and $DT$-irreducible set of cuts
Begin
1: The search space of GA consists of $2^n - 1$ possible candidate sets of cuts, where $n$ is the number of cuts returned by MD-heuristics.
2: Include the set of cuts returned by MD-heuristics in the initial population.
3: Use the uniform creation function to generate the rest of the initial population, up to the given population size.
4: Evaluate the fitness value of each candidate set of cuts using the fitness function given in Equation (1).
5: Sort the candidate sets of cuts by fitness value. A rank is assigned to each candidate based on its position in the sorted list.
6: Repeat steps 7-10 until the average change in the fitness value is less than 10^[…].
7: Generate offspring using the scattered crossover function with a crossover rate of 0.8. In scattered crossover, a random binary vector is created; offspring are generated by taking genes from the first parent where the binary vector is 1 and genes from the second parent where the binary vector is 0. Include these offspring in the next generation.
8: Apply the mutation operator to the candidates in the next generation, with a mutation rate of 0.01.
9: Evaluate the fitness value of each candidate set of cuts in the next generation.
10: The elite count is 2, so the top 2 fittest candidates are guaranteed to survive into the next generation.
11: Output the candidate set of cuts with the maximum fitness value in the current population.
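The two steps described above can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy decision table, the rank-weighted parent selection and the fixed generation count (used here in place of the fitness-change stopping rule) are all choices made for the example. The sketch also assumes a consistent table and that the greedy step returns fewer cuts than the basic set, so that the all-ones chromosome has positive fitness.

```python
import random
from itertools import combinations

def basic_cuts(rows):
    """Midpoints between consecutive distinct values of each attribute."""
    cuts = []
    for a in range(len(rows[0][0])):
        vals = sorted({x[a] for x, _ in rows})
        cuts += [(a, (lo + hi) / 2) for lo, hi in zip(vals, vals[1:])]
    return cuts

def discernibility_table(rows, cuts):
    """One row per pair of objects with different decisions, stored as the
    set of cut indices that separate the pair."""
    table = []
    for (x, dx), (y, dy) in combinations(rows, 2):
        if dx != dy:
            table.append({k for k, (a, c) in enumerate(cuts)
                          if min(x[a], y[a]) < c <= max(x[a], y[a])})
    return table

def md_heuristic(table):
    """Greedy step: pick the cut separating most remaining pairs, drop those pairs."""
    remaining, chosen = list(table), []
    while remaining:
        counts = {}
        for row in remaining:
            for k in row:
                counts[k] = counts.get(k, 0) + 1
        if not counts:
            raise ValueError("inconsistent table: some pair cannot be separated")
        best = max(sorted(counts), key=counts.get)  # ties go to the lowest cut index
        chosen.append(best)
        remaining = [row for row in remaining if best not in row]
    return chosen

def fitness(bits, P, table, n_basic):
    """Equation (1): (|BC| - |P'|) / |BC| if P' separates every pair, else 0."""
    selected = {k for k, b in zip(P, bits) if b}
    if all(row & selected for row in table):
        return (n_basic - len(selected)) / n_basic
    return 0.0

def ga_reduce(P, table, n_basic, pop_size=20, elite=2, p_cross=0.8,
              p_mut=0.01, generations=50, seed=0):
    """Simplified GA: rank-weighted selection, scattered crossover, bit-flip mutation.
    The fixed generation count is an assumption replacing the stopping rule."""
    rng = random.Random(seed)
    fit = lambda ch: fitness(ch, P, table, n_basic)
    pop = [[1] * len(P)] + [[rng.randint(0, 1) for _ in P] for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=fit, reverse=True)
        nxt = [ch[:] for ch in pop[:elite]]            # elites survive unchanged
        weights = [pop_size - r for r in range(pop_size)]
        while len(nxt) < pop_size:
            p1, p2 = rng.choices(pop, weights=weights, k=2)
            if rng.random() < p_cross:
                mask = [rng.randint(0, 1) for _ in P]  # scattered crossover mask
                child = [g1 if m else g2 for m, g1, g2 in zip(mask, p1, p2)]
            else:
                child = p1[:]
            nxt.append([1 - g if rng.random() < p_mut else g for g in child])
        pop = nxt
    best = max(pop, key=fit)
    return [k for k, b in zip(P, best) if b]

# Hypothetical toy decision table: ((attribute values), decision).
rows = [((0.8, 2.0), 1), ((1.0, 0.5), 0), ((1.3, 3.0), 0), ((1.4, 1.0), 1),
        ((1.4, 2.0), 0), ((1.6, 3.0), 1), ((1.3, 1.0), 1)]
cuts = basic_cuts(rows)
table = discernibility_table(rows, cuts)
P = md_heuristic(table)
reduced = ga_reduce(P, table, len(cuts))
print([cuts[k] for k in P], "->", [cuts[k] for k in reduced])
```

On this toy table the greedy step returns four cuts, and the cut between 1.3 and 1.4 on the first attribute is superfluous: the other three already separate every pair of objects with different decisions. Because the elites are copied unchanged, the returned set is always consistent and never larger than the greedy result; the GA is expected, though not guaranteed, to drop the superfluous cut.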
Similar resources
Discovering Stock Price Prediction Rules Using Rough Sets
The use of computational intelligence systems such as neural networks, fuzzy set, genetic algorithms, etc. for stock market predictions has been widely established. This paper presents a generic stock pricing prediction model based on rough set approach. To increase the efficiency of the prediction process, rough sets with Boolean reasoning discretization algorithm is used to discretize the dat...
A Rough Set-based Knowledge Discovery Process
The knowledge discovery from real-life databases is a multi-phase process consisting of numerous steps, including attribute selection, discretization of realvalued attributes, and rule induction. In the paper, we discuss a rule discovery process that is based on rough set theory. The core of the process is a soft hybrid induction system called the Generalized Distribution Table and Rough Set Sy...
A Generic Scheme for Generating Prediction Rules Using Rough Set
This chapter presents a generic scheme for generating prediction rules based on rough set approach for stock market prediction. To increase the efficiency of the prediction process, rough sets with Boolean reasoning discretization algorithm is used to discretize the data. Rough set reduction technique is applied to find all the reducts of the data, which contains the minimal subset of attribute...
Approximate Boolean Reasoning Approach to Rough Sets and Data Mining
Many problems in rough set theory have been successfully solved by boolean reasoning (BR) approach. The disadvantage of this elegant methodology is based on its high space and time complexity. In this paper we present a modified BR approach that can overcome those difficulties. This methodology is called the approximate boolean reasoning (ABR) approach. We summarize some most recent application...
A hybrid filter-based feature selection method via hesitant fuzzy and rough sets concepts
High dimensional microarray datasets are difficult to classify since they have many features with small number of instances and imbalanced distribution of classes. This paper proposes a filter-based feature selection method to improve the classification performance of microarray datasets by selecting the significant features. Combining the concepts of rough sets, weighted rough set, fuzzy rough se...
A Novel Recommender System Based on Fuzzy Set and Rough Set Theory
Recommender System is an effective means of handling information overload and can provide personalized service as a useful information tool in e-commerce. In this paper, a novel automatic recommender system is proposed based on fuzzy c-means algorithm and rough set theory, including three main steps: data discretization, rules establishing and fuzzy reasoning. A method for fitting the results o...