"Transparent but accurate evolutionary regression combining new linguistic fuzzy grammar and a novel interpretable linear extension".

Carmen Biedma-Rdguez, Augusto Anguita-Ruiz, Rafael Alcalá, Jesús Alcalá-Fdez and María José Gacto

Summary:

Preliminaries: Interpretability measures
Experimental Study
Analysis on some example linguistic models
Comparing to some state-of-the-art general purpose "accuracy oriented" methods

In the following sections, the complementary materials (datasets and detailed results) of the referred paper can be downloaded.

Although a few recent proposals address the design of interpretable models for classification problems, they are not directly applicable to regression problems, where continuous complex surfaces are difficult to model with only a few rules that aim to separate the different values of the output variable. This is why our proposal instead aims to find and separate "tendencies", since they better capture the continuous nature of regression problems. To our knowledge, there are no recent proposals for obtaining interpretable and really simple FRBS models (only a few rules) for regression.

A. Preliminaries: Interpretability measures considered to ensure the model comprehensibility

This part complements the analogous section in the paper by providing a short description of the well-known metrics considered to ensure clear linguistic semantics. In the following two subsections, we briefly introduce the Gm3m and Rmi indexes, which assess the preservation of the initial linguistic concepts (proximity to the fully interpretable strong equidistributed fuzzy partition) and the consistency of the rules (absence of contradiction), respectively.

A.1. Gm3m index for Semantic Interpretability at the DB level

Formally, the Gm3m index (M. J. Gacto, et al. (2010) IEEE Transactions on Fuzzy Systems: Integration of an Index to Preserve the Semantic Interpretability in the Multiobjective Evolutionary Rule Selection and Tuning of Linguistic Fuzzy Systems) is defined as the geometric mean of three complementary metrics that quantify the proximity between a given membership function and the function initially defined for the associated linguistic term (Equation I). The three metrics account, respectively, for the relative displacement of the membership function (δ), the differences in the relative symmetry of the slopes (γ) and the differences in area (ρ) between both membership functions.

These metrics were defined to measure interpretability when the original definitions of the membership functions need to be modified, which is essential for learning accurate and trustworthy models. The geometric mean is used so that, if any one of the metrics takes a very small value (low interpretability), Gm3m also takes a small value. Gm3m takes values in the range [0,1], with 0 being the lowest level of interpretability and 1 the highest.
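The effect of the geometric mean described above can be illustrated with a minimal sketch. It assumes that each of the three metrics (δ, γ, ρ) has already been computed and lies in [0,1]; the function name gm3m and the sample values are hypothetical and are not taken from the paper.

```python
def gm3m(delta: float, gamma: float, rho: float) -> float:
    """Geometric mean of the three metrics (each assumed to lie in [0, 1])."""
    for name, value in (("delta", delta), ("gamma", gamma), ("rho", rho)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (delta * gamma * rho) ** (1.0 / 3.0)


# Hypothetical values: three balanced metrics versus one very low metric.
balanced = gm3m(0.8, 0.8, 0.8)   # close to 0.8
one_low = gm3m(1.0, 1.0, 0.1)    # pulled down strongly by the single low metric
```

With these hypothetical inputs, an arithmetic mean of (1.0, 1.0, 0.1) would be 0.7, whereas the geometric mean stays below 0.5, which is the penalising behaviour the index relies on.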

The complete description, formulation and some examples of how to compute these metrics can be found in (M. J. Gacto, et al. (2010) IEEE Transactions on Fuzzy Systems: Integration of an Index to Preserve the Semantic Interpretability in the Multiobjective Evolutionary Rule Selection and Tuning of Linguistic Fuzzy Systems) (for the commonly used triangular membership functions) and (M. Galende, et al. (2014) Information Sciences: Comparison and Design of Interpretable Linguistic vs. Scatter FRBSs: GM3M Generalization and New Rule Meaning Index (RMI) for Global Assessment and Local Pseudo-Linguistic Representation) (extension to other types of membership functions, based on considering core and slopes separately for each metric).

A.2. Rmi index for Semantic Interpretability at the RB level

Rmi (Rule Meaning Index) (M. Galende, et al. (2014) Information Sciences: Comparison and Design of Interpretable Linguistic vs. Scatter FRBSs: GM3M Generalization and New Rule Meaning Index (RMI) for Global Assessment and Local Pseudo-Linguistic Representation) is a semantic interpretability index at the RB level that can be used together with Gm3m to optimize and compare the semantic interpretability of different linguistic FRBSs. This index assesses whether the outputs of the linguistic model are the same as those described by its rules in their respective activation zones (and more particularly in their cores or, to a certain degree, their α-cuts).

For a given linguistic FRBS, Rmi is computed as the worst case over the individual values Rmi(R_i) of each rule R_i in the whole RB. Individually, i.e. for each rule, the goal of Rmi(R_i) is to evaluate the degree of reliability of the rule R_i with respect to the global output that the whole model would infer in the activation zone of this rule (in its core or, to a certain degree, its α-cut). Therefore, this index also takes into account the particular inference system used by the FRBS through the inferred output (which is important, since it could also affect the semantic interpretability of the RB).

The way to calculate each Rmi(R_i) is as follows:

1. Define an FRBS input as the n-dimensional fuzzy set given by the cores, or the cores of the α-cuts, of the membership functions in the n antecedents of R_i. In this contribution we directly consider the cores (α=1.0).
2. Estimate the output for the input generated in the previous step by inferring with the whole FRBS.
3. Compute the Rmi(R_i) value as the matching between the estimated output and the consequent membership function of R_i, i.e. measure how different the system output and the local R_i output are.

Rmi is defined in the range [0,1], where 0 indicates the lowest level of reliability and 1 the highest. The complete description, formulation and some examples of how to compute Rmi can be found in (M. Galende, et al. (2014) Information Sciences: Comparison and Design of Interpretable Linguistic vs. Scatter FRBSs: GM3M Generalization and New Rule Meaning Index (RMI) for Global Assessment and Local Pseudo-Linguistic Representation).
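The aggregation step ("worst case" over the rules) can be sketched as follows. This only illustrates how per-rule values combine into the global index; it does not compute the matching that produces each Rmi(R_i), for which the cited reference should be consulted. The function name rmi_global and the sample values are hypothetical.

```python
def rmi_global(rule_values):
    """Global Rmi as the worst (minimum) per-rule value, each in [0, 1]."""
    values = list(rule_values)
    if not values:
        raise ValueError("at least one rule value is required")
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"Rmi(R_i) must be in [0, 1], got {value}")
    # Since 1 is the highest reliability, the worst case is the minimum.
    return min(values)


# Hypothetical per-rule values for a five-rule RB.
example = rmi_global([1.0, 1.0, 0.9, 1.0, 1.0])
```

Taking the minimum means that a single unreliable rule is enough to lower the index of the whole RB.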

B. Experimental Study

This section includes the experimental study on the proposed method. The experimentation is undertaken with 23 real-world datasets, with a number of variables within the interval [2, 60] and a number of examples within the interval [43, 4177]. In all the experiments, a 5-fold cross-validation model (5fcv) has been adopted, i.e., each dataset has been split randomly into 5 folds, each one containing 20% of the patterns of the dataset. In each run, four folds are used for training and one for testing. The properties of these datasets are presented in Table I: name of the dataset (NAME), short name or acronym of the dataset (ACRO), number of variables (VAR), and number of examples (CASES). You may download all datasets in the KEEL format by clicking here.
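The 5fcv scheme described above can be sketched in a few lines. This is only an illustration of the partitioning idea: the function name, the seed and the shuffling procedure are assumptions, and the downloadable KEEL partitions are the ones actually used in the study.

```python
import random


def five_fold_indices(n_examples, n_folds=5, seed=0):
    """Yield (train, test) index lists for a random k-fold partition."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into folds of (almost) equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for test_id in range(n_folds):
        test = folds[test_id]
        train = [idx for j, fold in enumerate(folds) if j != test_id
                 for idx in fold]
        yield train, test


# Example with the size of the smallest dataset in Table I (DIA, 43 cases).
splits = list(five_fold_indices(43))
```

Each example appears in exactly one test fold, so every pattern is used four times for training and once for testing.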

These datasets have been downloaded from the following web pages:

Table I: Properties of the datasets

NAME	ACRO	VAR	CASES
Abalone	ABA	8	4177
Anacalt	ANA	7	4052
Baseball	BAS	16	337
Boston housing	BOS	13	506
Diabetes	DIA	2	43
Machine CPU	CPU	6	209
Electrical Maintenance	ELE	4	1056
Body fat	FAT	14	252
Forest Fires	FOR	12	517
Friedman	FRI	5	1200
Mortgage	MOR	15	1049
Auto Mpg 6	MPG6	5	392
Auto Mpg 8	MPG8	7	392
AutoPrice	PRI	15	159
Quake	QUA	3	2178
Stocks domain	STP	9	950
Strike	STR	6	625
Treasury	TRE	15	1049
Triazines	TRI	60	186
Weather Ankara	WAN	9	1609
Weather Izmir	WIZ	9	1461
Wisconsin Breast Cancer	WBC	32	194
Yacht Hydrodynamics	YH	6	308

C. Analysis on some example linguistic models

In this section, we include some representative examples of the linguistic models obtained in two of the benchmark problems used for comparison: WAN (Weather in Ankara) and WBC (Wisconsin Breast Cancer). Figures 2 and 3 depict both models in order to demonstrate not only the accuracy of the method but also the simplicity and readability of the rules obtained. Variables in these figures follow the order of the splits in the tree generated when learning the rules. In this way, each split can be read as a path through the different divisions of the data, from more general to more specific.

We have used colors to ease the recognition of the different cases represented by the rules (same color per variable and split). Grey texts are included only to provide additional information; this information is not part of the proposed rule structure and is therefore not needed for inference or for understanding. The same holds for the percentage of covered instances and the Gm3m and Rmi values, since they are purely informative about the semantic quality of each partition and rule, respectively. As previously explained, Rmi goes from 1.0 (what a single rule affirms in its main covering region is equal to what the model produces) to 0.0 (what a single rule affirms is completely different from what the model produces). In general, almost all the rules have Rmi equal to 1.0, which indicates (together with the high Gm3m values) that these rules do not interfere significantly with one another, thus preserving the locality of each rule. Finally, please take into account that our initial linguistic partitions are strong ones (which are accepted in the specialized literature as highly interpretable), and that Gm3m values near 0.8 indicate that their meanings are preserved to a high degree (see an example in Figure 1 and the Gm3m values reported in Figure 2). Again, we show the definition points of the membership functions only as additional information, instead of including only the linguistic terms, because the expert in our real case study (children obesity problem) asked us about these numbers after analyzing the rules in order to check the approximate division values, so we think they could be of interest to an expert in any of the problems. Please skip these numbers if you are not an expert on the given problem, and remember that the corresponding linguistic terms come from a partition that is very close to a strong linguistic partition.

Figure 1: Example of linguistic partition with Gm3m equal to 0.81 (blue), with respect to the corresponding strong fuzzy partition (grey)

Figure 2: KB obtained with the method proposed in the WAN dataset. MSETst obtained is 1.565

Figure 2 shows the DB and RB obtained for the WAN dataset (estimation of average temperature from measured climate factors), for which the accuracy obtained (MSETst) is 1.565. The first division (by MinTemp) yields three different situations depending on the minimum temperature values (colder, medium and hottest situations). Taking the easiest one (R5, hottest), it determines that, when minimum temperatures are very high or over, the mean temperature should be high (centered on 68.6°F), moving up (or down) depending on the maximum temperature by 0.71 per degree over (or under) 82.2°F. In the cases where the minimum temperature is medium (R3 and R4), we find two different situations depending on the dew point. Where the dew point is high or over, the mean temperature should be medium (centered on 57.7°F). Where the dew point is up to medium, the mean temperature should be somewhat lower, i.e. between low and medium (centered on 34.2°F). In both cases, variability is once again explained by the variations in maximum temperature, and depending on the dew point these maximum temperatures move in different ranges (55.5 with respect to 39.9 as their respective average points for adding or subtracting). We can also see that the variability depending on the maximum temperature is higher in the case described by R5 than in the cases described by R3 and R4 (when temperatures are high, in general, changes in the maximum temperature affect the mean value estimation more). This kind of relative information among consequent factors cannot be found (or is not easy to find) in models obtained with classic linguistic rules, which makes it a new and useful piece of information that was not available in previous linguistic fuzzy proposals.
Finally, the cases where the minimum temperature is up to low (R1 and R2, colder cases) can be analyzed in the same manner, taking into account that both rules depend on visibility (clear day or not) and vary with different factors (maximum temperature for clear days, dew point for foggy days).
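The reading of R5 given above (a center value plus a linear correction on the maximum temperature) can be written out as a small worked example. This is only a sketch of the local consequent of that single rule, using the three numbers quoted in the text; it assumes R5 is the only rule firing and ignores how the full model combines rules during inference. The function name is hypothetical.

```python
def r5_mean_temperature(max_temp_f: float) -> float:
    """Local reading of rule R5: 68.6 degrees F, shifted by 0.71 per degree
    of maximum temperature over (or under) 82.2 degrees F."""
    center = 68.6          # consequent center quoted for R5
    slope = 0.71           # change in mean temperature per degree of MaxTemp
    reference_max = 82.2   # MaxTemp value at which the output equals the center
    return center + slope * (max_temp_f - reference_max)


at_reference = r5_mean_temperature(82.2)    # equals the center, 68.6
ten_degrees_over = r5_mean_temperature(92.2)  # 68.6 + 0.71 * 10 = 75.7
```

Under these assumptions, a day whose maximum temperature is ten degrees above the reference point raises the estimated mean temperature by about 7.1°F.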

Figure 3: KB obtained with the method proposed in the dataset WBC. MSETst obtained is 640.9

The linguistic model obtained for the WBC dataset (predicting the number of months until breast cancer is likely to recur, based on characteristics of individual cells from images) is shown in Figure 3. The obtained model is quite interesting since, with only 3 rules, it obtains very precise results with respect to those obtained by the methods in our comparisons. In this case, we leave the interpretation up to the reader, who should take into account that: the texture of the cell nucleus is measured as the variance of the gray-scale intensities (i.e., the higher, the "uglier" the cells, so that they are more malignant); and Fractal Dimension is the coastline approximation (i.e., the higher, the better the approximation, so that contours are more regular and therefore more benign). R3 represents the cases with the highest severity, R2 the cases with the least severity, and R1 the intermediate cases.

D. Comparing to some state-of-the-art general purpose "accuracy oriented" methods

While accuracy is not the main focus of the article, the proposed algorithm is also compared to some highly accurate state-of-the-art algorithms (available in recognized software tools such as JSAT: Java Statistical Analysis Tool, a Library for Machine Learning; R: A Language and Environment for Statistical Computing; Scikit-learn: Machine Learning in Python; and M5PrimeLab: M5' regression tree, model tree, and tree ensemble toolbox for Matlab/Octave), in order to help readers appreciate the accuracy achieved as compared to other methods in the literature (simply, for benchmarking). The representative algorithms considered in this contribution are shown in Table II, which briefly describes each algorithm and provides the corresponding literature reference. Regarding the algorithmic parameters, we use the standard ones recommended by the authors (those included in each tool as default recommended parameters). The only exception is the total number of trees in the Random Forest based algorithm, which is not 500 by default. Setting this value to 500 improved the results systematically, without significant improvements beyond this value, so we fixed it to 500 for this comparison. Finally, since our MSE is divided by 2, we multiplied our results by 2 to perform this comparison.
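The rescaling mentioned in the last sentence can be made concrete with a short sketch. It assumes that "MSE divided by 2" means the sum of squared errors divided by 2N, where N is the number of examples; the function names and the sample data are hypothetical.

```python
def half_mse(y_true, y_pred):
    """Squared-error measure divided by 2 (assumed form: SSE / (2 * N))."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return sse / (2 * len(y_true))


def to_standard_mse(half_value):
    """Multiply by 2 to compare against tools that report the plain MSE."""
    return 2.0 * half_value


# Hypothetical targets and predictions.
h = half_mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])   # 4 / 6
standard = to_standard_mse(h)                     # 4 / 3
```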

Table II: Algorithms considered as representatives of more accurate, non-transparent approaches

Algorithm Type	Cite	Link	Description
Model Trees (MT)	M5PrimeLab: M5' regression tree, model tree, and tree ensemble toolbox for Matlab/Octave.	link	M5 prime regression method implementation
Neural Networks (NNET)	Adam: A Method for Stochastic Optimization.	link	MLP squared-loss stochastic gradient (100 hidden neurons)
Random Forests (RF)	Gene selection with guided regularized random forest.	link	Regularized random forest algorithm with 500 trees
Support Vector Machines (SVM)	Large-Scale Linear Support Vector Regression.	link	Dual coordinate descent for large-scale linear SVM

These algorithms and the 23 regression datasets are publicly available, so for the sake of simplicity we directly provide the statistical test results. Table III shows the rankings obtained with Friedman's test for the different methods considered in this study on test error. In this case, the proposed algorithm is ranked second behind RF, which performs quite well.

Table IV shows the adjusted p-values (apv) obtained using Holm's test, comparing all the methods versus the proposed method on test error. The results show that the proposed method outperforms the methods ranked below it with low apvs (0.128 in the closest case). On the other hand, the apv with respect to RF is considerably higher, indicating that the results of these two approaches are not far apart.

Table III: Rankings obtained using Friedman's test on Tst.

Algorithm	Ranking
RF	1.609
Proposed method	2.174
MT	3.174
SVM	3.522
NNET	4.522
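Rankings of this kind are average ranks over the datasets: on each dataset the methods are ranked by test error (1 for the lowest error) and the ranks are then averaged per method. The following minimal sketch shows that computation on a hypothetical error table; the function names and numbers are illustrative and do not reproduce the values in Table III.

```python
def rank_row(errors):
    """Rank one dataset's errors (1 = lowest); ties share the average rank."""
    order = sorted(range(len(errors)), key=lambda i: errors[i])
    ranks = [0.0] * len(errors)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and errors[order[end + 1]] == errors[order[pos]]):
            end += 1
        shared = (pos + end) / 2.0 + 1.0  # mean of 1-based positions
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def average_ranks(error_table):
    """error_table: one row per dataset, one column per method."""
    n_methods = len(error_table[0])
    totals = [0.0] * n_methods
    for row in error_table:
        for j, rank in enumerate(rank_row(row)):
            totals[j] += rank
    return [total / len(error_table) for total in totals]


# Hypothetical test errors for three methods on two datasets.
demo = average_ranks([[1.0, 2.0, 3.0], [2.0, 1.0, 3.0]])
```

A lower average rank means that the method tends to obtain lower test errors across the datasets.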

Table IV: Adjusted p-values using Holm's test. Proposed Method versus all on Tst.

Algorithm	apv on Tst
Proposed vs NNET	4.289E-6
Proposed vs SVM	0.023
Proposed vs MT	0.128
Proposed vs RF	0.451
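For readers unfamiliar with Holm's procedure, the step-down adjustment can be sketched as follows. The unadjusted p-values used here are hypothetical and the sketch is not meant to reproduce the values in Table IV, which depend on details of the test not given on this page.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for step, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, etc.
        candidate = min(1.0, (m - step) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical unadjusted p-values for three comparisons against a control.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

With these hypothetical inputs the adjusted values are approximately 0.03, 0.06 and 0.06, in the same order as the inputs.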

As previously mentioned, accuracy is not the main objective of the article. Nevertheless, in our opinion, and taking into account that the proposed approach obtains fewer than 7 rules in all the datasets (fewer than 5 on average), these results show a really competitive performance from an accuracy point of view as well. The proposed approach competes well with models built from 500 trees.

Last update: 18/06/2021