Complementary materials for the paper


"Experimental study on 164 algorithms available in software tools for solving standard non-linear regression problems".

published in IEEE Access


by

María José Gacto, J. M. Soto-Hidalgo, Jesús Alcalá-Fdez, and Rafael Alcalá

Summary:

  1. Experimental Study
  2. Software used for the experiments
  3. Complete results
  4. Analysis of the Curse of Dimensionality

In the following sections, the complementary materials (datasets and detailed results) of the referred paper can be downloaded. If you use any of them, please cite us using the following reference:
  • María José Gacto, J. M. Soto-Hidalgo, Jesús Alcalá-Fdez, and Rafael Alcalá. Experimental Study on 164 Algorithms Available in Software Tools for Solving Standard Non-Linear Regression Problems, IEEE Access 7 (2019) pp. 108916-108939 (download the paper and/or the BibTeX citation via DOI 10.1109/ACCESS.2019.2933261)

 

 

A. Experimental Study

The main aim of this experimental study is to analyze different regression methods. The experimentation is undertaken with 52 real-world datasets, with a number of variables within the interval [2, 60] and a number of examples within the interval [43, 45730]. In all the experiments, a 5-fold cross-validation model (5fcv) has been adopted, i.e., the dataset has been split randomly into 5 folds, each containing 20% of the patterns of the dataset. Thus, four folds have been used for training and one for testing. The properties of these datasets are presented in Table I: short name of the dataset (NAME), number of variables (VAR), and number of examples (EXAMPLES). You may download all datasets in the Weka format (.arff) by clicking here.
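The 5fcv protocol described above can be sketched as follows. This is a minimal illustration, not the exact experimental code: the dataset is a synthetic Friedman-style generator standing in for one of the 52 real-world datasets, and the regression tree is just an example method.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for one of the 52 datasets (Friedman-style generator).
X, y = make_friedman1(n_samples=1200, n_features=5, random_state=0)

# 5-fold cross-validation: 4 folds (80%) for training, 1 fold (20%) for testing.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
mse_tra, mse_tst = [], []
for tr_idx, te_idx in kf.split(X):
    model = DecisionTreeRegressor(random_state=0).fit(X[tr_idx], y[tr_idx])
    mse_tra.append(mean_squared_error(y[tr_idx], model.predict(X[tr_idx])))
    mse_tst.append(mean_squared_error(y[te_idx], model.predict(X[te_idx])))

# Averages and standard deviations over the 5 folds, as reported in the result tables.
print(f"MSETra = {np.mean(mse_tra):.4f} (SD {np.std(mse_tra):.4f})")
print(f"MSETst = {np.mean(mse_tst):.4f} (SD {np.std(mse_tst):.4f})")
```

The averages over the five test folds correspond to the MSETst values reported per dataset in the result tables below.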

These datasets have been downloaded from the following web pages:

 

Table I: Properties of the datasets

NAME VAR EXAMPLES NAME VAR EXAMPLES NAME VAR EXAMPLES
2DPLANES 10 40768 DELTAAIL 5 7129 MPG8 7 392
ABA 8 4177 DELTAELV 6 9517 MV 10 40768
ADD10 10 9792 DIABETES 2 43 PLA 2 1650
AIL 40 13750 DIAMOND 18 308 POLE 26 14998
AIRFOIL 5 1503 ELE1 2 495 PUMA32 32 8192
ANA 7 4052 ELE2 4 1056 PUMA8 8 8192
AUTOPRICE 15 159 ELV 18 16599 PYRIM 27 74
BANK32 32 8192 FAT 14 252 QUA 3 2178
BANK8 8 8192 FOR 12 517 STO 9 950
BAS 16 337 FRIED 5 1200 STRIKES 6 625
BOSTON 13 506 HOUSE16 16 22784 TRE 15 1049
CA 21 8192 HOUSE8 8 22784 TRIAZ 60 186
CAL 8 20640 KINE32 32 8192 WA 9 1609
CASP 9 45730 KINE8 8 8192 WI 9 1461
CCPP 4 9568 LASER 4 993 WPBC 32 194
CONCRETE 8 1030 MACHINECPU 6 209 YH 6 308
CPU_SMALL 12 8192 MOR 15 1049
DEE 6 365 MPG6 5 392

 

 

B. Software used for the experiments

Six software tools were used to run the algorithms analyzed in the experimental study. These tools, together with a short description, are listed below:

  • JSAT: Java Statistical Analysis Tool, a Library for Machine Learning, available via the link
  • KEEL 3.0: An Open Source Software for Multi-Stage Analysis in Data Mining, available via the link
  • Matlab, available via the link. Moreover, we use several toolboxes implemented by Gints Jekabsons. These toolboxes are open-source regression software for Matlab/Octave, licensed under the GNU GPL, and available via the link:
    • ARESLab: Adaptive Regression Splines toolbox for Matlab/Octave
    • M5PrimeLab: M5' regression tree, model tree, and tree ensemble toolbox for Matlab/Octave
    • PRIM: Bump Hunting using Patient Rule Induction Method for Matlab/Octave
  • Scikit-learn: Machine Learning in Python, available via the link
  • R: A Language and Environment for Statistical Computing, available via the link
  • The Weka data mining software, available via the link
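The datasets above are distributed in Weka's .arff format, which can also be read from the other environments. As a minimal sketch (the file content here is a toy example, not one of the actual datasets), scipy can parse an .arff source in Python:

```python
import io

from scipy.io import arff

# A tiny .arff source in the format used by the downloadable datasets (toy content).
toy = """\
@relation toy
@attribute x1 numeric
@attribute x2 numeric
@attribute target numeric
@data
1.0,2.0,3.5
0.5,1.5,2.0
"""

# loadarff accepts a file path or a file-like object and returns (data, metadata).
data, meta = arff.loadarff(io.StringIO(toy))
X = [[row["x1"], row["x2"]] for row in data]
y = [row["target"] for row in data]
print(meta.names())  # attribute names declared in the header
print(X, y)
```

For the real files, the last attribute is typically the regression target, as in this toy example.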

 

C. Complete results

The complete results obtained by the 164 studied methods on all the datasets can be found in a downloadable spreadsheet. The results are grouped into one table per algorithm, where each table shows the averages of the results obtained by that algorithm over all the studied datasets. For each algorithm, the first four columns show the average MSE on training and test data (MSETra/MSETst) together with their respective standard deviations (SDs), and the last column shows the average computational cost in seconds (AvTime).

Notice that the headers of each table are color-coded to distinguish the software tools by category. For example, blue denotes the methods that are available in R.

Complete results in .xlsx format can be downloaded here

D. Analysis of the Curse of Dimensionality

Here we provide two spreadsheets including the complete results obtained by the 164 studied methods, sorted by Friedman's ranking, when only high-dimensional datasets are considered (>=9 variables) and when only low-dimensional datasets are considered (<9 variables). The same is provided when T2, from the data complexity framework, is used to separate the datasets (T2 >= 250 or T2 < 250, respectively).
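The Friedman ranking used to sort these spreadsheets can be sketched as follows: on each dataset the algorithms are ranked by their test error (rank 1 = best, ties averaged), and the ranks are then averaged over all datasets. The MSETst matrix below is purely illustrative, not taken from the actual results.

```python
import numpy as np
from scipy.stats import rankdata

# Toy MSETst matrix: rows = datasets, columns = algorithms (illustrative values).
mse_tst = np.array([
    [0.10, 0.20, 0.15],
    [1.50, 1.20, 1.30],
    [0.05, 0.07, 0.06],
])

# Rank the algorithms on each dataset (1 = best, ties get averaged ranks),
# then average those ranks over all datasets.
ranks = np.apply_along_axis(rankdata, 1, mse_tst)
avg_rank = ranks.mean(axis=0)
print(avg_rank)  # lower average rank = better overall behavior
```

Sorting the algorithms by this average rank yields the ordering used in the downloadable spreadsheets.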

Complete results separated by Dimensionality in .xls format can be downloaded here

Complete results separated by T2 in .xls format can be downloaded here
