Complementary materials for the paper


"Experimental study on 164 algorithms available in software tools for solving standard non-linear regression problems".

published in IEEE Access


by

María José Gacto, J. M. Soto-Hidalgo, Jesús Alcalá-Fdez, and Rafael Alcalá

Summary:

  1. Experimental Study
  2. Software used for the experiments
  3. Complete results
  4. Analysis of the Curse of Dimensionality

In the following sections, the complementary materials (datasets and detailed results) of the referred paper can be downloaded. If you use any of them, please cite us using the following reference:
  • María José Gacto, J. M. Soto-Hidalgo, Jesús Alcalá-Fdez, and Rafael Alcalá. Experimental Study on 164 Algorithms Available in Software Tools for Solving Standard Non-Linear Regression Problems, IEEE Access 7 (2019) pp. 108916-108939 (download the paper and/or the BibTeX citation via DOI 10.1109/ACCESS.2019.2933261)

 

 

A. Experimental Study

The main aim of this experimental study is to analyze different regression methods. The experimentation is undertaken with 52 real-world datasets, with a number of variables within the interval [2, 60] and a number of examples within the interval [43, 45730]. In all the experiments, a 5-fold cross-validation model (5fcv) has been adopted, i.e., the dataset has been split randomly into 5 folds, each containing 20% of the patterns of the dataset. Thus, four folds have been used for training and one for testing. The properties of these datasets are presented in Table I: short name of the dataset (NAME), number of variables (VAR), and number of examples (EXAMPLES). You may download all datasets in the Weka format (.arff) by clicking here.
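The 5fcv protocol described above can be sketched as follows. This is a minimal illustration, not the exact experimental code: the dataset is a synthetic Friedman-style generator standing in for one of the 52 real-world datasets, and the regression tree is just an example method.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for one of the 52 datasets (Friedman-style generator).
X, y = make_friedman1(n_samples=1200, n_features=5, random_state=0)

# 5-fold cross-validation: 4 folds (80%) for training, 1 fold (20%) for testing.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
mse_tra, mse_tst = [], []
for tr_idx, te_idx in kf.split(X):
    model = DecisionTreeRegressor(random_state=0).fit(X[tr_idx], y[tr_idx])
    mse_tra.append(mean_squared_error(y[tr_idx], model.predict(X[tr_idx])))
    mse_tst.append(mean_squared_error(y[te_idx], model.predict(X[te_idx])))

# Averages and standard deviations over the 5 folds, as reported in the result tables.
print(f"MSETra = {np.mean(mse_tra):.4f} (SD {np.std(mse_tra):.4f})")
print(f"MSETst = {np.mean(mse_tst):.4f} (SD {np.std(mse_tst):.4f})")
```

The averages over the five test folds correspond to the MSETst values reported per dataset in the result tables below.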

These datasets have been downloaded from the following web pages:

 

Table I: Properties of the datasets

NAME VAR EXAMPLES NAME VAR EXAMPLES NAME VAR EXAMPLES
2DPLANES 10 40768 DELTAAIL 5 7129 MPG8 7 392
ABA 8 4177 DELTAELV 6 9517 MV 10 40768
ADD10 10 9792 DIABETES 2 43 PLA 2 1650
AIL 40 13750 DIAMOND 18 308 POLE 26 14998
AIRFOIL 5 1503 ELE1 2 495 PUMA32 32 8192
ANA 7 4052 ELE2 4 1056 PUMA8 8 8192
AUTOPRICE 15 159 ELV 18 16599 PYRIM 27 74
BANK32 32 8192 FAT 14 252 QUA 3 2178
BANK8 8 8192 FOR 12 517 STO 9 950
BAS 16 337 FRIED 5 1200 STRIKES 6 625
BOSTON 13 506 HOUSE16 16 22784 TRE 15 1049
CA 21 8192 HOUSE8 8 22784 TRIAZ 60 186
CAL 8 20640 KINE32 32 8192 WA 9 1609
CASP 9 45730 KINE8 8 8192 WI 9 1461
CCPP 4 9568 LASER 4 993 WPBC 32 194
CONCRETE 8 1030 MACHINECPU 6 209 YH 6 308
CPU_SMALL 12 8192 MOR 15 1049
DEE 6 365 MPG6 5 392

 

 

B. Software used for the experiments

Six software tools were used to run the algorithms analyzed in the experimental study. These tools, together with a short description, are listed below:

  • JSAT: Java Statistical Analysis Tool, a Library for Machine Learning, available via the link
  • KEEL 3.0: An Open Source Software for Multi-Stage Analysis in Data Mining, available via the link
  • Matlab, available via the link. Moreover, we use several toolboxes implemented by Gints Jekabsons. These toolboxes are open-source regression software for Matlab/Octave, licensed under the GNU GPL, and available via the link:
    • ARESLab: Adaptive Regression Splines toolbox for Matlab/Octave
    • M5PrimeLab: M5' regression tree, model tree, and tree ensemble toolbox for Matlab/Octave
    • PRIM: Bump Hunting using Patient Rule Induction Method for Matlab/Octave
  • Scikit-learn: Machine Learning in Python, available via the link
  • R: A Language and Environment for Statistical Computing, available via the link
  • The Weka data mining software, available via the link
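The datasets above are distributed in Weka's .arff format, which can also be read from the other environments. As a minimal sketch (the file content here is a toy example, not one of the actual datasets), scipy can parse an .arff source in Python:

```python
import io

from scipy.io import arff

# A tiny .arff source in the format used by the downloadable datasets (toy content).
toy = """\
@relation toy
@attribute x1 numeric
@attribute x2 numeric
@attribute target numeric
@data
1.0,2.0,3.5
0.5,1.5,2.0
"""

# loadarff accepts a file path or a file-like object and returns (data, metadata).
data, meta = arff.loadarff(io.StringIO(toy))
X = [[row["x1"], row["x2"]] for row in data]
y = [row["target"] for row in data]
print(meta.names())  # attribute names declared in the header
print(X, y)
```

For the real files, the last attribute is typically the regression target, as in this toy example.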

 

C. Complete results

The complete results obtained by the 164 studied methods on all the datasets can be found in a downloadable spreadsheet. The results are grouped into one table per algorithm, where each table shows the averages of the results obtained by that algorithm over all the studied datasets. For each algorithm, the first four columns show the average MSE on training and test data (MSETra/MSETst) together with their respective standard deviations (SDs), and the last column shows the average computational cost in seconds (AvTime).

Notice that the headers of each table are color-coded to distinguish the software tools by category. For example, blue denotes the methods that are available in R.

Complete results in .xlsx format can be downloaded here

D. Analysis of the Curse of Dimensionality

Here we provide two spreadsheets including the complete results obtained by the 164 studied methods, sorted by Friedman's ranking, when only high-dimensional datasets are considered (>=9 variables) and when only low-dimensional datasets are considered (<9 variables). The same is provided when T2, from the data complexity framework, is used to separate the datasets (T2 >= 250 or T2 < 250, respectively).
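The Friedman ranking used to sort these spreadsheets can be sketched as follows: on each dataset the algorithms are ranked by their test error (rank 1 = best, ties averaged), and the ranks are then averaged over all datasets. The MSETst matrix below is purely illustrative, not taken from the actual results.

```python
import numpy as np
from scipy.stats import rankdata

# Toy MSETst matrix: rows = datasets, columns = algorithms (illustrative values).
mse_tst = np.array([
    [0.10, 0.20, 0.15],
    [1.50, 1.20, 1.30],
    [0.05, 0.07, 0.06],
])

# Rank the algorithms on each dataset (1 = best, ties get averaged ranks),
# then average those ranks over all datasets.
ranks = np.apply_along_axis(rankdata, 1, mse_tst)
avg_rank = ranks.mean(axis=0)
print(avg_rank)  # lower average rank = better overall behavior
```

Sorting the algorithms by this average rank yields the ordering used in the downloadable spreadsheets.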

Complete results separated by Dimensionality in .xls format can be downloaded here

Complete results separated by T2 in .xls format can be downloaded here
