Implicitly Distributed Fuzzy Random Forests

Marco Barsacchi; Alessio Bechini; Francesco Marcelloni
2021-01-01

Abstract

In Big Data Mining, the availability of effective and efficient classifiers is a prime concern. Accurate classification results can be obtained with sophisticated models, e.g. using ensembling approaches and exploiting concepts of fuzzy set theory, but at a high computational cost. The quest for efficiency leads to the adoption of distributed versions of classification algorithms, and in this effort the support of proper cluster computing frameworks can be fundamental. This paper proposes DFRF, a novel distributed fuzzy random forest induction algorithm based on a fuzzy discretizer for continuous attributes. The described approach, although shaped on the MapReduce programming model, takes advantage of the implicit distribution of the computation provided by the Apache Spark framework. An extensive experimental characterization of the algorithm over Big Datasets, along with a comparison with other state-of-the-art fuzzy classification algorithms, shows that DFRF provides very competitive results; moreover, a scalability study carried out on a small computer cluster shows that the approach is well behaved with respect to an increase in the number of available computing units.
2021
9781450381048

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/1066477