In Big Data mining, the availability of effective and efficient classifiers is a prime concern. Accurate classification results can be obtained with sophisticated models, e.g. by using ensemble approaches and exploiting concepts of fuzzy set theory, but at a high computational cost. The quest for efficiency leads to the adoption of distributed versions of classification algorithms, and in this effort the support of suitable cluster computing frameworks can be fundamental. This paper proposes DFRF, a novel distributed fuzzy random forest induction algorithm based on a fuzzy discretizer for continuous attributes. The described approach, although shaped on the MapReduce programming model, takes advantage of the implicit distribution of the computation provided by the Apache Spark framework. An extensive experimental characterization of the algorithm on big datasets, along with a comparison with other state-of-the-art fuzzy classification algorithms, shows that DFRF provides very competitive results; moreover, a scalability study carried out on a small computer cluster shows that the approach scales well as the number of available computing units increases.
|Title:||Implicitly Distributed Fuzzy Random Forests|
|Publication year:||2021|
|Appears in types:||4.1 Contribution in conference proceedings|