Spreading Fuzzy Random Forests with MapReduce
BECHINI, ALESSIO; MARCELLONI, FRANCESCO; SEGATORI, ARMANDO
2016-01-01
Abstract
Random forests are currently considered among the most accurate and efficient classifiers. Fuzzy implementations of random forests have recently been proposed to exploit the ability of fuzzy decision trees to cope with uncertain data. When the size of training sets grows substantially, as happens with Big Data, ordinary implementations of classifiers become inadequate, and fuzzy random forests are no exception. In this paper, we consider a method that generates fuzzy partitions of the continuous attributes during decision tree learning, and we propose a distributed implementation of fuzzy random forests based on this method. The implementation relies on the MapReduce programming model and the Apache Hadoop framework. We show that this model readily accommodates an effective distribution strategy for the computation, yielding good scalability figures. The novel distributed algorithm enables fuzzy random forests to handle extremely large data sets in both the learning and the classification phases, thus fostering their applicability in the modern scenario of increasingly frequent data deluges.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.