Development of Classification Model batteries for predicting inhibition of tubulin polymerization by small molecules
MASSARELLI, ILARIA;BIANUCCI, ANNA MARIA PAOLA
2011-01-01
Abstract
The development of two classification model (CM) batteries capable of discerning the ability of small molecules to inhibit tubulin polymerization is described. Approximately 1000 molecular descriptors were calculated for approximately 550 compounds collected from the literature. After randomly culling 50 compounds to serve as a subsequent prediction set (PS) for validation, the remainder was used to set up two datasets, one where molecules were considered active or not with an IC50 threshold of 10 μM and the other with an IC50 threshold of 1 μM. Each dataset was rationally split into training sets (TR) and test sets (TS). Several hundred CMs were obtained using different TR sets, many different “decision tree” algorithms and different end-point thresholds for binary classification. The relevant TS sets were used to assess model performance and to reduce the number of models to 15 for each of the two IC50-threshold datasets. The rigorously validated models were further tested for their predictive capability on the prediction set. Although individual models that proved to have the best predictive capability would be useful, we found that using the entire battery of 15 models for each of the datasets strengthens the predictive power significantly, approaching 100% certainty for molecules within the applicability domain of the models. These CM batteries should be quite valuable for assessing the potential of any new chemical entity proposed for synthesis as an inhibitor of tubulin polymerization in an anticancer drug discovery program.
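
To make the two ideas in the abstract concrete (binary labelling at an IC50 threshold, and combining a battery of models into one prediction), the following is a minimal Python sketch. It is not the authors' implementation. The abstract does not state how the 15 models are combined, so the majority vote used here is an assumption, as is treating a compound with IC50 exactly at the threshold as active. The names `label_by_threshold`, `battery_vote` and `Model` are hypothetical, and the toy models stand in for trained decision trees.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical type: a model maps a descriptor vector to 1 (active) or 0 (inactive).
Model = Callable[[Sequence[float]], int]


def label_by_threshold(ic50_um: float, threshold_um: float) -> int:
    """Return 1 (active) if IC50 is at or below the threshold, else 0.

    Assumption: "at or below" counts as active; the abstract does not say
    how a compound exactly at the threshold is treated.
    """
    return 1 if ic50_um <= threshold_um else 0


def battery_vote(models: Sequence[Model],
                 descriptors: Sequence[float]) -> Tuple[int, float]:
    """Majority vote over a battery of binary classifiers.

    Returns the winning label and the fraction of models that agree with it.
    Assumption: simple majority voting; with an odd number of models
    (for example 15) a binary vote cannot tie.
    """
    votes = Counter(model(descriptors) for model in models)
    label, count = votes.most_common(1)[0]
    return label, count / len(models)


# Toy stand-ins for 15 trained models: each thresholds the first descriptor.
toy_models = [lambda d, t=t: 1 if d[0] > t else 0 for t in range(15)]

label, agreement = battery_vote(toy_models, [10.0])
print(label, round(agreement, 2))
```

The agreement fraction gives a rough sense of how unanimous the battery is for a given molecule. A check that the molecule lies within the models' applicability domain would have to be made separately and is not shown here.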