Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime
Andrea Agazzi
2021-01-01
Abstract
We study the problem of policy optimization for infinite-horizon discounted Markov Decision Processes with softmax policies and nonlinear function approximation, trained with policy gradient algorithms. We concentrate on the training dynamics in the mean-field regime, modeling, e.g., the behavior of wide single hidden layer neural networks, when exploration is encouraged through entropy regularization. The dynamics of these models are established as a Wasserstein gradient flow of distributions in parameter space. We further prove global optimality of the fixed points of this flow under mild conditions on the initialization.
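To make the setting concrete, below is a minimal sketch (not the paper's code) of entropy-regularized softmax policy gradient with a wide single hidden layer network in mean-field (1/m) output scaling, run on a hypothetical toy tabular MDP. The MDP, all names, and the hyper-parameters are illustrative assumptions, and the REINFORCE-style estimator is one standard choice rather than the paper's exact dynamics.

```python
# Hedged sketch: entropy-regularized softmax policy gradient with a wide
# single hidden layer network (mean-field 1/m scaling) on a toy tabular MDP.
# Everything below (MDP, sizes, step sizes) is a hypothetical illustration.
import numpy as np

rng = np.random.default_rng(0)

# --- toy MDP: n_s states, n_a actions, discount gamma, entropy temperature tau ---
n_s, n_a, gamma, tau = 4, 2, 0.9, 0.05
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # transition kernel P[s, a, :]
R = rng.uniform(0, 1, size=(n_s, n_a))             # reward R[s, a]

# --- single hidden layer net with m units; output averaged over units (1/m scaling) ---
m = 512
phi = np.eye(n_s)                                  # one-hot state features
W = rng.normal(size=(m, n_s))                      # input weights
c = rng.normal(size=(m, n_a))                      # output weights

def preferences(s):
    h = np.tanh(W @ phi[s])                        # hidden activations, shape (m,)
    return (h @ c) / m                             # mean over units, shape (n_a,)

def policy(s):
    f = preferences(s)
    p = np.exp(f - f.max())
    return p / p.sum()                             # softmax policy pi(.|s)

def grad_log_pi(s, a):
    """Gradient of log pi(a|s) w.r.t. (W, c) for the softmax policy."""
    pi = policy(s)
    h = np.tanh(W @ phi[s])
    gc = np.zeros_like(c)
    gc[:, a] += h / m
    gc -= np.outer(h / m, pi)                      # subtract E_pi[d f / d c]
    dh = (1.0 - h ** 2) * ((c[:, a] - c @ pi) / m) # chain rule through tanh
    gW = np.outer(dh, phi[s])
    return gW, gc

# --- REINFORCE with entropy-regularized return (one possible unbiased estimator) ---
lr, horizon, n_episodes = 0.5, 30, 2000
for _ in range(n_episodes):
    s = rng.integers(n_s)
    traj = []
    for t in range(horizon):
        pi = policy(s)
        a = rng.choice(n_a, p=pi)
        reg_r = R[s, a] - tau * np.log(pi[a])      # entropy-regularized reward
        traj.append((s, a, reg_r))
        s = rng.choice(n_s, p=P[s, a])
    returns, G = np.zeros(len(traj)), 0.0
    for t in reversed(range(len(traj))):           # regularized return-to-go
        G = traj[t][2] + gamma * G
        returns[t] = G
    gW_tot, gc_tot = np.zeros_like(W), np.zeros_like(c)
    for t, (s_t, a_t, _) in enumerate(traj):
        gW, gc = grad_log_pi(s_t, a_t)
        gW_tot += (gamma ** t) * returns[t] * gW
        gc_tot += (gamma ** t) * returns[t] * gc
    W += lr * gW_tot                               # stochastic gradient ascent step
    c += lr * gc_tot
```

In the mean-field regime studied in the paper, the empirical distribution of the hidden units (W_i, c_i) evolves, as m grows, toward a Wasserstein gradient flow over parameter space; the sketch above only mimics this scaling at finite width.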
File | Type | License | Size | Format | Access
---|---|---|---|---|---
2010.11858.pdf | Pre-print document | NOT PUBLIC - private/restricted access | 438.11 kB | Adobe PDF | Authorized users only
global_optimality_of_softmax_p.pdf | Final published version | NOT PUBLIC - private/restricted access | 431 kB | Adobe PDF | Authorized users only
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.