Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime
Andrea Agazzi
2021-01-01
Abstract
We study the problem of policy optimization for infinite-horizon discounted Markov Decision Processes with a softmax policy and nonlinear function approximation trained with policy gradient algorithms. We concentrate on the training dynamics in the mean-field regime, modeling, e.g., the behavior of wide single-hidden-layer neural networks, when exploration is encouraged through entropy regularization. The dynamics of these models are established as a Wasserstein gradient flow of distributions in parameter space. We further prove global optimality of the fixed points of these dynamics under mild conditions on their initialization.
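For context, the setting referred to in the abstract can be sketched as follows; this is a standard formulation of entropy-regularized softmax policy gradient in the mean-field regime, and the notation is illustrative rather than necessarily that of the paper. The policy is a softmax over a function given by an average (in the wide-network limit, an integral) of single-hidden-layer units:

$$
\pi_\rho(a \mid s) \;=\; \frac{\exp\big(f_\rho(s,a)\big)}{\sum_{a'} \exp\big(f_\rho(s,a')\big)},
\qquad
f_\rho(s,a) \;=\; \int \sigma(s,a;\theta)\,\rho(\mathrm{d}\theta),
$$

where $\rho$ is the distribution of hidden-unit parameters, i.e., the mean-field limit of $\tfrac{1}{N}\sum_{i=1}^{N}\sigma(s,a;\theta_i)$ as the width $N \to \infty$. Exploration is encouraged by adding an entropy bonus with temperature $\tau > 0$ to the discounted return:

$$
V_\tau(\rho) \;=\; \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t \Big( r(s_t,a_t) + \tau\,\mathcal{H}\big(\pi_\rho(\cdot \mid s_t)\big) \Big)\Big].
$$

In this limit, running policy gradient on the network parameters corresponds to the Wasserstein gradient flow of $\rho$ in parameter space whose fixed points the paper analyzes.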
File | Access | Type | License | Size | Format
---|---|---|---|---|---
2010.11858.pdf | Open access | Pre-print | All rights reserved | 438.11 kB | Adobe PDF
global_optimality_of_softmax_p.pdf | Not available (copy on request) | Final published version | Not public (private/restricted access) | 431 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.