PPO2

E98479

PPO2 is an improved variant of the Proximal Policy Optimization (PPO) reinforcement learning algorithm, designed for stable and efficient policy gradient training in both continuous and discrete control tasks.
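
A minimal usage sketch follows, assuming the Stable-Baselines3 PPO class (listed under isImplementedIn below as the conceptual successor) and the CartPole-v1 benchmark environment; the environment choice is an illustrative assumption, and the hyperparameter values shown are the library defaults, spelled out only to mirror the hasHyperparameter statements below.

    # Sketch: training a PPO agent with Stable-Baselines3. Keyword
    # names correspond to the hasHyperparameter statements in this
    # entry; values are the library defaults, shown for illustration.
    from stable_baselines3 import PPO

    model = PPO(
        "MlpPolicy",          # MLP policy and value networks
        "CartPole-v1",        # illustrative Gym/Gymnasium environment
        learning_rate=3e-4,   # learning rate
        n_epochs=10,          # number of epochs over the same batch
        batch_size=64,        # mini-batch size
        gamma=0.99,           # discount factor gamma
        gae_lambda=0.95,      # GAE lambda
        clip_range=0.2,       # clip range
        ent_coef=0.0,         # entropy coefficient
        vf_coef=0.5,          # value function coefficient
        verbose=1,
    )
    model.learn(total_timesteps=50_000)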

Statements (47)
Predicate              Object
instanceOf             policy gradient method
                       reinforcement learning algorithm
abbreviationOf         Proximal Policy Optimization 2
aimsTo                 improve sample efficiency
                       improve training stability
avoids                 second-order optimization used in TRPO
basedOn                Proximal Policy Optimization
commonlyUsedFor        benchmark continuous control tasks
                       game-playing agents
                       robotics control tasks
commonlyUsedWith       OpenAI Gym environments
contrastsWith          Trust Region Policy Optimization
controls               policy update step size via clipping parameter
designedFor            continuous control tasks
                       discrete control tasks
                       efficient policy gradient training
                       stable policy gradient training
goal                   balance exploration and exploitation
                       prevent destructive policy updates
hasFeature             clipped value function loss
                       entropy regularization
                       mini-batch stochastic gradient descent
                       multiple epochs over the same batch of data
                       separate policy and value networks
                       value function baseline
hasHyperparameter      GAE lambda
                       clip range
                       discount factor gamma
                       entropy coefficient
                       learning rate
                       mini-batch size
                       number of epochs
                       value function coefficient
improvesUpon           original PPO implementation details
isImplementedIn        Stable-Baselines
                       Stable-Baselines3 (as PPO successor, conceptually similar)
isVariantOf            PPO
optimizes              stochastic policies
supports               on-policy learning
supportsActionSpaces   continuous action spaces
                       discrete action spaces
trainingType           actor-critic
updateType             first-order optimization
uses                   advantage estimation
                       clipped surrogate objective (see the loss sketch below)
                       generalized advantage estimation (see the GAE sketch below)
                       gradient-based optimization
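
The clipped surrogate objective in the uses statements, written out, is L^CLIP(θ) = E_t[min(r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t)], with probability ratio r_t and clip range ε. The sketch below composes it with the value function baseline and entropy regularization named in the hasFeature statements; the function name and array arguments are hypothetical, and clipping of the value loss itself is omitted for brevity.

    import numpy as np

    def ppo_loss(log_probs_new, log_probs_old, advantages,
                 value_preds, returns, entropy,
                 clip_range=0.2, vf_coef=0.5, ent_coef=0.01):
        """One PPO loss evaluation over a mini-batch. All inputs are
        1-D NumPy arrays; names are illustrative, not a library API."""
        # Probability ratio r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t).
        ratio = np.exp(log_probs_new - log_probs_old)
        # Clipped surrogate objective: taking the element-wise minimum
        # makes the bound pessimistic, so large policy steps gain
        # nothing -- this is what "controls policy update step size
        # via clipping parameter" refers to.
        unclipped = ratio * advantages
        clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
        policy_loss = -np.mean(np.minimum(unclipped, clipped))
        # Value function baseline, weighted by the value function coefficient.
        value_loss = np.mean((returns - value_preds) ** 2)
        # Entropy regularization encourages exploration.
        entropy_bonus = np.mean(entropy)
        return policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus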

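The generalized advantage estimation statement, together with the discount factor gamma and GAE lambda hyperparameters, reduces to the backward recursion Â_t = δ_t + γλ Â_{t+1}, with TD error δ_t = r_t + γ V(s_{t+1}) − V(s_t). A generic sketch, not taken from any listed implementation, assuming a single rollout with no episode boundaries:

    import numpy as np

    def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
        """Generalized advantage estimation over one rollout.

        rewards[t] is r_t, values[t] is V(s_t) from the value network,
        and last_value is V(s_T), used to bootstrap the final step.
        Assumes no episode ends inside the rollout (no done-masking).
        """
        advantages = np.zeros(len(rewards))
        next_value, next_advantage = last_value, 0.0
        for t in reversed(range(len(rewards))):
            # One-step TD error.
            delta = rewards[t] + gamma * next_value - values[t]
            # Exponentially weighted (lambda) sum of future TD errors.
            next_advantage = delta + gamma * lam * next_advantage
            advantages[t] = next_advantage
            next_value = values[t]
        return advantages
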
Referenced by (1)
Subject (surface form when different)   Predicate
OpenAI Baselines                        implementsAlgorithm
