PPO

E98478

PPO (Proximal Policy Optimization) is a popular reinforcement learning algorithm known for its stability and sample efficiency in training complex policies, especially in continuous control and high-dimensional environments.


Statements (49)
Predicate          Object
instanceOf         reinforcement learning algorithm
abbreviationFor    Proximal Policy Optimization
aimsFor            sample efficiency
                   stable policy updates
commonlyUsedIn     MuJoCo control tasks
                   OpenAI Gym benchmarks
                   game playing
                   robotics control
designedFor        complex policies
                   continuous control tasks
                   high-dimensional environments
developedBy        OpenAI
fullName           Proximal Policy Optimization
hasVariant         PPO-Clip
                   PPO-Penalty
implementedIn      PyTorch RL libraries
                   RLlib
                   Stable-Baselines3
                   TensorFlow Agents
improvesUpon       TRPO
introducedInPaper  Proximal Policy Optimization Algorithms
keyIdea            approximates trust region methods without complex constraints
                   constrains policy updates to be proximal to the old policy
                   uses clipped surrogate objective
objectiveIncludes  entropy bonus (in many implementations)
oftenCombinedWith  GAE
                   advantage estimation
optimizationType   on-policy
                   policy gradient
primaryAuthors     Alec Radford
                   Filip Wolski
                   John Schulman
                   Oleg Klimov
                   Prafulla Dhariwal
property           relatively easy to implement
                   robust to hyperparameter choices
                   widely adopted as a default RL baseline
publicationYear    2017
relatedTo          A2C
                   A3C
                   TRPO
supports           continuous action spaces
                   discrete action spaces
trainingStyle      mini-batch updates
                   multiple epochs over collected trajectories
uses               clipping parameter epsilon
                   importance sampling ratio
                   stochastic gradient ascent
                   surrogate objective function
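
Several of the statements above (clipped surrogate objective, importance sampling ratio, clipping parameter epsilon) combine into one formula in the PPO-Clip variant: maximize the mean of min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the probability ratio between the new and old policy. A minimal NumPy sketch, assuming log-probabilities as inputs; the function name and the ε default of 0.2 are illustrative choices, not taken from this page:

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantages, epsilon=0.2):
    """Clipped surrogate objective of the PPO-Clip variant (to be maximized).

    new_logp / old_logp: log pi_new(a|s) and log pi_old(a|s) per sample.
    advantages: advantage estimates (e.g. from GAE) per sample.
    """
    # Importance sampling ratio r_t = pi_new(a|s) / pi_old(a|s)
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    # Clipping the ratio to [1 - eps, 1 + eps] keeps updates proximal
    # to the old policy; taking the elementwise min makes the bound
    # pessimistic, so large policy steps are not rewarded.
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))
```

When the new and old policies coincide, the ratio is 1 and the objective reduces to the mean advantage; when the ratio drifts outside [1−ε, 1+ε] with a positive advantage, the clipped term caps the gain, removing the incentive for oversized updates.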

Referenced by (3)
