PPO
E98478
PPO (Proximal Policy Optimization) is a widely used on-policy reinforcement learning algorithm, known for stable policy updates and good sample efficiency relative to earlier policy gradient methods. It is commonly applied to training complex policies in continuous control and high-dimensional environments.
All labels observed (1)
| Label | Occurrences |
|---|---|
| PPO (canonical) | 7 |
Statements (49)
| Predicate | Object |
|---|---|
| instanceOf | reinforcement learning algorithm |
| abbreviationFor | Proximal Policy Optimization |
| aimsFor | sample efficiency; stable policy updates |
| commonlyUsedIn | MuJoCo control tasks; OpenAI Gym benchmarks; game playing; robotics control |
| designedFor | complex policies; continuous control tasks; high-dimensional environments |
| developedBy | OpenAI |
| fullName | Proximal Policy Optimization |
| hasVariant | PPO-Clip; PPO-Penalty |
| implementedIn | PyTorch RL libraries; RLlib; Stable Baselines (surface form: Stable-Baselines3); TF-Agents (surface form: TensorFlow Agents) |
| improvesUpon | TRPO |
| introducedInPaper | Proximal Policy Optimization (surface form: Proximal Policy Optimization Algorithms) |
| keyIdea | approximates trust region methods without complex constraints; constrains policy updates to be proximal to the old policy; uses clipped surrogate objective |
| objectiveIncludes | entropy bonus (in many implementations) |
| oftenCombinedWith | Generalized Advantage Estimation (surface form: GAE); advantage estimation |
| optimizationType | on-policy; policy gradient |
| primaryAuthors | Alec Radford; Filip Wolski; John Schulman; Oleg Klimov; Prafulla Dhariwal |
| property | relatively easy to implement; robust to hyperparameter choices; widely adopted as a default RL baseline |
| publicationYear | 2017 |
| relatedTo | A2C; A3C; TRPO |
| supports | continuous action spaces; discrete action spaces |
| trainingStyle | mini-batch updates; multiple epochs over collected trajectories |
| uses | clipping parameter epsilon; importance sampling ratio; stochastic gradient ascent; surrogate objective function |
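
To make the keyIdea, uses, and oftenCombinedWith statements above more concrete, here is a minimal sketch in plain Python of how the clipped surrogate objective and Generalized Advantage Estimation might be computed. The function names, the default values (`epsilon=0.2`, `gamma=0.99`, `lam=0.95`), and the simplifying assumption of a single trajectory segment with no episode termination inside it are illustrative choices, not taken from this entry or from any particular library.

```python
import math


def clipped_surrogate(new_logps, old_logps, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch (higher is better).

    Hypothetical helper: inputs are per-sample log-probabilities of the
    taken actions under the new and old policy, plus advantage estimates.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Importance sampling ratio pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(new_lp - old_lp)
        # Clip the ratio to [1 - epsilon, 1 + epsilon].
        clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # The minimum removes any gain from pushing the ratio
        # further outside the clipping range.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory segment.

    Assumes no episode ends inside the segment; `last_value` is the
    value estimate of the state that follows the final reward.
    """
    values = list(values) + [last_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards, accumulating discounted TD residuals.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

In a full training loop, the negated objective would typically be minimized with mini-batch stochastic gradient steps over several epochs of the same collected trajectories, often with an entropy bonus added. That requires an automatic differentiation framework and is omitted here.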
Referenced by (7)
Full triples — surface form annotated when it differs from this entity's canonical label.