PPO2
E98479
PPO2 is an improved variant of the Proximal Policy Optimization reinforcement learning algorithm, designed for stable and efficient policy gradient training in continuous and discrete control tasks.
All labels observed (1)
| Label | Occurrences |
|---|---|
| PPO2 (canonical) | 1 |
Statements (47)
| Predicate | Object |
|---|---|
| instanceOf | policy gradient method; reinforcement learning algorithm |
| abbreviationOf | Proximal Policy Optimization [surface form: Proximal Policy Optimization 2] |
| aimsTo | improve sample efficiency; improve training stability |
| avoids | second-order optimization used in TRPO |
| basedOn | Proximal Policy Optimization |
| commonlyUsedFor | benchmark continuous control tasks; game-playing agents; robotics control tasks |
| commonlyUsedWith | OpenAI Gym [surface form: OpenAI Gym environments] |
| contrastsWith | TRPO [surface form: Trust Region Policy Optimization] |
| controls | policy update step size via clipping parameter |
| designedFor | continuous control tasks; discrete control tasks; efficient policy gradient training; stable policy gradient training |
| goal | balance exploration and exploitation; prevent destructive policy updates |
| hasFeature | clipped value function loss; entropy regularization; mini-batch stochastic gradient descent; multiple epochs over the same batch of data; separate policy and value networks; value function baseline |
| hasHyperparameter | GAE lambda; clip range; discount factor gamma; entropy coefficient; learning rate; mini-batch size; number of epochs; value function coefficient |
| improvesUpon | original PPO implementation details |
| isImplementedIn | Stable Baselines [surface form: Stable-Baselines]; Stable Baselines [surface form: Stable-Baselines3 (as PPO successor, conceptually similar)] |
| isVariantOf | PPO |
| optimizes | stochastic policies |
| supports | on-policy learning |
| supportsActionSpaces | continuous action spaces; discrete action spaces |
| trainingType | actor-critic |
| updateType | first-order optimization |
| uses | advantage estimation; clipped surrogate objective; generalized advantage estimation; gradient-based optimization |
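
Several of the statements above (generalized advantage estimation, the clipped surrogate objective, the clipped value function loss, entropy regularization) describe terms of a single training loss. The following is a minimal sketch of how those terms are commonly combined, written in plain Python for readability. It is not the code of any particular PPO2 implementation: the function names are hypothetical, the default values for `gamma`, `lam`, `clip_range`, `vf_coef` and `ent_coef` are illustrative assumptions, and episode boundaries are ignored for brevity.

```python
import math


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one rollout (no episode ends)."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then an exponentially weighted sum of errors.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_policy_loss(new_logp, old_logp, advantages, clip_range=0.2):
    """Clipped surrogate objective, negated so that it can be minimized."""
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
        total += -min(ratio * adv, clipped * adv)
    return total / len(advantages)


def clipped_value_loss(new_values, old_values, returns, clip_range=0.2):
    """Value loss that also limits how far the new estimate may move."""
    total = 0.0
    for v, v_old, ret in zip(new_values, old_values, returns):
        v_clipped = v_old + max(-clip_range, min(v - v_old, clip_range))
        total += 0.5 * max((v - ret) ** 2, (v_clipped - ret) ** 2)
    return total / len(returns)


def ppo_loss(policy_loss, value_loss, entropy, vf_coef=0.5, ent_coef=0.01):
    """Total loss: policy term + weighted value term - entropy bonus."""
    return policy_loss + vf_coef * value_loss - ent_coef * entropy
```

In a training loop, the advantages would be computed once per rollout and the loss would then be minimized with mini-batch gradient steps for several epochs over the same batch, which is where the clip range limits how far each update can move the policy.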
Referenced by (1)
Full triples — surface form annotated when it differs from this entity's canonical label.