TRPO
E98480
TRPO (Trust Region Policy Optimization) is a reinforcement learning algorithm that constrains each policy update to a trust region, defined by a bound on the KL divergence between the old and new policies. This keeps training stable and yields a monotonic improvement guarantee under certain assumptions.
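
As a rough sketch of what the trust region constraint means, the update is commonly written as the constrained problem below. The notation is an assumption made for illustration and does not appear in this entry: $\pi_\theta$ is the new policy, $\pi_{\theta_{\text{old}}}$ the policy that collected the data, $A$ an advantage estimate, and $\delta$ the trust region size.

$$
\max_{\theta}\; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\theta_{\text{old}}}}(s,a)\right]
\quad \text{subject to} \quad
\mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}\!\left[ D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\right)\right] \le \delta
$$

The ratio inside the expectation is the importance sampling ratio, and the expectation it sits in is the surrogate objective.
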
Statements (51)
| Predicate | Object |
|---|---|
| instanceOf | reinforcement learning algorithm |
| algorithmClass | trust region method |
| applicableTo | continuous control tasks; discrete action spaces; high-dimensional control problems |
| approximates | natural policy gradient |
| category | policy optimization algorithm |
| coAuthor | Michael Jordan; Phil Moritz; Pieter Abbeel; Sergey Levine |
| constraintType | KL divergence constraint; trust region |
| field | artificial intelligence; machine learning; reinforcement learning |
| firstAuthor | John Schulman |
| fullName | Trust Region Policy Optimization |
| hasAbbreviation | TRPO |
| inspired | design of PPO |
| introducedInPaper | Trust Region Policy Optimization |
| isOnPolicy | true |
| keyIdea | constraining KL divergence between old and new policy; guaranteed monotonic policy improvement under assumptions; surrogate objective maximization; trust region constraint on policy updates |
| limitation | computationally expensive due to second-order optimization; on-policy sample inefficiency |
| objective | expected return; policy performance |
| optimizationType | constrained optimization |
| optimizes | parameterized policies; stochastic policies |
| publicationVenue | International Conference on Machine Learning |
| publicationYear | 2015 |
| relatedTo | Actor-Critic methods; Natural Policy Gradient; PPO; REINFORCE |
| requires | estimation of advantages from trajectories; rollouts from current policy |
| stabilityProperty | monotonic improvement guarantee under certain conditions; prevents large destructive policy updates |
| updateType | batch policy update |
| usedWith | deep neural network policies; value function baselines |
| uses | advantage function estimates; conjugate gradient optimization; importance sampling ratios; line search; policy gradient methods |
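
The `uses` row lists conjugate gradient optimization and line search. The following is a minimal sketch of how those two pieces might fit together in one update, assuming a toy problem where the KL divergence is exactly quadratic. All names here (`conjugate_gradient`, `trust_region_step`, `fvp`, `surrogate`, `kl`) are hypothetical and chosen for illustration; practical implementations typically obtain the Fisher-vector product from automatic differentiation on a neural network policy rather than from an explicit matrix.

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only products fvp(v) = F v."""
    x = [0.0] * len(g)
    r = list(g)  # residual g - F x, with x = 0 initially
    p = list(g)  # current search direction
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        Fp = fvp(p)
        alpha = rr / dot(p, Fp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * fi for ri, fi in zip(r, Fp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

def trust_region_step(theta, g, fvp, surrogate, kl, delta=0.01, backtracks=10):
    """One trust-region update: CG direction, scale to the KL bound, backtrack."""
    direction = conjugate_gradient(fvp, g)
    quad = dot(direction, fvp(direction))
    if quad <= 0.0:
        return theta  # no usable direction; keep the old parameters
    # Largest step satisfying the quadratic KL model 0.5 * s^T F s <= delta.
    max_step = sqrt(2.0 * delta / quad)
    base = surrogate(theta)
    for i in range(backtracks):
        frac = 0.5 ** i
        candidate = [t + frac * max_step * d for t, d in zip(theta, direction)]
        # Accept only if the true constraint holds and the surrogate improves.
        if kl(theta, candidate) <= delta and surrogate(candidate) > base:
            return candidate
    return theta  # every candidate was rejected

# Toy setup (assumed for illustration): fixed Fisher matrix, linear surrogate.
F = [[2.0, 0.0], [0.0, 1.0]]
g = [1.0, 1.0]
fvp = lambda v: [dot(row, v) for row in F]
surrogate = lambda th: dot(g, th)

def kl(old, new):
    s = [n - o for n, o in zip(new, old)]
    return 0.5 * dot(s, fvp(s))

theta0 = [0.0, 0.0]
theta1 = trust_region_step(theta0, g, fvp, surrogate, kl)
```

The conjugate gradient solve avoids forming or inverting the Fisher matrix, which is one way the cost of the second-order step noted under `limitation` is usually kept manageable. The backtracking loop is what rejects updates that would leave the trust region or fail to improve the surrogate.
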
Referenced by (3)
| Subject (surface form when different) | Predicate |
|---|---|
| John Schulman ("Trust Region Policy Optimization") | authorOf |
| OpenAI Baselines | implementsAlgorithm |
| John Schulman ("Trust Region Policy Optimization") | notableWork |