TRPO

E98480

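TRPO (Trust Region Policy Optimization) is an on-policy reinforcement learning algorithm that maximizes a surrogate objective while constraining the KL divergence between the old and new policies to a trust region. The constraint prevents large destructive updates, and the underlying theory guarantees monotonic policy improvement under certain assumptions.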
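
The constrained update described above is commonly written as follows. This is a sketch in standard notation, and the symbols are assumptions not taken from this page: $\pi_{\theta_{\text{old}}}$ is the policy that collected the data, $\pi_\theta$ the candidate policy, $A^{\pi_{\theta_{\text{old}}}}$ an advantage estimate, and $\delta$ the trust region radius.

```latex
\begin{aligned}
\max_{\theta}\quad
  & \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
    \left[
      \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}
      \, A^{\pi_{\theta_{\text{old}}}}(s, a)
    \right] \\
\text{subject to}\quad
  & \mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}
    \left[
      D_{\mathrm{KL}}\!\left(
        \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)
      \right)
    \right] \le \delta
\end{aligned}
```

The ratio inside the expectation is the importance sampling ratio, and the inequality is the KL divergence constraint that defines the trust region.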


Statements (51)
Predicate Object
instanceOf reinforcement learning algorithm
algorithmClass trust region method
applicableTo continuous control tasks
discrete action spaces
high-dimensional control problems
approximates natural policy gradient
category policy optimization algorithm
coAuthor Michael I. Jordan
Philipp Moritz
Pieter Abbeel
Sergey Levine
constraintType KL divergence constraint
trust region
field artificial intelligence
machine learning
reinforcement learning
firstAuthor John Schulman
fullName Trust Region Policy Optimization
hasAbbreviation TRPO
inspired design of PPO
introducedInPaper Trust Region Policy Optimization
isOnPolicy true
keyIdea constraining KL divergence between old and new policy
guaranteed monotonic policy improvement under assumptions
surrogate objective maximization
trust region constraint on policy updates
limitation computationally expensive due to second-order optimization
on-policy sample inefficiency
objective expected return
policy performance
optimizationType constrained optimization
optimizes parameterized policies
stochastic policies
publicationVenue International Conference on Machine Learning
publicationYear 2015
relatedTo Actor-Critic methods
Natural Policy Gradient
PPO
REINFORCE
requires estimation of advantages from trajectories
rollouts from current policy
stabilityProperty monotonic improvement guarantee under certain conditions
prevents large destructive policy updates
updateType batch policy update
usedWith deep neural network policies
value function baselines
uses advantage function estimates
conjugate gradient optimization
importance sampling ratios
line search
policy gradient methods
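
The "conjugate gradient optimization" and "line search" entries above can be illustrated with a minimal NumPy sketch. It is not a reference implementation: the function names `conjugate_gradient` and `trpo_step`, and the callables `hvp` (Hessian-vector product of the KL divergence), `surrogate`, and `kl`, are hypothetical and supplied by the caller. It assumes `g` is a float array holding the surrogate gradient and that `hvp` behaves like a positive-definite matrix.

```python
import numpy as np


def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products hvp(v)."""
    x = np.zeros_like(g)
    r = g.copy()  # residual g - H x, with x starting at zero
    p = r.copy()  # current search direction
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x


def trpo_step(theta, g, hvp, surrogate, kl,
              delta=0.01, backtrack=0.5, max_backtracks=10):
    """One trust-region update: approximate natural-gradient direction,
    scaled to the KL radius delta, followed by a backtracking line search."""
    step_dir = conjugate_gradient(hvp, g)
    # Scale so the quadratic KL approximation 0.5 * s^T H s equals delta.
    step_size = np.sqrt(2.0 * delta / (step_dir @ hvp(step_dir)))
    full_step = step_size * step_dir
    base = surrogate(theta)
    for i in range(max_backtracks):
        candidate = theta + (backtrack ** i) * full_step
        # Accept only if the KL constraint holds and the surrogate improves.
        if kl(theta, candidate) <= delta and surrogate(candidate) > base:
            return candidate
    return theta  # no acceptable step found; keep the old parameters
```

In practice `g`, `hvp`, `surrogate`, and `kl` would be estimated from rollouts of the current policy, typically with automatic differentiation. The line search shrinks the step until the KL constraint and the surrogate improvement both hold, which is what prevents large destructive policy updates.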

Referenced by (3)
Subject (surface form when different) Predicate
John Schulman ("Trust Region Policy Optimization")
authorOf
OpenAI Baselines
implementsAlgorithm
John Schulman ("Trust Region Policy Optimization")
notableWork
