TRPO

E98480

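TRPO (Trust Region Policy Optimization) is an on-policy reinforcement learning algorithm that improves a stochastic policy in batch updates, constraining the KL divergence between the old and new policy so that each update stays within a trust region. The underlying theory guarantees monotonic improvement under certain conditions; the practical algorithm approximates that guarantee, which prevents large, destructive policy updates.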
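
The constrained problem described above is commonly written as follows. This is a sketch in standard notation, not taken from this page: the symbols \theta_{\text{old}} (parameters before the update), \hat{A} (an advantage estimate), and \delta (the trust-region radius) are assumptions introduced here for illustration.

```latex
\begin{aligned}
\max_{\theta} \quad & \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
  \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, \hat{A}(s,a) \right] \\
\text{subject to} \quad & \mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}
  \left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta
\end{aligned}
```

The ratio inside the objective is the importance sampling ratio, and the objective as a whole is the surrogate that the statements below refer to.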

Statements (51)

Predicate Object
instanceOf reinforcement learning algorithm
algorithmClass trust region method
applicableTo continuous control tasks
discrete action spaces
high-dimensional control problems
approximates natural policy gradient
category policy optimization algorithm
coAuthor Michael I. Jordan
Philipp Moritz
Pieter Abbeel
Sergey Levine
constraintType KL divergence constraint
trust region
field artificial intelligence
machine learning
reinforcement learning
firstAuthor John Schulman
fullName TRPO (self-link; surface form differs)
surface form: Trust Region Policy Optimization
hasAbbreviation TRPO (self-link)
inspired design of PPO
introducedInPaper TRPO (self-link; surface form differs)
surface form: Trust Region Policy Optimization
isOnPolicy true
keyIdea constraining KL divergence between old and new policy
guaranteed monotonic policy improvement under assumptions
surrogate objective maximization
trust region constraint on policy updates
limitation computationally expensive due to second-order optimization
on-policy sample inefficiency
objective expected return
policy performance
optimizationType constrained optimization
optimizes parameterized policies
stochastic policies
publicationVenue ICML
surface form: International Conference on Machine Learning
publicationYear 2015
relatedTo Actor-Critic methods
Natural Policy Gradient
PPO
REINFORCE
requires estimation of advantages from trajectories
rollouts from current policy
stabilityProperty monotonic improvement guarantee under certain conditions
prevents large destructive policy updates
updateType batch policy update
usedWith deep neural network policies
value function baselines
uses advantage function estimates
conjugate gradient optimization
importance sampling ratios
line search
policy gradient methods
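
The pieces listed under "uses" and "constraintType" (importance sampling ratios, conjugate gradient optimization, line search, KL divergence constraint) can be combined roughly as in the sketch below. It is a minimal illustration, not the reference implementation: it assumes a single-state softmax policy with known advantages so that the Fisher matrix has a closed form, and the names `trpo_step`, `delta`, `damping`, and `backtracks` are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(logits - logits.max())
    return e / e.sum()


def kl(p, q):
    """KL divergence KL(p || q) between two categorical distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def conjugate_gradient(mvp, b, iters=10, tol=1e-10):
    """Approximately solve A x = b using only matrix-vector products mvp(v) = A v."""
    x = np.zeros_like(b)
    r = b.copy()  # residual b - A x, with x = 0
    p = b.copy()  # search direction
    rs = r @ r
    for _ in range(iters):
        if rs < tol:
            break
        Ap = mvp(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x


def trpo_step(theta, advantages, delta=0.01, damping=1e-3, backtracks=10):
    """One trust-region update for a single-state softmax policy (toy assumption)."""
    pi_old = softmax(theta)

    def surrogate(th):
        # E_{a ~ pi_old}[ratio * A], with ratio = pi_new / pi_old.
        ratio = softmax(th) / pi_old
        return float(np.sum(pi_old * ratio * advantages))

    # Gradient of the surrogate at theta (softmax Jacobian applied to advantages).
    g = pi_old * (advantages - pi_old @ advantages)
    # Fisher information of a softmax policy, damped for numerical stability.
    fisher = np.diag(pi_old) - np.outer(pi_old, pi_old)

    def mvp(v):
        return fisher @ v + damping * v

    # Natural-gradient direction s solving F s = g via conjugate gradient.
    s = conjugate_gradient(mvp, g)
    curvature = s @ mvp(s)
    if curvature <= 0.0:
        return theta  # zero gradient: nothing to improve
    # Scale so the quadratic KL approximation equals delta.
    full_step = np.sqrt(2.0 * delta / curvature) * s

    # Backtracking line search: accept the first step that satisfies the
    # exact KL constraint and improves the surrogate objective.
    base = surrogate(theta)
    for k in range(backtracks):
        candidate = theta + (0.5 ** k) * full_step
        if kl(pi_old, softmax(candidate)) <= delta and surrogate(candidate) > base:
            return candidate
    return theta  # no acceptable step found; keep the old policy
```

Calling `trpo_step(np.zeros(3), np.array([1.0, 0.0, -1.0]))` returns logits that shift probability toward the first action while keeping the KL divergence from the uniform starting policy at or below `delta`. In practice the Fisher-vector product is computed from sampled trajectories without forming the matrix, which is why conjugate gradient is used rather than a direct solve.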

Referenced by (13)

Full triples — surface form annotated when it differs from this entity's canonical label.

John Schulman notableWork TRPO
this entity surface form: Trust Region Policy Optimization
John Schulman authorOf TRPO
this entity surface form: “Trust Region Policy Optimization”
ACKTR comparedWith TRPO
PPO relatedTo TRPO
PPO improvesUpon TRPO
PPO2 contrastsWith TRPO
this entity surface form: Trust Region Policy Optimization
TRPO fullName TRPO (self-link; surface form differs)
this entity surface form: Trust Region Policy Optimization
TRPO hasAbbreviation TRPO (self-link)
TRPO introducedInPaper TRPO (self-link; surface form differs)
this entity surface form: Trust Region Policy Optimization
Proximal Policy Optimization relatedTo TRPO
this entity surface form: Trust Region Policy Optimization
Proximal Policy Optimization comparedTo TRPO
this entity surface form: Trust Region Policy Optimization
Generalized Advantage Estimation compatibleWith TRPO
this entity surface form: Trust Region Policy Optimization