Hindsight Policy Gradients

E441117

Hindsight Policy Gradients is a reinforcement learning algorithm that extends policy gradient methods. It reinterprets, in hindsight, trajectories that failed to reach their intended goal as successful attempts at the goals they actually achieved, which improves sample efficiency in sparse-reward environments.
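The relabeling idea can be sketched in Python as a minimal illustration. The sketch rests on assumptions that are not taken from the source: a tabular softmax policy keyed by (state, goal), goals that are themselves states, and a hypothetical sparse reward of 1 on reaching the goal. The names `policy_probs`, `reward`, `grad_log_prob` and `hindsight_gradient` are hypothetical. The sketch weights each reward for an alternative goal by a running product of policy-probability ratios (per-decision importance sampling), so a trajectory collected for one goal yields a gradient estimate for another. The estimators in the original paper may differ in detail, and baselines and other variance reduction are omitted here.

```python
import math
from collections import defaultdict

N_ACTIONS = 2  # hypothetical toy task: action 0 = step left, 1 = step right


def policy_probs(theta, state, goal):
    """Tabular softmax policy pi(. | state, goal); missing logits default to 0."""
    logits = theta.get((state, goal), [0.0] * N_ACTIONS)
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(state, goal):
    """Hypothetical sparse goal-conditioned reward: 1 only on reaching the goal."""
    return 1.0 if state == goal else 0.0


def grad_log_prob(theta, state, goal, action):
    """d log pi(action | state, goal) / d logits[(state, goal)] = onehot - probs."""
    probs = policy_probs(theta, state, goal)
    return [(1.0 if b == action else 0.0) - p for b, p in enumerate(probs)]


def hindsight_gradient(theta, states, actions, behaviour_goal, target_goal):
    """Per-decision importance-weighted REINFORCE estimate for target_goal,
    computed from a trajectory collected while pursuing behaviour_goal.

    actions[t] takes states[t] to states[t + 1], so len(states) == len(actions) + 1.
    """
    # prefix[tp] = prod_{k < tp} pi(a_k | s_k, target) / pi(a_k | s_k, behaviour)
    prefix = [1.0]
    for s, a in zip(states, actions):
        ratio = policy_probs(theta, s, target_goal)[a] / policy_probs(theta, s, behaviour_goal)[a]
        prefix.append(prefix[-1] * ratio)

    grad = defaultdict(lambda: [0.0] * N_ACTIONS)
    for t in range(len(actions)):
        # Importance-weighted reward-to-go for the relabeled (target) goal.
        weighted_return = sum(
            prefix[tp] * reward(states[tp], target_goal)
            for tp in range(t + 1, len(states))
        )
        if weighted_return == 0.0:
            continue  # this step carries no learning signal for target_goal
        step_grad = grad_log_prob(theta, states[t], target_goal, actions[t])
        key = (states[t], target_goal)
        grad[key] = [g + d * weighted_return for g, d in zip(grad[key], step_grad)]
    return dict(grad)
```

As a usage example, a uniform policy (`theta = {}`) that walks `states = [0, 1, 2, 3]` with `actions = [1, 1, 1]` while pursuing goal 5 gets no signal for the intended goal: `hindsight_gradient({}, states, actions, 5, 5)` returns `{}`. Relabeling to the achieved goal 3 gives `{(0, 3): [-0.5, 0.5], (1, 3): [-0.5, 0.5], (2, 3): [-0.5, 0.5]}`, which pushes the goal-3 policy toward stepping right. When `target_goal == behaviour_goal`, every ratio is 1, and the estimate reduces to ordinary REINFORCE with reward-to-go.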


All labels observed (1)

Label Occurrences
Hindsight Policy Gradients canonical 1

Statements (49)

Predicate Object
instanceOf goal-conditioned reinforcement learning method
policy gradient method
reinforcement learning algorithm
addressesProblem sample inefficiency in policy gradient methods
sparse reward reinforcement learning
appliedTo navigation tasks
robotic manipulation tasks
arXivId arXiv:1805. hindsight-policy-gradients (approximate, not exact id)
category model-free reinforcement learning
comparedWith actor-critic methods without hindsight
standard REINFORCE
evaluationMetric final task success rate
learning speed in sparse reward settings
extends REINFORCE algorithm
standard policy gradient methods
improves sample efficiency of policy gradient methods
introducedBy Alex Ray
Bob McGrew
Filip Wolski
Jonas Schneider
Josh Tobin
Marcin Andrychowicz
OpenAI researchers
Peter Welinder
Rachel Fong
introducedInPaper Hindsight Policy Gradients
keyIdea derive unbiased policy gradient estimators with hindsight goals
reinterpret failed trajectories as successful for alternative goals
use hindsight to construct additional learning signals
operatesOn continuous control tasks
goal-conditioned Markov decision processes
sparse reward environments
optimizationTarget expected return over goals
provides unbiased gradient estimator under certain assumptions
publishedAs arXiv preprint
relatedTo Hindsight Experience Replay
goal-conditioned policies
off-policy reinforcement learning
on-policy reinforcement learning
requires a goal-conditioned reward function
access to achieved goals along a trajectory
supports continuous action spaces
high-dimensional state spaces
uses importance sampling ratios for goal relabeling
usesConcept goal relabeling
hindsight
importance sampling
policy gradients
yearIntroduced 2018
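The optimization target ("expected return over goals") and the importance-sampling estimator listed above can be written as follows. The notation is assumed and does not come from the source: a finite horizon T, a goal distribution p(g), a goal-conditioned policy π(a | s, g; θ), and a goal-conditioned reward r(s, g). This is a per-decision importance-sampling form. It is unbiased when goals affect neither the initial-state distribution nor the transition dynamics, and when π(· | s, g; θ) > 0 wherever π(· | s, g'; θ) > 0. The paper's exact estimators may differ. For a trajectory τ sampled while pursuing goal g, the second line gives an estimate for any alternative goal g'. With g' = g, all ratios equal 1, and it reduces to REINFORCE with reward-to-go.

```latex
\eta(\theta) = \sum_{g} p(g)\, \mathbb{E}\Big[\sum_{t=1}^{T} r(S_t, g) \,\Big|\, G = g;\ \theta\Big]

\widehat{\nabla \eta}_{g'}(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi(A_t \mid S_t, g'; \theta)
  \sum_{t'=t+1}^{T} \Bigg( \prod_{k=0}^{t'-1} \frac{\pi(A_k \mid S_k, g'; \theta)}{\pi(A_k \mid S_k, g; \theta)} \Bigg) r(S_{t'}, g')

\nabla_\theta \eta(\theta) = \sum_{g} p(g)\, \mathbb{E}_{\tau \sim p(\cdot \mid g;\, \theta)}\Big[\sum_{g'} p(g')\, \widehat{\nabla \eta}_{g'}(\tau)\Big]
```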

Referenced by (1)

Full triples (the surface form is annotated when it differs from this entity's canonical label).

Hindsight Experience Replay influenced Hindsight Policy Gradients