TD(lambda)
E636114
reinforcement learning algorithm
temporal-difference learning algorithm
value function learning method
TD(λ) is a temporal-difference reinforcement learning algorithm that blends multi-step returns using a decay parameter λ, efficiently estimating value functions from experience with bootstrapped targets.
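For reference, the blending of multi-step returns can be made precise with the standard λ-return definition (as given in *Reinforcement Learning: An Introduction*, which the statements below cite; these formulas are textbook-standard and not taken from this entity record):

$$
G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},
\qquad
G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} R_{t+k+1} + \gamma^{n} V(S_{t+n}),
$$

where λ = 0 recovers the one-step TD(0) target and λ = 1 recovers the Monte Carlo return on episodic tasks.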
Observed surface forms (1)
| Surface form | Occurrences |
|---|---|
| TD(λ) | 0 |
Statements (47)
| Predicate | Object |
|---|---|
| instanceOf | reinforcement learning algorithm; temporal-difference learning algorithm; value function learning method |
| aimsTo | minimize prediction error of value function |
| approaches | Monte Carlo method |
| assumes | stationary environment dynamics |
| blends | Monte Carlo returns; n-step returns; one-step TD returns |
| canBeCombinedWith | function approximation; linear value function approximation; nonlinear value function approximation |
| canEstimate | action-value function |
| category | model-free reinforcement learning method |
| computes | TD error δt |
| controlsBiasVarianceTradeoffWith | λ |
| describedIn | Reinforcement Learning: An Introduction |
| estimates | state-value function |
| generalizes | TD(0) |
| hasHyperparameter | λ |
| hasParameter | discount factor γ; learning rate; value function representation; λ |
| hasView | backward view; forward view |
| implements | backward view of multi-step returns |
| introducedIn | reinforcement learning literature |
| isBasedOn | TD(0) |
| isRelatedTo | Q(λ); SARSA(λ); eligibility-trace methods |
| isUsedFor | policy evaluation; prediction problems in reinforcement learning |
| operatesOn | sequences of states and rewards |
| popularizedBy | Richard S. Sutton |
| propagates | TD errors backward through time |
| reducesTo | Monte Carlo evaluation when λ = 1 (under episodic tasks and certain conditions); TD(0) when λ = 0 |
| requires | Markov decision process setting |
| updates | value estimates after each time step |
| uses | bootstrapped targets; temporal-difference error |
| usesConcept | bootstrapping; eligibility traces; multi-step returns; temporal-difference learning |
Referenced by (1)
Full triples — surface form annotated when it differs from this entity's canonical label.