Q-learning
E455376
model-free reinforcement learning method
reinforcement learning algorithm
temporal-difference learning method
Q-learning is a model-free, off-policy reinforcement learning algorithm that learns an action-value function estimating the expected cumulative reward of each state-action pair; a policy is then derived by acting greedily on those estimates.
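A minimal sketch of the tabular algorithm described above, assuming a hypothetical discrete environment object whose `reset()` returns a state index and whose `step(action)` returns `(next_state, reward, done)`; the function name, interface, and default parameters are illustrative, not from this entity's statements:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))          # tabular representation
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else act greedily.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)        # assumed interface, see lead-in
            # Off-policy TD update toward the greedy next-state value:
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            target = r + (0.0 if done else gamma * float(np.max(Q[s_next])))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q  # a greedy policy reads pi(s) = argmax over a of Q[s, a]
```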
Observed surface forms (1)
| Surface form | Occurrences |
|---|---|
| Double Q-learning | 1 |
Statements (47)
| Predicate | Object |
|---|---|
| instanceOf | model-free reinforcement learning method; reinforcement learning algorithm; temporal-difference learning method |
| assumes | discrete action space in basic form |
| canBeExtendedTo | Deep Q-learning |
| canBeImplementedWith | tabular representation |
| canHandle | stochastic rewards; stochastic transitions |
| canUseExplorationStrategy | epsilon-greedy policy; softmax action selection |
| canUseFunctionApproximation | linear function approximator; neural network |
| convergesUnderConditions | Markov decision process; decaying learning rate; sufficient exploration |
| describedInPaper | Q-learning |
| differsFrom | SARSA (Q-learning is off-policy; SARSA is on-policy) |
| doesNotRequire | model of environment dynamics |
| estimates | expected cumulative reward |
| hasAuthor | Christopher J. C. H. Watkins |
| hasCoAuthor | Peter Dayan |
| hasKeyEquation | Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') − Q(s,a)] |
| isModelFree | true |
| isOffPolicy | true |
| isPartOf | reinforcement learning field |
| isRelatedTo | SARSA |
| isSensitiveTo | exploration schedule; learning rate choice; reward scaling |
| isUsedFor | optimal policy learning |
| isUsedIn | autonomous decision-making; game playing; resource allocation; robotics control |
| learns | action-value function |
| operatesOn | state-action pairs |
| policyDerivedBy | greedy action selection over Q-values |
| publicationYear | 1992 |
| publishedInJournal | Machine Learning |
| requires | reward signal |
| solves | Markov decision process control problems |
| updatesFrom | sample transitions |
| usesDiscountFactor | gamma |
| usesLearningRateParameter | alpha |
| usesMaxOperatorOver | next-state action values |
| usesUpdateRule | Bellman optimality equation (see the note after this table) |
| usesValueFunction | true |
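As a reading aid (not part of the source triples): the hasKeyEquation update is a sampled, incremental form of the Bellman optimality equation named in the usesUpdateRule row.

```latex
% Bellman optimality equation for the optimal action-value function Q*:
Q^{*}(s,a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^{*}(s',a') \,\middle|\, s, a \right]

% Q-learning replaces the expectation with one sampled transition (s, a, r, s')
% and takes a step of size alpha toward the sampled target:
Q(s,a) \leftarrow Q(s,a) + \alpha \left[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
```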
Referenced by (2)
Full triples are listed with the surface form annotated when it differs from this entity's canonical label.
This entity's surface form: Double Q-learning
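The referencing entity, Double Q-learning (van Hasselt, 2010), mitigates the overestimation bias introduced by the max operator by keeping two value tables and decoupling action selection from evaluation. A minimal sketch of that single-step update under the same assumed interface as the sketch above; the function name and defaults are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def double_q_update(Qa, Qb, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One Double Q-learning step: select the greedy next action with one
    table but evaluate it with the other, reducing overestimation bias."""
    # Randomly pick which table to update this step.
    if rng.random() < 0.5:
        Q_sel, Q_eval = Qa, Qb
    else:
        Q_sel, Q_eval = Qb, Qa
    a_star = int(np.argmax(Q_sel[s_next]))       # select with one table
    target = r + (0.0 if done else gamma * Q_eval[s_next, a_star])  # evaluate with the other
    Q_sel[s, a] += alpha * (target - Q_sel[s, a])
```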