A2C
E98476
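A2C (Advantage Actor-Critic) is a synchronous, on-policy policy gradient reinforcement learning algorithm. It combines policy-based and value-based methods: a policy (the actor) is trained alongside a learned state-value function (the critic), and the critic serves as a baseline that reduces gradient variance and improves training stability.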
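
To make the combination of value-based and policy-based methods concrete, the following is a minimal sketch of how critic estimates can be turned into bootstrapped return targets and advantage estimates for one rollout. The function names, the `gamma` values, and the sample numbers are illustrative assumptions and are not taken from any particular library.

```python
def n_step_returns(rewards, dones, bootstrap_value, gamma=0.99):
    """Discounted returns for one rollout, bootstrapped from the critic.

    rewards[t] and dones[t] describe step t; bootstrap_value is the critic's
    estimate V(s_T) for the state reached after the last step.
    (Hypothetical helper, written for illustration only.)
    """
    returns = []
    running = bootstrap_value
    for reward, done in zip(reversed(rewards), reversed(dones)):
        # A terminal step cuts off the bootstrap from later states.
        running = reward + gamma * running * (0.0 if done else 1.0)
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage estimates: return target minus the critic baseline V(s_t)."""
    return [r - v for r, v in zip(returns, values)]


# Made-up rollout of three steps from a single environment.
rewards = [1.0, 0.0, 1.0]
dones = [False, False, False]
values = [0.5, 0.4, 0.6]  # critic estimates V(s_0), V(s_1), V(s_2)

rets = n_step_returns(rewards, dones, bootstrap_value=0.5, gamma=0.5)
advs = advantages(rets, values)
```

With several parallel environments, the same computation would be applied to each environment's rollout before a single synchronous update.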
All labels observed (1)
| Label | Occurrences |
|---|---|
| A2C canonical | 4 |
Statements (48)
| Predicate | Object |
|---|---|
| instanceOf | actor-critic method; policy gradient method; reinforcement learning algorithm |
| actorOutputs | action probabilities |
| actorUpdatedWith | advantage-weighted log-probabilities |
| advantageDefinition | A(s,a) = Q(s,a) - V(s) |
| canHandle | continuous observation spaces; discrete action spaces; high-dimensional state spaces |
| canUse | multiple parallel environments |
| category | deep reinforcement learning |
| combines | policy-based methods; value-based methods |
| criticOutputs | state-value estimate |
| criticTrainedWith | regression to returns or bootstrapped targets |
| entropyBonusPurpose | encourage exploration |
| fullName | Advantage Actor-Critic |
| goal | improve sample efficiency; improve training stability; reduce gradient variance |
| implementedIn | OpenAI Baselines; PyTorch-based RL libraries; Stable Baselines; Stable Baselines3; TensorFlow-based RL libraries |
| isOnPolicy | true |
| isPolicyBased | true |
| isRelatedTo | A3C |
| isSynchronous | true |
| isSynchronousVariantOf | A3C |
| isValueBased | true |
| optimizes | stochastic policy |
| reducesVarianceUsing | advantage estimation; value function baseline |
| trainingSignal | temporal-difference error |
| typicalUseCase | Atari game playing; continuous control tasks; discrete action tasks |
| updateFrequency | multiple environment steps per update |
| usesAdvantageFunction | true |
| usesBaseline | state-value function |
| usesFunctionApproximator | neural network |
| usesLearningParadigm | model-free reinforcement learning |
| usesLossComponent | entropy regularization; policy loss; value loss |
| usesObjective | policy gradient objective |
| usesUpdateType | synchronous gradient updates |
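
The three loss components listed in the statements (policy loss, value loss, entropy regularization) can be sketched for a discrete action space as follows. This is a plain-Python illustration that only evaluates the scalar loss; a real implementation would compute it with an automatic-differentiation framework so that gradients reach the network parameters. The function name `a2c_loss` and the coefficients `value_coef=0.5` and `entropy_coef=0.01` are assumptions chosen for the example, not values stated by the source.

```python
import math


def a2c_loss(probs, actions, advantages, values, returns,
             value_coef=0.5, entropy_coef=0.01):
    """Combined loss over a batch of transitions (discrete actions).

    probs[i] is the actor's action distribution at step i and actions[i]
    is the index of the action taken. Advantages are treated as constants.
    (Hypothetical helper; coefficient defaults are assumptions.)
    """
    n = len(actions)
    # Policy loss: negative advantage-weighted log-probability.
    policy_loss = -sum(
        math.log(p[a]) * adv for p, a, adv in zip(probs, actions, advantages)
    ) / n
    # Value loss: squared error between critic estimates and return targets.
    value_loss = sum((v - r) ** 2 for v, r in zip(values, returns)) / n
    # Mean entropy of the action distributions; subtracting it from the
    # loss rewards less deterministic policies, which encourages exploration.
    entropy = sum(
        -sum(q * math.log(q) for q in p if q > 0.0) for p in probs
    ) / n
    total = policy_loss + value_coef * value_loss - entropy_coef * entropy
    return total, policy_loss, value_loss, entropy


# Made-up batch of two transitions with two possible actions each.
total, policy_loss, value_loss, entropy = a2c_loss(
    probs=[[0.5, 0.5], [0.25, 0.75]],
    actions=[0, 1],
    advantages=[1.0, -1.0],
    values=[0.0, 1.0],
    returns=[1.0, 1.0],
)
```

Minimizing `total` raises the probability of actions with positive advantage, pulls the critic toward its return targets, and keeps the policy from collapsing to a single action too early.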
Referenced by (4)
Full triples — surface form annotated when it differs from this entity's canonical label.