A2C
E98476
A2C (Advantage Actor-Critic) is a synchronous policy gradient reinforcement learning algorithm that combines value-based and policy-based methods to improve training stability and sample efficiency. It is the synchronous variant of A3C.
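The combination of methods described above can be sketched for a single transition: the critic's state value serves as a baseline, the bootstrapped TD error gives the advantage, and the total loss sums an advantage-weighted policy term, a value-regression term, and an entropy bonus. This is a minimal illustrative sketch with scalar placeholders; a real implementation would compute these quantities with neural networks and autograd, and the function name and coefficient defaults here are assumptions, not from any specific library.

```python
import math

def a2c_loss(log_prob, value, next_value, reward, entropy,
             gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """Combine policy loss, value loss, and entropy regularization
    for one transition (illustrative sketch, not a library API)."""
    # Bootstrapped TD target and advantage A(s,a) = r + gamma*V(s') - V(s)
    td_target = reward + gamma * next_value
    advantage = td_target - value
    # Policy loss: advantage-weighted negative log-probability of the action
    policy_loss = -advantage * log_prob
    # Value loss: regression of V(s) toward the bootstrapped target
    value_loss = (td_target - value) ** 2
    # Entropy bonus (subtracted) encourages exploration
    total = policy_loss + value_coef * value_loss - entropy_coef * entropy
    return advantage, total

# Example transition: reward 1.0, V(s)=1.0, V(s')=2.0, action prob 0.5
adv, loss = a2c_loss(log_prob=math.log(0.5), value=1.0,
                     next_value=2.0, reward=1.0, entropy=0.69)
```

With these numbers the advantage is 1 + 0.99·2 − 1 = 1.98; a positive advantage pushes the policy to make the taken action more likely.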
Statements (48)
| Predicate | Object |
|---|---|
| instanceOf | actor-critic method → policy gradient method → reinforcement learning algorithm |
| actorOutputs | action probabilities |
| actorUpdatedWith | advantage-weighted log-probabilities |
| advantageDefinition | A(s,a) = Q(s,a) - V(s) |
| canHandle | continuous observation spaces → discrete action spaces → high-dimensional state spaces |
| canUse | multiple parallel environments |
| category | deep reinforcement learning |
| combines | policy-based methods → value-based methods |
| criticOutputs | state-value estimate |
| criticTrainedWith | regression to returns or bootstrapped targets |
| entropyBonusPurpose | encourage exploration |
| fullName | Advantage Actor-Critic |
| goal | improve sample efficiency → improve training stability → reduce gradient variance |
| implementedIn | OpenAI Baselines → PyTorch-based RL libraries → Stable Baselines → Stable Baselines3 → TensorFlow-based RL libraries |
| isOnPolicy | true |
| isPolicyBased | true |
| isRelatedTo | A3C |
| isSynchronous | true |
| isSynchronousVariantOf | A3C |
| isValueBased | true |
| optimizes | stochastic policy |
| reducesVarianceUsing | advantage estimation → value function baseline |
| trainingSignal | temporal-difference error |
| typicalUseCase | Atari game playing → continuous control tasks → discrete action tasks |
| updateFrequency | multiple environment steps per update |
| usesAdvantageFunction | true |
| usesBaseline | state-value function |
| usesFunctionApproximator | neural network |
| usesLearningParadigm | model-free reinforcement learning |
| usesLossComponent | entropy regularization → policy loss → value loss |
| usesObjective | policy gradient objective |
| usesUpdateType | synchronous gradient updates |
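The statements on update frequency and parallel environments describe A2C's synchronous scheme: collect several environment steps from a batch of parallel environments, then compute bootstrapped n-step return targets for one gradient update. The backward recursion can be sketched in pure Python; the function name and argument layout here are illustrative assumptions, not the API of any RL library.

```python
def nstep_returns(rewards, dones, bootstrap_values, gamma=0.99):
    """rewards/dones: [n_steps][n_envs] lists; bootstrap_values: [n_envs]
    critic estimates V(s_T) for the states after the last step.

    Works backwards from the bootstrap values, zeroing the running
    return whenever an episode terminated (done flag of 1.0)."""
    running = list(bootstrap_values)
    returns = []
    for r_t, d_t in zip(reversed(rewards), reversed(dones)):
        running = [r + gamma * R * (1.0 - d)
                   for r, R, d in zip(r_t, running, d_t)]
        returns.append(list(running))
    returns.reverse()  # back to time order
    return returns

# Two steps, two parallel envs; env 1 terminates on the second step,
# so its bootstrap value (2.0) is cut off by the done flag.
targets = nstep_returns(rewards=[[1.0, 0.0], [1.0, 1.0]],
                        dones=[[0.0, 0.0], [0.0, 1.0]],
                        bootstrap_values=[0.5, 2.0])
```

Subtracting the critic's value estimates from these targets yields the advantages that weight the policy gradient, which is how the value-function baseline reduces gradient variance.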
Referenced by (2)
| Subject (surface form when different) | Predicate |
|---|---|
| OpenAI Baselines | implementsAlgorithm |
| Stable Baselines | supportsAlgorithm |