Transformer-XL
E701503
Transformer-XL is a neural network architecture for language modeling that extends the Transformer with segment-level recurrence and relative positional encodings to better capture long-range dependencies.
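The two headline mechanisms combine as follows: the hidden states computed for the previous segment are cached and prepended (with no gradient flow into the cache) to the keys and values of the current segment, so attention can reach past the segment boundary. Below is a minimal single-head, single-layer NumPy sketch of this recurrence; all names are illustrative, and causal masking, multiple layers, and the relative-position terms are omitted for brevity.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, no masking for brevity)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def segment_step(segment, memory, W_q, W_k, W_v):
    """One forward pass over one segment with cached memory.

    `memory` holds hidden states from the previous segment; in the real
    model a stop-gradient is applied to it (plain arrays here, so there
    are no gradients to stop anyway).
    """
    # Keys/values range over [cached memory; current segment];
    # queries come from the current segment only.
    context = np.concatenate([memory, segment], axis=0)
    q = segment @ W_q
    k = context @ W_k
    v = context @ W_v
    out = attention(q, k, v)
    # The current segment's states become the next segment's memory
    # (the full model caches states per layer; one layer shown here).
    new_memory = segment.copy()
    return out, new_memory

# Toy usage: stream two segments of length 4 with model dimension 8.
rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
memory = np.zeros((0, d))  # empty memory before the first segment
for _ in range(2):
    seg = rng.normal(size=(4, d))
    out, memory = segment_step(seg, memory, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

In the full model the cache can span several past segments, which is also the source of the evaluation speedup noted in the statements below: cached states are reused rather than recomputed for every new position.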
Statements (48)
| Predicate | Object |
|---|---|
| instanceOf | Transformer variant; language model architecture; neural network architecture |
| addressesLimitationOf | standard Transformer context length |
| aimsTo | capture long-range dependencies |
| appliedTo | character-level language modeling; word-level language modeling |
| benchmarkedOn | Enwik8; One Billion Word Benchmark; WikiText-103 |
| describedInPaper | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context |
| designedFor | language modeling |
| developedAt | Carnegie Mellon University; Google Brain |
| evaluationSpeedupReason | reuse of cached hidden states |
| extends | Transformer |
| hasFullName | Transformer eXtra Long |
| hasKeyConcept | decoupling positional encoding from absolute positions; reusing hidden states from previous segments |
| improves | evaluation efficiency; modeling of long-term dependencies; training efficiency for long sequences |
| improvesMetric | perplexity on language modeling benchmarks |
| influenced | later long-context Transformer architectures |
| introducedFeature | relative positional encodings; segment-level recurrence |
| memoryMechanismType | segment-level recurrence over hidden states |
| outperforms | standard Transformer on long-context language modeling benchmarks |
| paperPublishedAt | ACL 2019 |
| positionalEncodingType | relative positional encoding (see the formula after this table) |
| proposedBy | Jaime Carbonell; Quoc V. Le; Ruslan Salakhutdinov; William W. Cohen; Yiming Yang; Zhilin Yang; Zihang Dai |
| proposedIn | 2019 |
| reduces | context fragmentation |
| supports | longer effective context than vanilla Transformer |
| trainingObjective | autoregressive language modeling |
| uses | layer normalization; memory mechanism; multi-head attention; position-wise feed-forward networks; relative positional embeddings; residual connections; self-attention |
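The positionalEncodingType statement refers to the relative attention score defined in the paper listed under describedInPaper (Dai et al., 2019), where the score between query position $i$ and key position $j$ decomposes as:

$$
\mathbf{A}^{\mathrm{rel}}_{i,j}
= \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_{k,E}\,\mathbf{E}_{x_j}}_{(a)\ \text{content}}
+ \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_{k,R}\,\mathbf{R}_{i-j}}_{(b)\ \text{content-position}}
+ \underbrace{\mathbf{u}^{\top}\mathbf{W}_{k,E}\,\mathbf{E}_{x_j}}_{(c)\ \text{global content bias}}
+ \underbrace{\mathbf{v}^{\top}\mathbf{W}_{k,R}\,\mathbf{R}_{i-j}}_{(d)\ \text{global position bias}}
$$

Here $\mathbf{E}_{x_i}$ is the hidden state at position $i$, $\mathbf{R}_{i-j}$ is a sinusoidal encoding of the relative offset, and $\mathbf{u}, \mathbf{v}$ are learned vectors that replace the absolute-position query terms of the vanilla Transformer. Because the score depends only on the offset $i-j$, hidden states cached from previous segments can be attended to without positional ambiguity, which is what the hasKeyConcept statement "decoupling positional encoding from absolute positions" points to.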
Referenced by (1)
Full triples — surface form annotated when it differs from this entity's canonical label.