Transformer-XL

E701503

Transformer-XL is a neural network architecture for language modeling that extends the Transformer with segment-level recurrence and relative positional encodings to better capture long-range dependencies.
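
As a concrete illustration of the segment-level recurrence mentioned above: hidden states from the previous segment are cached and prepended as read-only memory when the current segment attends, so the effective context grows beyond a single segment. The single-head NumPy sketch below is ours, not the paper's code (the name attend_with_memory is invented for illustration); it omits causal masking, stacked layers, and the relative-position term.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden size (illustrative)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def attend_with_memory(h, mem):
    """Single-head attention whose keys/values cover [mem; h].

    Queries come only from the current segment h; cached states in mem
    extend the attention context without being recomputed.
    (Causal masking and the relative-position term are omitted for brevity.)
    """
    ctx = h if mem is None else np.concatenate([mem, h], axis=0)
    q = h @ Wq                  # queries: current segment only
    k, v = ctx @ Wk, ctx @ Wv   # keys/values: cached states plus current segment
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # softmax over the extended context
    return w @ v

mem = None
for segment in np.split(rng.standard_normal((4 * 8, d)), 4):  # 4 segments of length 8
    out = attend_with_memory(segment, mem)
    mem = segment  # cache this segment's states; treated as constants next step
print(out.shape)  # (8, 16): outputs for the final segment
```

In the full model, each layer caches its own hidden states and gradients are stopped through the cache, so training cost stays per-segment while the attention context extends into prior segments.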

Statements (47)

Predicate Object
instanceOf Transformer variant
language model architecture
neural network architecture
addressesLimitationOf standard Transformer context length
aimsTo capture long-range dependencies
appliedTo character-level language modeling
word-level language modeling
benchmarkedOn Enwik8
One Billion Word Benchmark
WikiText-103
describedInPaper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
designedFor language modeling
developedAt Carnegie Mellon University
Google Brain
evaluationSpeedupReason reuse of cached hidden states
extends Transformer
hasFullName Transformer eXtra Long
hasKeyConcept decoupling positional encoding from absolute positions
reusing hidden states from previous segments
improves evaluation efficiency
modeling of long-term dependencies
training efficiency for long sequences
improvesMetric perplexity on language modeling benchmarks
influenced later long-context Transformer architectures
introducedFeature relative positional encodings
segment-level recurrence
memoryMechanismType segment-level recurrence over hidden states
outperforms standard Transformer on long-context language modeling benchmarks
paperPublishedAt ACL 2019
positionalEncodingType relative positional encoding
proposedBy Jaime Carbonell
Quoc V. Le
Ruslan Salakhutdinov
Yiming Yang
Zhilin Yang
Zihang Dai
proposedIn 2019
reduces context fragmentation
supports longer effective context than vanilla Transformer
trainingObjective autoregressive language modeling
uses layer normalization
memory mechanism
multi-head attention
position-wise feed-forward networks
relative positional embeddings
residual connections
self-attention
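
For reference, the relative positional encoding recorded above replaces absolute-position terms in the attention score. In the formulation of the paper cited under describedInPaper, the pre-softmax attention score between query position i and key position j decomposes as

    A^{rel}_{i,j} = E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j} + u^{\top} W_{k,E} E_{x_j} + v^{\top} W_{k,R} R_{i-j}

where the E terms are content embeddings, R_{i-j} is a sinusoidal encoding of the relative offset i - j, and u, v are learned global biases. Because only the offset i - j appears, cached hidden states from earlier segments can be reused without re-indexing absolute positions.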

Referenced by (1)

Full triples — surface form annotated when it differs from this entity's canonical label.

Layer Normalization usedIn Transformer-XL