Big Bird: Transformers for Longer Sequences (@ NeurIPS 2020)
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed
This paper proposes one way to sparsify the attention mechanism in Transformers, combining sliding-window, global, and random attention,
so that the time and memory cost of attention drops from quadratic to linear in the input sequence length.
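
Below is a minimal sketch of this kind of sparse attention pattern in NumPy (illustrative only, not the authors' implementation; all names and parameter values are assumptions). Each query attends only to a sliding window of neighbours, a few global tokens, and a few random keys, so the number of attended keys per query stays constant as the sequence grows. The paper's implementation computes only these entries in a blocked fashion; the dense mask here is just to show the pattern.

```python
# Sketch of a BigBird-style sparse attention mask (window + global + random).
# Not the authors' code; parameter values are illustrative.
import numpy as np

def bigbird_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    """Boolean (seq_len, seq_len) mask; True means query i may attend to key j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # 1. Sliding-window attention: each token sees `window` neighbours on each side.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # 2. Global attention: the first `num_global` tokens attend to, and are
    #    attended by, every position (like a [CLS]-style token).
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # 3. Random attention: each query gets `num_random` extra random keys.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True
    return mask

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with disallowed positions masked out."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)  # block positions outside the sparse pattern
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    n, d = 16, 8
    rng = np.random.default_rng(1)
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    mask = bigbird_mask(n)
    out = masked_attention(q, k, v, mask)
    print(out.shape, "attended keys per query:", mask.sum(axis=1))
```

Because the per-query budget (window size + global tokens + random keys) is fixed, the number of nonzero attention entries grows linearly with sequence length rather than quadratically.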