Big Bird: Transformers for Longer Sequences (@ NeurIPS 2020)
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed
This paper proposes one way to sparsify the attention mechanism in Transformers, combining sliding-window, global, and random attention,
so that the time and memory cost of attention drops from quadratic to linear in the input sequence length.
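
Below is a minimal sketch of this kind of sparse attention pattern in NumPy (illustrative only, not the authors' implementation; all names and parameter values are assumptions). Each query attends only to a sliding window of neighbours, a few global tokens, and a few random keys, so the number of attended keys per query stays constant as the sequence grows. The paper's implementation computes only these entries in a blocked fashion; the dense mask here is just to show the pattern.

```python
# Sketch of a BigBird-style sparse attention mask (window + global + random).
# Not the authors' code; parameter values are illustrative.
import numpy as np

def bigbird_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    """Boolean (seq_len, seq_len) mask; True means query i may attend to key j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # 1. Sliding-window attention: each token sees `window` neighbours on each side.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # 2. Global attention: the first `num_global` tokens attend to, and are
    #    attended by, every position (like a [CLS]-style token).
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # 3. Random attention: each query gets `num_random` extra random keys.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True
    return mask

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with disallowed positions masked out."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)  # block positions outside the sparse pattern
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    n, d = 16, 8
    rng = np.random.default_rng(1)
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    mask = bigbird_mask(n)
    out = masked_attention(q, k, v, mask)
    print(out.shape, "attended keys per query:", mask.sum(axis=1))
```

Because the per-query budget (window size + global tokens + random keys) is fixed, the number of nonzero attention entries grows linearly with sequence length rather than quadratically.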