This is the now extremely influential paper that introduced the Transformer architecture. I had learned about it from other sources, but only now read the actual paper. A lot has already been said about it, so there's no point in explaining it yet again; others have done a great job of that. Besides, if you're familiar with how RNNs and attention mechanisms work, the paper is not hard to read.
The paper has more than 10k citations now, only three years after it was published. BERT is getting there: 8k citations after two years. This also exemplifies the rate at which CS papers are coming out. At this pace, we'll soon need an "Arxiv Sanity Preserver" Sanity Preserver.