Self-attention enables the Transformer to capture global dependencies between pairs of tokens, but it has a computational complexity of O(N²) in the input length N, making it impractical for long sequences and high-resolution images.
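To make the source of the quadratic cost concrete, below is a minimal NumPy sketch of single-head self-attention (not the paper's implementation; the shapes and weight names are assumptions). The N×N score matrix materialized in the middle is what drives the O(N²) time and memory.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (N, d) token embeddings; Wq, Wk, Wv: (d, d) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # (N, d) each
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (N, N) -- quadratic in N
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                           # (N, d)

N, d = 1024, 64
rng = np.random.default_rng(0)
X = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)              # allocates an N x N score matrix
```

Doubling N quadruples both the size of `scores` and the work to compute it, which is why long sequences and high-resolution images (many patches) become impractical.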
We plot inference throughput against accuracy to visualize the Pareto front.
A model is Pareto optimal if and only if no other model is simultaneously more accurate and faster.
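The following sketch applies this definition to a small set of hypothetical (accuracy, throughput) measurements; the model names and numbers are illustrative only, not results from the paper.

```python
def pareto_front(models):
    """models: list of (name, accuracy, throughput) tuples.
    A model is kept if no other model dominates it, i.e. is at least as good
    on both axes and strictly better on at least one."""
    front = []
    for name, acc, thr in models:
        dominated = any(
            a >= acc and t >= thr and (a > acc or t > thr)
            for _, a, t in models
        )
        if not dominated:
            front.append((name, acc, thr))
    return front

models = [("A", 0.81, 1200), ("B", 0.84, 900), ("C", 0.80, 800)]
print(pareto_front(models))  # C is dropped: A is both more accurate and faster
```

Models A and B remain on the front because neither dominates the other: A is faster, B is more accurate.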