
Which Transformer to Favor


Abstract

  • Analyze the results based on the Pareto front - the boundary of optimal models
  • Hybrid attention-CNN models exhibit remarkable inference-memory and parameter efficiency
  • Our benchmark shows that using a larger model is generally more efficient than using higher-resolution images

Introduction

  • Self-attention enables the Transformer to capture global dependencies between pairs of tokens, but it has a computational complexity of $\mathcal{O}(N^2)$ in the input length $N$, making it impractical for long sequences and high-resolution images
  • We plot inference throughput against accuracy to visualize the Pareto front
  • A model is Pareto optimal if and only if no other model is both more accurate and faster (see the sketch below)
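
A minimal sketch of that definition in Python. The model names, throughput, and accuracy numbers below are illustrative placeholders, not results from the benchmark:

```python
# Minimal sketch: extract the Pareto front from (throughput, accuracy) pairs.
from typing import Dict, List, Tuple

def pareto_front(models: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the models for which no other model is both faster and more accurate."""
    front = []
    for name, (throughput, accuracy) in models.items():
        dominated = any(
            other_tp > throughput and other_acc > accuracy
            for other, (other_tp, other_acc) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Illustrative usage with made-up numbers (images/s, top-1 accuracy)
models = {"vit_b": (300.0, 0.81), "hybrid_s": (900.0, 0.80), "slow_big": (100.0, 0.79)}
print(pareto_front(models))  # slow_big is dominated by vit_b, which is faster and more accurate
```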

Core Elements of ViT

$$A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_{\text{head}}}}\right) \in \mathbb{R}^{N \times N}$$

$$X_{\text{out}} = AV = \text{softmax}\left(\frac{QK^T}{\sqrt{d_{\text{head}}}}\right)V$$
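
A minimal single-head PyTorch sketch of the equations above; the shapes and token count are illustrative, not tied to any specific model in the benchmark:

```python
import torch
import torch.nn.functional as F

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention as in the equations above.

    X: (N, d_model) token embeddings; W_q/W_k/W_v: (d_model, d_head) projections.
    The N x N attention matrix A is what costs O(N^2) time and memory.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_head = Q.shape[-1]
    A = F.softmax(Q @ K.T / d_head ** 0.5, dim=-1)   # (N, N)
    return A @ V                                      # (N, d_head)

# Illustrative shapes: 197 tokens (196 patches + [CLS]) of dimension 64
N, d = 197, 64
X = torch.randn(N, d)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)  # (197, 64)
```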

Efficient Transformers for Vision

Efficiency-Improving Changes

Token Mixing

  1. Low-rank attention
  2. Sparse attention
  3. Fixed attention
  4. Kernel attention (see the sketch after this list)
  5. Hybrid attention
  6. Fourier attention
    1. Uses the FFT to reduce the complexity to $\mathcal{O}(N \log N)$
  7. Non-attention shuffling
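
As a concrete illustration of kernel attention, here is a linear-attention sketch using the $\mathrm{elu}(x) + 1$ feature map from Katharopoulos et al. (2020); the exact kernel and normalization differ between the models covered by the survey, so this is only indicative of the general mechanism:

```python
import torch

def linear_attention(Q, K, V, eps=1e-6):
    """Kernel ("linear") attention sketch: replace the softmax with a feature map
    phi so the output can be computed as phi(Q) @ (phi(K)^T V) without ever
    materializing the N x N matrix -- roughly O(N d^2) instead of O(N^2 d)."""
    phi = lambda x: torch.nn.functional.elu(x) + 1.0        # positive feature map
    kv = phi(K).transpose(-2, -1) @ V                       # (d_head, d_head)
    normalizer = phi(Q) @ phi(K).sum(dim=-2, keepdim=True).transpose(-2, -1) + eps  # (N, 1)
    return (phi(Q) @ kv) / normalizer                        # (N, d_head)

# Illustrative shapes: 197 tokens of head dimension 64
N, d = 197, 64
Q, K, V = (torch.randn(N, d) for _ in range(3))
out = linear_attention(Q, K, V)                              # (197, 64)
```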

Token Sequence

  1. Token removal (see the sketch after this list)
  2. Token merging
  3. Summary tokens
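
A hedged sketch of token removal, in the spirit of pruning patch tokens by how much the [CLS] token attends to them (as in EViT-style approaches); the keep ratio, scoring rule, and function name are illustrative assumptions, not the criterion of any specific benchmarked model:

```python
import torch

def prune_tokens(X, cls_attention, keep_ratio=0.5):
    """Token removal sketch: drop the patch tokens the [CLS] token attends to least.

    X: (N, d) tokens with the [CLS] token at index 0;
    cls_attention: (N,) attention weights from [CLS] to every token.
    """
    n_patches = X.shape[0] - 1
    n_keep = max(1, int(keep_ratio * n_patches))
    scores = cls_attention[1:]                             # scores for patch tokens only
    keep_idx = scores.topk(n_keep).indices + 1             # shift past [CLS]
    keep_idx = torch.cat([torch.tensor([0]), keep_idx])    # always keep [CLS]
    return X[keep_idx.sort().values]                       # (1 + n_keep, d)

N, d = 197, 64
X = torch.randn(N, d)
cls_attention = torch.softmax(torch.randn(N), dim=0)
pruned = prune_tokens(X, cls_attention, keep_ratio=0.5)    # (99, 64)
```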

MLP Block

  1. More MLPs