

Overview of SLMs


Findings

The number of training tokens vs. the size of model parameters

  • The number of parameters in SLMs and the amount of training data (the number of tokens) are closely related: the Chinchilla law suggests that the optimal ratio of training tokens to model parameters is around 20 (e.g., a 1B-parameter model with 20B tokens); a quick calculation follows this list
  • A notable observation is that SLMs are trained on a much larger number of tokens (typically over 1.5T) than the Chinchilla law suggests, regardless of their parameter size
  • In other words, SLMs are significantly "over-trained" relative to the Chinchilla law
  • The rationale behind this approach is to obtain powerful SLMs that can be deployed on resource-constrained devices by spending more training-time FLOPs
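To make the over-training point concrete, here is a minimal back-of-the-envelope sketch. It is not from the source; the per-model token counts are illustrative assumptions (only the "over 1.5T tokens" figure comes from the text above).

```python
# Chinchilla-style rule of thumb: ~20 training tokens per model parameter.
CHINCHILLA_TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(num_params: float) -> float:
    """Compute-optimal number of training tokens for a given parameter count."""
    return CHINCHILLA_TOKENS_PER_PARAM * num_params

# Illustrative figures only, e.g. a 1B-parameter SLM trained on 1.5T tokens.
for params, actual_tokens in [(0.5e9, 1.5e12), (1e9, 1.5e12), (3e9, 3e12)]:
    optimal = chinchilla_optimal_tokens(params)
    print(f"{params / 1e9:.1f}B params: Chinchilla-optimal ≈ {optimal / 1e9:.0f}B tokens, "
          f"trained on {actual_tokens / 1e12:.1f}T "
          f"(~{actual_tokens / optimal:.0f}x over-trained)")
```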

The number of training tokens vs. model accuracy

  • In general, there is a positive correlation between the two, especially for models trained on fewer than 700B tokens
  • The correlation is weak overall, since data quality often outweighs the impact of additional training tokens, especially once the training data exceeds 1T tokens
  • Data quality typically matters more to the final SLM capability than data quantity or model architecture configuration

Latency and memory

  • Matrix-by-vector multiplication is the most time-consuming operation in SLM inference, accounting for more than 70% of end-to-end inference time
  • Context length is crucial for runtime memory usage: at a context length of 32,000, the KV cache takes up over 80% of runtime memory (see the estimate below)
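A rough estimate of how the KV cache grows with context length is sketched below. This is not from the source; the layer count, head count, head dimension, and fp16 dtype are hypothetical values chosen for illustration.

```python
def kv_cache_bytes(layers: int, heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size for one sequence: 2 tensors (K and V) per layer, fp16 by default."""
    return 2 * layers * heads * head_dim * context_len * bytes_per_elem

# Hypothetical ~1B-parameter SLM without grouped-query attention.
layers, heads, head_dim = 22, 32, 64
for ctx in (2_000, 8_000, 32_000):
    gib = kv_cache_bytes(layers, heads, head_dim, ctx) / 2**30
    print(f"context {ctx:>6}: KV cache ≈ {gib:.2f} GiB")
```

Under these assumed numbers, the 32,000-token KV cache reaches several GiB, several times larger than the roughly 2 GB of fp16 weights a 1B-parameter model carries, which is in the same ballpark as the 80% figure above.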

Quantization

The benefits of quantization for reducing inference latency on server-side GPUs likely stem from three factors:

  1. Higher computational throughput of Tensor Cores for int8 operations
  2. Reduced memory access overhead
  3. A decrease in heat generation due to the reduced memory access
  • On mobile devices such as the Jetson, support for int8 computation is lacking, but memory access overhead can still be reduced
  • This benefit comes from data compression due to the lower precision of activation values and parameters, which in turn improves cache utilization (a simple sketch of this compression follows the list)
  • For the prefill phase, quantization can reduce latency by at least 25% when the prompt is relatively short; this benefit diminishes as the prompt length increases
  • Quantization delivers more consistent gains during the decode stage, reducing decode latency by up to 75% and by no less than 17%, because the decode stage is more memory-bound
  • More regular quantization precision leads to better performance. Although 3-bit quantization offers a higher model compression rate, 4-bit quantization performs better in both the prefill and decode stages. The inferior performance of 3-bit quantization is due to its irregular bit-width, which lacks hardware optimization support and incurs additional overhead from data alignment and padding. As a result, despite its lower compression rate, 4-bit quantization is more efficient
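As a rough illustration of the data-compression effect that reduces memory traffic and improves cache utilization, here is a minimal numpy sketch of symmetric per-tensor int8 quantization. It is not from the source, and the tensor shape is an arbitrary assumption.

```python
import numpy as np

def quantize_int8_symmetric(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: x ≈ scale * q, with q in [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Hypothetical weight tensor, just for illustration.
w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8_symmetric(w)

print(f"fp32 size: {w.nbytes / 2**20:.1f} MiB, int8 size: {q.nbytes / 2**20:.1f} MiB")
print(f"mean abs reconstruction error: {np.abs(dequantize(q, scale) - w).mean():.4f}")
```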

Methods

  • Qn_K (Qn_K_M): quantization of a model to n bits using the k-quants method with a medium (M) number of parameters
  • Qn_0: symmetric quantization of a model to n bits, illustrated by the sketch below
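The sketch below shows what Qn_0-style symmetric block quantization looks like for n = 4. It is a simplified illustration, not llama.cpp's actual implementation: the 32-weight blocks each sharing one scale mirror the general scheme, but the 4-bit values are kept in int8 here rather than packed two per byte, and rounding details differ.

```python
import numpy as np

BLOCK_SIZE = 32  # assumed block size: each block of weights shares a single scale

def q4_0_quantize(weights: np.ndarray):
    """Symmetric 4-bit block quantization: one scale per block, values in [-8, 7]."""
    blocks = weights.reshape(-1, BLOCK_SIZE)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                      # avoid division by zero
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def q4_0_dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

w = np.random.randn(8 * BLOCK_SIZE).astype(np.float32)   # hypothetical weights
q, s = q4_0_quantize(w)
print("max reconstruction error:", np.abs(q4_0_dequantize(q, s) - w).max())
```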

GGUF (GPT-Generated Unified Format)

  • A model file format, commonly used for quantized models, that replaces the older GGML format
  • GGML is a C/C++ tensor library designed for machine learning that facilitates running LLMs either on a CPU alone or in tandem with a GPU; a loading example is sketched below
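For context, the snippet below shows one common way to load and run a GGUF model via the llama-cpp-python bindings. The model path is a placeholder and the parameter values are illustrative; nothing here is prescribed by this page.

```python
# Minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="models/slm-1b.Q4_K_M.gguf",  # placeholder path to any GGUF file
    n_ctx=4096,        # context window to allocate (drives KV-cache memory)
    n_gpu_layers=0,    # 0 = CPU only; increase to offload layers to a GPU
)

out = llm("Explain what a small language model is in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```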