Small Large Language Models
This list curates smaller Large Language Models (SLMs) that can be deployed to edge devices. For each entry it records the model origin, including the number of parameters and optimization techniques, and checks reproducibility. New models and variations appear every day, so this list is necessarily incomplete; feel free to add or update entries on GitHub.
Name | Cite | Release | Parent Model | Parameters | Opt.* | GPU req** | Repr*** | Note |
---|---|---|---|---|---|---|---|---|
GPT-2 | [^1^] | 2019 | GPT | 117M-1.5B | None | - | - | GPT-2 sm-xl, archived |
minGPT | - | 2022 | GPT2 | 117M-1.5B | None | None | tbd | Small reimplementation of GPT, semi-archived |
nanoGPT | - | 2023 | GPT2 | 117M-1.5B | None | None | Available | XL runs OOM; large (~5GB) works CPU-only |
picoGPT | - | 2023 | GPT2 | 117M-1.5B | None | None | Available | Unnecessarily small implementation |
Lit-GPT | - | 2023 | Various | 70M-180B | 4-8 bit quan. | tbd | tbd | Hackable implementation of open-source LLMs |
DistilGPT | [^2^] | 2020 | GPT2 | 82M | KD | tbd | tbd | Distilled from 124M GPT2 |
GPT-Neo | [^3^] | 2020 | GPT3 | 125M-2.7B | TF Mesh | Yes | tbd | TPU, GPU optimized, archived |
GPT-NeoX | [^4^] | 2022 | Megatron-LM | 20B | DeepSpeed | Yes | tbd | LLM training optimized |
OpenLLaMa | [^5^] | 2023 | LLaMa | 3-13B | EasyLM | 3GB-? | tbd | Reproduction of Meta’s LLaMa |
ExLlama | - | 2023 | LLaMa | 7-33B | 4-bit GPTQ | 3-21GB | tbd | Fast and memory-efficient on modern GPUs |
ExLlamav2 | - | 2023 | LLaMa 2 | 1.1-34B | 2.5-5 bit quan. | tbd | tbd | Running local LLMs on modern consumer GPUs |
TinyLLaMa | [^6^] | 2023 | LLaMa 2 | 1.1B | 2-8 bit quan. (AWQ, GPTQ, GGUF) | None - 30 | Available | Use case on edge devices |
MiniLLM | - | 2023 | LLaMa, BLOOM, OPT | 7B-65B | 4-bit quan. | 6-40GB | Available | Meant to run on consumer-grade GPUs |
GPTQ4LLaMa | - | 2023 | LLaMa | 7B-33B | 4-bit quan. | None | tbd | Semi-archived, superseded by AutoGPTQ |
MiniGPT-4 | [^7^][^8^] | 2023 | LLaMa 2, Vicuna V0 | 7B, 7-13B | 8-bit quan. | 11.5-23GB | tbd | Combination of LLaMa 2 with GPT-4-like capabilities, requires a GPU |
TinyGPT-V | - | - | - | - | - | - | - | - |
Mistral AI | [^9^] | 2023 | Mistral | 7B | SWA | tbd | tbd | Better than LLaMa 2 13B |
GGML LLaMa 2 | - | 2023 | LLaMa 2 | 7-70B | 2-8 bit quan. | 2-10GB RAM | - | Meta's Llama 2 7B GGML run with llama.cpp |
EdgeLM | [^9^] | 2022 | None | 9.4M | tbd | Unknown | No | Pre-trained model is unavailable, issue pending |
MiniLM | [^10^] | 2021 | XLMR, RoBERTa, BERT | 107-117M, 30-81M, 22-66M | tbd | tbd | tbd | Based on older transformers |
FastChat-T5 | - | 2023 | T5, LLaMa | 3B, 7B-33B | tbd | tbd | tbd | CPU-only possible, but needs at least 30GB |
Primer | [^9^] | 2021 | T5 | 1.9B | tbd | tbd | tbd | Search for efficient LM architectures |
Combiner | [^11^] | 2021 | Transformer | - | tbd | tbd | tbd | Sparse attention matrices |
EasyLM | - | - | LLaMa, GPT-J, RoBERTa | 7-65B | tbd | tbd | tbd | JAX/Flax implementation |
phi3-mini | [^12^] | 2024 | Phi | 3.8B | - | - | - | Designed for mobile |
Other models | - | - | - | - | - | - | - | - |
DallE Mini | - | 2022 | - | - | - | - | - | - |
Edge inference frameworks | - | - | - | - | - | - | - | - |
llama.cpp | - | - | - | - | - | - | - | - |
nitro | - | - | - | - | - | - | - | Production-ready implementation of llama.cpp |
jan | - | - | - | - | - | - | - | GUI for chat with nitro under the hood |
Other resources | - | - | - | - | - | - | - | - |
Edge-MoE | [^13^] | 2023 | - | - | - | - | - | Mixture-of-Experts for general Transformers and video |
AutoGPTQ | - | 2023 | LLaMa, Moss-Moon, GPT-J | 7B, 16B, 6B resp. | tbd | tbd | tbd | Work towards v1 |
bitsandbytes | [^14^][^15^] | 2022 | - | - | 8-bit quan. | - | - | Wrapper for CUDA quantization |
MLC LLM | [^16^] | 2023 | - | - | - | - | - | Machine Learning Compilation for LLMs |
MCUNet | [^17^] | 2020 | - | 0.75M-1.73M | - | - | - | - |
- * Optimization technique: Quantization (AWQ, GPTQ, GGUF/llama.cpp), Knowledge Distillation (KD), Sliding Window Attention (SWA); see the quantized-inference sketch below
- ** GPU requirements for inference, given as VRAM and compatible GPU types; see the VRAM check below
- *** Reproducibility, tested on a consumer-grade GPU and on edge devices
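As a reproducibility sketch for the GGUF/llama.cpp entries above, the snippet below loads a quantized model with the `llama-cpp-python` bindings and runs a short CPU-only completion. The model path and quantization level (`Q4_K_M`) are placeholders, not part of this list; substitute any GGUF export of a model from the table, e.g. TinyLLaMa.

```python
# Minimal sketch: CPU-only inference with a GGUF-quantized model via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="models/tinyllama-1.1b-chat.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_ctx=2048,   # context window
    n_threads=4,  # CPU threads; small quantized models fit without a GPU
)

out = llm("Q: What is a small language model? A:", max_tokens=64)
print(out["choices"][0]["text"])
```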
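For the GPU-requirement column, one rough way to check VRAM is to load a checkpoint in 4-bit via `transformers` + `bitsandbytes` and read the allocated memory afterwards. A minimal sketch, assuming a CUDA GPU and using a published TinyLlama checkpoint as an example; the exact numbers will vary with model and quantization settings.

```python
# Minimal sketch: 4-bit loading with transformers + bitsandbytes, then report VRAM use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # example checkpoint, swap in any table entry
quant_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_cfg, device_map="auto"
)

# Rough estimate of how much VRAM the quantized weights occupy.
print(f"~{torch.cuda.memory_allocated() / 1e9:.1f} GB VRAM allocated")
```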
References
[^12^]: Abdin, M., Jacobs, S. A., Awan, A. A., Aneja, J., Awadallah, A., Awadalla, H., … & Zhou, X. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv preprint arXiv:2404.14219.