Small Large Language Models
This list curates smaller Large Language Models (SLMs) that can be deployed to edge devices. For each entry it records the model origin, including the number of parameters and optimization techniques, and checks reproducibility. New models and variations appear every day, so this list is necessarily incomplete; feel free to add or update entries on GitHub.
Name | Cite | Release | Parent Model | Parameters | Opt.* | GPU req** | Repr*** | Note |
---|---|---|---|---|---|---|---|---|
GPT-2 | [^1^] | 2019 | GPT | 117M-1.5B | None | - | - | GPT-2 sm-xl, archived |
minGPT | - | 2022 | GPT2 | 117M-1.5B | None | None | tbd | Small reimplementation of GPT, semi-archived |
nanoGPT | - | 2023 | GPT2 | 117M-1.5B | None | None | Available | XL runs OOM; large (~5GB) works CPU-only |
picoGPT | - | 2023 | GPT2 | 117M-1.5B | None | None | Available | Unnecessarily small implementation |
Lit-GPT | - | 2023 | Various | 70M-180B | 4-8 bit quan. | tbd | tbd | Hackable implementation of open-source LLMs |
DistilGPT | [^2^] | 2020 | GPT2 | 82M | KD | tbd | tbd | Distilled from 124M GPT2 |
GPT-Neo | [^3^] | 2020 | GPT3 | 125M-2.7B | TF Mesh | Yes | tbd | TPU, GPU optimized, archived |
GPT-NeoX | [^4^] | 2022 | Megatron-LM | 20B | DeepSpeed | Yes | tbd | LLM training optimized |
OpenLLaMa | [^5^] | 2023 | LLaMa | 3-13B | EasyLM | 3GB-? | tbd | Reproduction of Meta’s LLaMa |
ExLlama | - | 2023 | LLaMa | 7-33B | 4-bit GPTQ | 3-21GB | tbd | Fast and memory-efficient on modern GPUs |
ExLlamav2 | - | 2023 | LLaMa 2 | 1.1-34B | 2.5-5 bit quan. | tbd | tbd | Running local LLMs on modern consumer GPUs |
TinyLLaMa | [^6^] | 2023 | LLaMa 2 | 1.1B | 2-8 bit quan. (AWQ, GPTQ, GGUF) | None - 30 | Available | Use case on edge devices |
MiniLLM | - | 2023 | LLaMa, BLOOM, OPT | 7B-65B | 4-bit quan. | 6-40GB | Available | Meant to run on consumer-grade GPUs |
GPTQ4LLaMa | - | 2023 | LLaMa | 7B-33B | 4-bit quan. | None | tbd | Semi-archived, superseded by AutoGPTQ |
MiniGPT-4 | [^7^][^8^] | 2023 | LLaMa 2, Vicuna V0 | 7B, 7-13B | 8-bit quan. | 11.5-23GB | tbd | Combination of LLaMa 2 with GPT-4-like capabilities, requires a GPU |
TinyGPT-V | - | - | - | - | - | - | - | - |
Mistral AI | [^9^] | 2023 | Mistral | 7B | SWA | tbd | tbd | Better than LLaMa 2 13B |
GGML LLaMa 2 | - | 2023 | LLaMa 2 | 7-70B | 2-8 bit quan. | 2-10GB RAM | - | Meta's Llama 2 7B GGML run with llama.cpp |
EdgeLM | [^9^] | 2022 | None | 9.4M | tbd | Unknown | No | Pre-trained model is unavailable, issue pending |
MiniLM | [^10^] | 2021 | XLMR, RoBERTa, BERT | 107-117M, 30-81M, 22-66M | tbd | tbd | tbd | Based on older transformers |
FastChat-T5 | - | 2023 | T5, LLaMa | 3B, 7B-33B | tbd | tbd | tbd | CPU-only possible, but needs at least 30GB |
Primer | [^9^] | 2021 | T5 | 1.9B | tbd | tbd | tbd | Search for efficient LM architectures |
Combiner | [^11^] | 2021 | Transformer | - | tbd | tbd | tbd | Sparse attention matrices |
EasyLM | - | - | LLaMa, GPT-J, RoBERTa | 7-65B | tbd | tbd | tbd | JAX/Flax implementation |
phi3-mini | [^12^] | 2024 | Phi | 3.8B | - | - | - | Designed for mobile |
Other models | - | - | - | - | - | - | - | - |
DallE Mini | - | 2022 | - | - | - | - | - | - |
Edge inference frameworks | - | - | - | - | - | - | - | - |
llama.cpp | - | - | - | - | - | - | - | - |
nitro | - | - | - | - | - | - | - | Production-ready implementation of llama.cpp |
jan | - | - | - | - | - | - | - | GUI for chat with nitro under the hood |
Other resources | - | - | - | - | - | - | - | - |
Edge-MoE | [^13^] | 2023 | - | - | - | - | - | Mixture-of-Experts for general Transformers and video |
AutoGPTQ | - | 2023 | LLaMa, Moss-Moon, GPT-J | 7B, 16B, 6B resp. | tbd | tbd | tbd | Work towards v1 |
bitsandbytes | [^14^][^15^] | 2022 | - | - | 8-bit quan. | - | - | Wrapper for CUDA quantization |
MLC LLM | [^16^] | 2023 | - | - | - | - | - | Machine Learning Compilation for LLMs |
MCUNet | [^17^] | 2020 | - | 0.75M-1.73M | - | - | - | - |
- * Optimization technique: Quantization (AWQ, GPTQ, GGUF/llama.cpp), Knowledge Distillation (KD), Sliding Window Attention (SWA); see the quantized-inference sketch below
- ** GPU requirements for inference, given as VRAM and compatible GPU types; see the VRAM check below
- *** Reproducibility, tested on a consumer-grade GPU and on edge devices
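As a reproducibility sketch for the GGUF/llama.cpp entries above, the snippet below loads a quantized model with the `llama-cpp-python` bindings and runs a short CPU-only completion. The model path and quantization level (`Q4_K_M`) are placeholders, not part of this list; substitute any GGUF export of a model from the table, e.g. TinyLLaMa.

```python
# Minimal sketch: CPU-only inference with a GGUF-quantized model via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="models/tinyllama-1.1b-chat.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_ctx=2048,   # context window
    n_threads=4,  # CPU threads; small quantized models fit without a GPU
)

out = llm("Q: What is a small language model? A:", max_tokens=64)
print(out["choices"][0]["text"])
```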
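For the GPU-requirement column, one rough way to check VRAM is to load a checkpoint in 4-bit via `transformers` + `bitsandbytes` and read the allocated memory afterwards. A minimal sketch, assuming a CUDA GPU and using a published TinyLlama checkpoint as an example; the exact numbers will vary with model and quantization settings.

```python
# Minimal sketch: 4-bit loading with transformers + bitsandbytes, then report VRAM use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # example checkpoint, swap in any table entry
quant_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_cfg, device_map="auto"
)

# Rough estimate of how much VRAM the quantized weights occupy.
print(f"~{torch.cuda.memory_allocated() / 1e9:.1f} GB VRAM allocated")
```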
References
[^12^]: Abdin, M., Jacobs, S. A., Awan, A. A., Aneja, J., Awadallah, A., Awadalla, H., … & Zhou, X. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv preprint arXiv:2404.14219.