Small Large Language Models

This list curates small Large Language Models (SLMs) that can be deployed to edge devices. For each model it records the origin (parent model), the number of parameters, the optimization technique, the GPU requirements, and whether the results are reproducible. Since new models and variants appear every day, this list is necessarily incomplete; feel free to add or update entries on GitHub.

| Name | Cite | Release | Parent Model | Parameters | Opt.* | GPU req.** | Repr.*** | Note |
|------|------|---------|--------------|------------|-------|------------|----------|------|
| GPT-2 | [^1^] | 2019 | GPT | 117M-1.5B | None | - | - | GPT-2 sm-xl, archived |
| minGPT | | 2022 | GPT-2 | 117M-1.5B | None | None | tbd | Small reimplementation of GPT, semi-archived |
| nanoGPT | | 2023 | GPT-2 | 117M-1.5B | None | None | Available | XL runs OOM; large (~5GB) works CPU-only |
| picoGPT | | 2023 | GPT-2 | 117M-1.5B | None | None | Available | Unnecessarily small implementation |
| Lit-GPT | | 2023 | Various | 70M-180B | 4-8 bit quant. | tbd | tbd | Hackable implementation of open-source LLMs |
| DistilGPT | [^2^] | 2020 | GPT-2 | 82M | KD | tbd | tbd | Distilled from the 124M GPT-2 |
| GPT-Neo | [^3^] | 2020 | GPT-3 | 125M-2.7B | TF Mesh | Yes | tbd | TPU/GPU optimized, archived |
| GPT-NeoX | [^4^] | 2022 | Megatron-LM | 20B | DeepSpeed | Yes | tbd | Optimized for LLM training |
| OpenLLaMA | [^5^] | 2023 | LLaMA | 3-13B | EasyLM | 3GB-? | tbd | Reproduction of Meta's LLaMA |
| ExLlama | | 2023 | LLaMA | 7-33B | 4-bit GPTQ | 3-21GB | tbd | Fast and memory-efficient on modern GPUs |
| ExLlamaV2 | | 2023 | Llama 2 | 1.1-34B | 2.5-5 bit quant. | tbd | tbd | Runs local LLMs on modern consumer GPUs |
| TinyLlama | [^6^] | 2023 | Llama 2 | 1.1B | 2-8 bit quant. (AWQ, GPTQ, GGUF) | None - 30 | Available | Use case on edge device |
| MiniLLM | | 2023 | LLaMA, BLOOM, OPT | 7B-65B | 4-bit quant. | 6-40GB | Available | Meant to run on consumer-grade GPUs |
| GPTQ4LLaMa | | 2023 | LLaMA | 7B-33B | 4-bit quant. | None | tbd | Semi-archived, succeeded by AutoGPTQ |
| MiniGPT-4 | [^7^][^8^] | 2023 | Llama 2, Vicuna V0 | 7B, 7-13B | 8-bit quant. | 11.5-23GB | tbd | Combination of Llama 2 and GPT-4, but requires a GPU |
| TinyGPT-V | - | - | - | - | - | - | - | - |
| Mistral AI | [^9^] | 2023 | Mistral | 7B | SWA | tbd | tbd | Better than Llama 2 13B |
| GGML Llama 2 | | 2023 | Llama 2 | 7-70B | 2-8 bit quant. | 2-10GB RAM | - | Meta's Llama 2 7B in GGML format, run with llama.cpp |
| EdgeLM | [^9^] | 2022 | None | 9.4M | tbd | Unknown | No | Pre-trained model unavailable, issue pending |
| MiniLM | [^10^] | 2021 | XLM-R, RoBERTa, BERT | 107-117M, 30-81M, 22-66M | tbd | tbd | tbd | Based on older transformers |
| FastChat-T5 | | 2023 | T5, LLaMA | 3B, 7B-33B | tbd | tbd | tbd | Allows CPU-only, but min. 30GB |
| Primer | [^9^] | 2021 | T5 | 1.9B | tbd | tbd | tbd | Identification of efficient LM architectures |
| Combiner | [^11^] | 2021 | Transformer | - | tbd | tbd | tbd | Sparse attention matrices |
| EasyLM | | - | LLaMA, GPT-J, RoBERTa | 7-65B | tbd | tbd | tbd | JAX/Flax implementation |
| phi-3-mini | [12] | 2024 | Phi | 3.8B | - | - | - | Designed for mobile |
| **Other models** | - | - | - | - | - | - | - | - |
| DALL-E Mini | | 2022 | - | - | - | - | - | - |
| **Edge Inference frameworks** | - | - | - | - | - | - | - | - |
| llama.cpp | - | - | - | - | - | - | - | - |
| nitro | - | - | - | - | - | - | - | Production-ready implementation of llama.cpp |
| jan | - | - | - | - | - | - | - | GUI for chat, with nitro under the hood |
| **Other resources** | - | - | - | - | - | - | - | - |
| Edge-MoE | [^13^] | 2023 | - | - | - | - | - | Mixture-of-Experts for general Transformers and video |
| AutoGPTQ | | 2023 | LLaMA, Moss-Moon, GPT-J | 7B, 16B, 6B resp. | tbd | tbd | tbd | Work towards v1 |
| bitsandbytes | [^14^][^15^] | 2022 | - | - | 8-bit quant. | - | - | Wrapper for CUDA quantization |
| MLC LLM | [^16^] | 2023 | - | - | - | - | - | Machine Learning Compilation for LLMs |
| MCUNet | [^17^] | 2020 | - | 0.75M-1.73M | - | - | - | - |
- \* Optimization technique: quantization (AWQ, GPTQ, GGUF/llama.cpp), Knowledge Distillation (KD), Sliding Window Attention (SWA); see the inference sketch below these notes
- \*\* GPU requirements for inference, tested on VRAM and compatible GPU types
- \*\*\* Reproducibility, tested on a consumer-grade GPU and edge devices
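Several of the deployable entries above (TinyLlama, GGML Llama 2, llama.cpp, nitro, jan) come down to running quantized GGUF weights through llama.cpp. As a rough illustration of what CPU-only edge inference looks like, here is a minimal sketch using the llama-cpp-python bindings; the model path, thread count, and sampling parameters are placeholder assumptions, not values taken from the table.

```python
# Minimal sketch: CPU-only inference with a 4-bit GGUF model via llama-cpp-python.
# Assumes `pip install llama-cpp-python` and a locally downloaded GGUF file;
# the path below is a placeholder, not something shipped with this list.
from llama_cpp import Llama

llm = Llama(
    model_path="models/tinyllama-1.1b-chat.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,        # context window
    n_threads=4,       # CPU threads; tune for the target edge device
    n_gpu_layers=0,    # 0 = pure CPU inference
)

out = llm(
    "Q: Name one advantage of small language models on edge devices. A:",
    max_tokens=64,
    temperature=0.7,
    stop=["Q:"],
)
print(out["choices"][0]["text"].strip())
```

With a 4-bit 1.1B model the resident memory footprint is typically well under 2GB, which is roughly in line with the low end of the RAM figures listed for GGML Llama 2 above.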

References

[12] Abdin, M., Jacobs, S. A., Awan, A. A., Aneja, J., Awadallah, A., Awadalla, H., … & Zhou, X. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv preprint arXiv:2404.14219.