Small language models, the more efficient, specialised siblings of LLMs, are flying under the radar of the AI community.
“At the moment, I think SLMs are underutilized in industry, largely due to a mismatch between perception and reality,” AWS engineer Owais Kazi tells The Stack.
At a glance, the LLM boom appears to have overshadowed progress in small language models (SLMs). These models, such as Google’s Gemma family and Meta’s lightweight Llama models, are roughly considered to fall between 1 million and 10 billion parameters, compared to the trillions in newer flagship LLMs like OpenAI’s GPT-5.
However, machine learning academics have published paper after paper on the capacity of SLMs to reduce compute consumption, cut down on costly cloud bills, and, at times, be a more appropriate fit for tasks.
"Much of the enterprise narrative still equates ‘better AI’ with ‘larger models,’ even though many real-world workloads don’t require the broad generalization capabilities of LLMs," Kazi tells The Stack while discussing his work on a paper exploring the use of SLMs for “efficient agentic tool calling”.
SLMs are trained on smaller data sets, often derived from LLM training data, and run on fewer parameters during inference. These smaller models have lower compute and training costs, and can run on edge devices, saving further on cloud bills.
Misconceptions about SLM performance, and a lack of awareness on the technicalities of where they are best deployed, mean companies often "default to LLMs" for their AI use cases, says Kazi, "despite higher cost, latency, and operational complexity."
The SLM hype
A large language model is an AI computational model that can generate natural language. LLMs are trained on massive data sets and usually require cloud-scale compute resources to run. Small language models have fewer parameters and are “fine-tuned” on a small data set for a specific task.
In the research world, "there's a lot of hype about SLMs," says Dr Federico Nanni, a senior research data scientist at the UK’s Alan Turing Institute, who leads research in adapting language models.
He points out the models we now dub ‘small’ have been around since the dawn of language models in the late 20th century, but “now that we have such large frontier models that we can’t deploy on-premise, the models we can deploy on site we’ve started calling SLMs.”
He tells The Stack: "We had these couple of years after ChatGPT came out [in 2022] and we thought our profession was over. And it was 'Well, now we have this tool and we don't need to do fine tuning anymore.'
"But, with the advent of small language models, I think people in research feel that we are doing our work again, we can try all new ideas with models that run on our servers that we can modify."
The LLM boom has also had the side effect of making it easier to train next-generation SLMs. The scraped and synthesized training data for LLMs can be distilled into smaller, more accurate data sets for training smaller models.
Nanni works on 'Project t0' at the Turing Institute, exploring where leaner, locally run models can be used in situations with higher privacy and efficiency requirements.
The project built a small language model based on Alibaba’s open-weight Qwen2.5 LLMs that achieved “near-frontier reasoning performance on a complex real-world task” it was trained on. At 3 billion parameters, the model was lightweight enough to run on a laptop.
Agentic applications
One area where SLMs have seen fresh attention is agentic AI. Agentic AI systems largely rely on language models to perform the same task again and again, with little variation. SLMs excel when used as specialised tools for specific tasks, says Nanni, who is currently researching agentic applications with Project t0.
An NVIDIA paper released in June 2025 said SLMs are “sufficiently powerful, inherently more suitable, and necessarily more economical for many invocations in agentic systems.” The small models are built for "specialisation and iterative refinement," key elements for an effective agent.
The NVIDIA researchers argue SLMs are “more operationally suitable” for use in agentic systems than LLMs for reasons including that they are easier and quicker to train for specialised agentic routines, less likely to hallucinate due to their specialized training, and more affordable at an inference and training level.
Similarly, Kazi's team at AWS found specially trained SLMs can produce better results for agentic tool calling, an agent’s ability to interact with third-party tools like APIs, MCP servers and applications, than larger models.
Their fine-tuned version of Meta's 2022 OPT-350M model achieved a 77.55% pass rate on the ToolBench evaluation, above benchmark chain of thought (CoT) models like ChatGPT-CoT (26%), and ToolLLaMA-DFS (30.18%).
The AWS researchers said 350 million parameters was a "strategic sweet spot for tool-calling applications,” providing enough capacity to manage the tasks but without "the complexity overhead" that can confuse larger models and lead to inconsistent outputs.
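To make concrete what tool calling asks of a model: the model emits a narrow, structured call, and the agent runtime parses it and dispatches to the real function. The sketch below is illustrative only, not the AWS team's actual harness; the `get_weather` and `search_docs` tools are hypothetical stubs.

```python
import json

# Hypothetical tool registry: name -> callable. In a real agent these
# would wrap APIs or MCP servers; here they are stubs for illustration.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
    "search_docs": lambda query: f"3 results for '{query}'",
}

def dispatch(model_output: str) -> str:
    """Parse the model's JSON tool call and invoke the matching tool."""
    call = json.loads(model_output)
    tool = TOOLS[call["tool"]]
    return tool(**call["arguments"])

# A well-trained SLM only needs to reliably emit this narrow format:
print(dispatch('{"tool": "get_weather", "arguments": {"city": "Helsinki"}}'))
# prints: Sunny in Helsinki
```

Because the model's entire job is to produce this small, checkable format, a 350-million-parameter model fine-tuned on such calls has far less room to drift than a general-purpose LLM.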
Cutting costs, staying local
LLMs, the most common models used for AI workloads, require large amounts of compute, which has sent data centre expenditures soaring for large enterprises and hyperscalers.
A report by Goldman Sachs published in February said enterprises expected data centre power demands to increase by 165% by 2030, driven by AI use. At a smaller level, a 2025 Akamai survey of EMEA businesses found 67% expected cloud costs to rise in the next year, as AI investment drives up cloud computing workloads.
SLMs, able to run on fewer GPUs either in the cloud or locally, cost less to operate than LLMs and carry smaller infrastructure demands.
IBM is one company finding success with SLMs, it tells The Stack. Its Client Zero strategy, targeting $4.5 billion in productivity gains through the use of GenAI and similar tools, leans largely on SLMs, with 155 AI use cases across departments such as finance, procurement and HR, each with a tailored SLM.
The company, which built its own commercial family of Granite SLMs, uses the specialised AI models for tasks including answering IT support and HR tickets, automating tax processes, and finding logistics savings in its supply chain organisation.
On an episode of IBM’s Mixture of Experts podcast, Shobhit Varshney, a senior partner at IBM Consulting, said the company had found SLMs “deliver much higher accuracy with a much, much smaller footprint,” adding “that’s where you get gold, [on] the return on investment you get from these small models that you can then fine tune, and then run on devices.”
Life at the edge
Edge use cases also show significant potential for SLMs, says Kazi, especially "where latency, privacy and compute limits are critical."
Speaking to The Stack at Finland’s 2025 AI Summit, Hugging Face co-founder Thomas Wolf said running SLMs directly on edge devices would “diffuse” a “huge concentration of electricity consumption… which is a great recipe for resilience and a healthier way to power our AI system.”
Small models that can be used locally are faster. Speed is a “real advantage” of SLMs, explains Nanni, using a coding autocomplete feature as an example. “If you have a model running on your laptop, that one will be way faster than relying on an endpoint.
“Because of latency, because all your information is here and accessible, instead of [your device] sending whatever snippets of code you're writing to an endpoint and getting it back.”
Privacy wins
The ability to run smaller, isolated models on local infrastructure also has obvious privacy benefits, Wolf tells The Stack. "If it's very close to you, it can instantly reply to requests, or it can be very private and have access to extremely personal data."
Nanni further explains, if “you have a model running on your machine, and where the data is also on your machine, then nothing gets out” to wider networks, or the internet.
Despite being more sceptical about SLM performance, Peter Schneider, senior product manager at software company Qt Group, says some Qt customers are willing to take performance trade-offs to keep coding AI models in their private cloud.
He tells The Stack: "Customers coming out of particularly regulated industries, so medical, defence, etc, they will not connect to some of these public clouds. So they're willing to take the trade-off as an enterprise for the lower software quality to use these models that they can either deploy in their own data center, or locally on a high-end Mac."
How to fine-tune
For all of the tasks outlined, SLMs perform best with fine-tuning, in which a model is trained on specific data sets or taught to complete tasks in a specialised way.
Testing various techniques for fine-tuning models can be time-intensive and costly for a budget-constrained developer team. But Kazi says tooling is becoming “increasingly accessible,” with his team using Hugging Face’s Transformer Reinforcement Learning (TRL) library for its research.
The library offers a full stack of open-source tools for training transformer language models, the neural networks driving the current AI boom, including the Supervised Fine-Tuning trainer, an offline training method that covers pre-processing, tokenization, and loss computation.
The fine-tuning process can also be as simple as applying prompt engineering techniques like budget forcing. As Nanni explains: “When you see that the model thinks that it has finished the reasoning process, you just force the model to continue thinking about it.”
Knowledge barriers
While Kazi says “from a purely infrastructure standpoint, many enterprise teams already have everything they need” to deploy SLMs, he warns that "the expertise to use them well is unevenly distributed.”
"Effective fine-tuning requires an understanding of dataset design, task framing, evaluation metrics, and failure modes.
"In our work, the performance gains did not come from scale alone, but from task-specific supervision like tool calling and careful alignment with the deployment objective… Teams that invest in ML literacy and experimentation can move quickly, but without that investment, enterprises may struggle to extract the same value from fine-tuned SLMs.”
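The dataset-design point can be made concrete: supervised fine-tuning data for tool calling is typically a set of prompt/completion pairs where the completion is the exact structured call the agent runtime expects, so every training target can be validated mechanically. The field names and tools below are hypothetical, chosen for illustration rather than taken from the AWS team's format.

```python
import json

# Hypothetical prompt/completion pairs for tool-calling fine-tuning.
# The completion is the exact structured output the runtime will parse,
# so the supervision target is narrow and machine-checkable.
examples = [
    {
        "prompt": "User: What's the weather in Paris?\nAssistant:",
        "completion": json.dumps(
            {"tool": "get_weather", "arguments": {"city": "Paris"}}
        ),
    },
    {
        "prompt": "User: Find docs about retries.\nAssistant:",
        "completion": json.dumps(
            {"tool": "search_docs", "arguments": {"query": "retries"}}
        ),
    },
]

def is_valid_target(example: dict) -> bool:
    """Dataset hygiene check: every training target must parse and
    carry the fields the agent runtime expects."""
    call = json.loads(example["completion"])
    return {"tool", "arguments"} <= call.keys()

print(all(is_valid_target(e) for e in examples))  # prints True
```

Running a check like this over the whole training set before fine-tuning is one cheap form of the "careful alignment with the deployment objective" Kazi describes.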
The simplest trick for enterprises looking to trial an SLM is to carefully define the use case before selecting a model, says Nanni. He encourages companies to assess whether an LLM would even be suitable for a selected task before spending time testing the capabilities of a smaller model. The thinking: why look small if big models already struggle?
If early testing shows potential for the task at hand, “then that's the angle where you can do some research,” he says, “it could be about fine-tuning, it could be about improving the reasoning process... it could be you do some data cleaning."
Where to embrace SLMs
Nanni says he is “generally a bit sceptical” that SLMs would be able to outperform LLMs, explaining that if the fine-tuning required to achieve such results for any one task was applied to a larger model it would likely be even more successful.
However, the novelty here is that a small model that can be deployed on-prem or even on a device can nearly match an LLM. “Knowing that a small model like 4B parameters or 1B parameters could perform similarly to GPT-5, that's already shocking. You can have a small model on your phone and that will perform similarly to a frontier model,” Nanni says.
Schneider says using SLMs often "comes at a discount" on performance, but many of his customers are going for a "hybrid architecture," running isolated tasks like code completion on smaller, local and open-source models while turning to larger, proprietary models for more advanced use cases where they excel.
Kazi says SLMs are best utilised in “domains where the task space is narrow, repeatable and well-defined,” such as search and retrieval orchestration, enterprise automation, and domain-specific assistants. In these areas “alignment and specialisation often matter more than raw model size."