“2025 is the year of agents!” That is what I heard repeatedly at cloud events. And as soon as I walk out of the conference rooms, I see everybody rushing to build agents with “Einstein-level reasoning”. The models are getting smarter at an incredible speed, so why not simply make API calls to the cloud providers and let the big models upgrade themselves? But have you ever noticed that the more parameters these LLMs have, the more they cost and the slower they become? Did you ever try calculating the API cost and end up with a MILLION-DOLLAR BUDGET?
Here is what cloud providers never tell you: this year also marks the maturity of fine-tuning techniques on an UNBELIEVABLY SMALL BUDGET (or even FREE for public use). It can be SUPER COST-EFFECTIVE to take a small pre-trained model and tune it further for a very prescriptive task, for example classifying documents or extracting specific fields from a scanned invoice. These use cases produce responses in a preset format and are therefore far less prone to off-script answers. The small size also means LESS MAINTENANCE EFFORT AND FASTER RESPONSE TIME. In this article, I will delve deeper into the cost analysis and then give examples of use cases and techniques for effective fine-tuning.

The cost of bigger models is both time and money
The big LLM models are getting smarter at an unprecedented speed; everybody is tip-toeing to catch up all the time. As an enterprise user, the general approach nowadays is to focus on building applications that consume cloud-provided APIs and let the models evolve behind those APIs. There is no point competing with the big tech companies when it comes to training models. However, did you notice that the cost of LLMs has skyrocketed from $5/Mtoken to $15/Mtoken (GPT-o1)? If you have ever done a cost calculation for your company’s internal chatbot, that means hundreds of thousands of dollars even at a low estimate. If the bot is used by “just” a few thousand internal employees who end up loving it and keep bombarding it with questions, the cost will soon approach the low millions. Oh, and that is only for one application. If you have a few more AI products in the pipeline, be ready to pay millions more this year. So shall I say, “This is the year of million-dollar LLM bills?”
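To make the numbers concrete, here is a back-of-envelope estimate in Python. The employee count, usage pattern, token sizes, and price per million tokens are all my own illustrative assumptions, not quotes from any provider:

```python
# Back-of-envelope API cost estimate for an internal chatbot.
# All numbers below are illustrative assumptions, not figures from any provider.
employees          = 3_000     # internal users
questions_per_day  = 10        # questions per employee per working day
tokens_per_request = 4_000     # prompt + retrieved context + answer
price_per_mtoken   = 15.0      # USD per million tokens (premium reasoning model)
working_days       = 250

annual_tokens = employees * questions_per_day * tokens_per_request * working_days
annual_cost   = annual_tokens / 1_000_000 * price_per_mtoken
print(f"~${annual_cost:,.0f} per year")  # ~$450,000 -- and that is a single application
```

Tweak the usage assumptions slightly upward and you land in seven figures very quickly.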
What cloud providers will never tell you is that big models come with more thinking and more overhead for parallelization, leading to even longer waiting times (oh, unless you can pay more!). Under normal conditions, a large model may take 10 seconds to read a long context and give an answer; but when hundreds of people use it at the same time… ollama! Before you realize it, your chatbot experience becomes a one-sided conversation, while your batch processing pipeline keeps failing overnight because “Token limit exceeded.” A good tip for companies: before engaging with an external LLM provider, ask how many tokens or requests per minute they can commit to at the listed cost, because, as we all know, cheaper is not always better!
Understand when to buy…
Last year, when GPT-3 was evolving into GPT-4 and then 4o/4o-mini, enterprises were generally advised to avoid fine-tuning, because hosting a fine-tuned model could cost 40,000 USD per year, plus the extra training time and resources. We instead focused on developing RAG chatbots, optimizing chunking and ranking techniques to improve search accuracy. Retrieval mattered more because, as long as we could find the correct documents, the models were usually smart enough to provide near-accurate answers. In fact, the models are getting smarter at an incredible speed, so many companies nowadays intend to call LLMs through APIs and simply switch to a new model whenever one is released.
Another reason for not hosting our own models is guardrails and security. In general, if you have a Q&A application, you have to screen the model for all sorts of unethical answers, plus industry-specific guidelines like “do not give tax advice” (or “do not give out hints on hacking our company!”). By outsourcing the model hosting, we also outsource part of the liability while waiting for the big tech companies to improve their models’ ethical constraints.
And when to build
According to the OpenAI Agents SDK guidelines, we should try using small models for simpler tasks. Not every task requires the most intelligent model; for example, a simple retrieval or intent classification task can be completed by a smaller, faster model, while more complex tasks, such as deciding whether to issue a refund, might require a more capable model. To find out whether smaller models still give acceptable results, test them. This way, we can determine where smaller models succeed or fail without pre-constraining the agent’s capabilities.
In addition, to alleviate the need for moral guardrails, we can choose use cases where the response follows a preset template or simple format. There are many techniques to force the model to generate only tokens that satisfy certain constraints. By using structured-generation libraries such as Guidance or Outlines, we can easily limit the response to a regex pattern or a JSON schema with pre-determined field names, as in the sketch below. This way, the model has little room for free text and creative, potentially harmful speech.
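Here is a minimal sketch using the Outlines library to pin a small model’s output to a JSON schema for invoice extraction. The model name and field names are my own assumptions, and the code follows the pre-1.0 Outlines interface, so treat it as a pattern rather than copy-paste code:

```python
# Minimal sketch: constraining output to a JSON schema with Outlines.
# Model and field names are illustrative; the Outlines API differs between versions.
from pydantic import BaseModel
import outlines

class InvoiceFields(BaseModel):
    vendor_name: str
    invoice_number: str
    total_amount: float

# Assumed checkpoint; any instruction-tuned model on the Hugging Face Hub works similarly.
model = outlines.models.transformers("meta-llama/Llama-3.2-3B-Instruct")
generator = outlines.generate.json(model, InvoiceFields)

result = generator(
    "Extract the vendor name, invoice number and total amount from:\n"
    "ACME Pty Ltd, Invoice #INV-0042, Total due: $1,250.00"
)
print(result)  # an InvoiceFields instance -- structured output, no free-form speech
```

Because decoding is masked to tokens that keep the output valid against the schema, the model physically cannot wander into free-text territory.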
Your toolkit for cost-efficient LLM tuning and hosting
a) Small models
Nowadays, even the small models are getting smarter as they are trained on trillions of tokens; some outperform the big models of the past. Microsoft’s Phi-3.5-mini, with about 4B parameters, outperforms models two to three times its size such as Mistral-Nemo-12B, Llama-3.1-8B and Gemma-2-9B. Qwen2.5-VL-3B is ranked 6th on the leaderboard for Document Visual Question Answering, with Snowflake’s 0.8B-parameter Arctic-TILT around the 10th spot, amidst far bigger models.
b) Fine-tuning on a 16GB GPU
If we want to invest a bit more time in fine-tuning, small models with up to 8B parameters can now be easily tuned on a single 16GB GPU, which costs less than $1/hour on AWS (a few hundred dollars a month). Fine-tuning a Llama-3.2-3B model takes only about 4 GB of memory with the QLoRA technique, and about 3 GB with the Unsloth library.
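To show how little code this now takes, below is a minimal QLoRA setup sketch using Hugging Face transformers, bitsandbytes, and peft. The model name and hyperparameters are assumptions for illustration; Unsloth and Llama-Factory wrap a similar recipe behind their own configuration:

```python
# Minimal QLoRA sketch: 4-bit base weights + small trainable LoRA adapters.
# Model name and hyperparameters are illustrative; adjust for your task and GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # assumed model; swap in your own

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit base weights keep the footprint small
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters are trained
```

From here, the model can be handed to any standard trainer; only the small adapter weights are updated, which is what keeps the memory footprint within a 16GB card.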
For perspective, in the past each parameter used to be represented by a 32-bit floating-point number. Loading 1 billion parameters therefore took 4 gigabytes of memory (1B parameters × 32 bits/parameter ÷ 8 bits/byte). To fine-tune a model, one needs additional memory for gradients and optimizer states (see the table below). Altogether, it takes about 24 GB of GPU memory just to train a model with 1B parameters! That is crazy! That was why, in the past, only deep-pocketed companies could afford to train or pre-train models.
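If you want to sanity-check the arithmetic, here is the same back-of-envelope estimate as a few lines of Python, using a rough fp32-plus-Adam breakdown; the figures are approximate and exclude activation memory, which pushes the total toward the ~24 GB mentioned above:

```python
# Back-of-envelope memory estimate for full fp32 fine-tuning with the Adam optimizer.
params = 1e9               # 1B parameters
bytes_per_param = 4        # 32-bit float

weights   = params * bytes_per_param   # 4 GB model weights
gradients = params * bytes_per_param   # 4 GB gradients
adam_m    = params * bytes_per_param   # 4 GB Adam first moment
adam_v    = params * bytes_per_param   # 4 GB Adam second moment

total_gb = (weights + gradients + adam_m + adam_v) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~16 GB; activations add several more
```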
As LLMs boomed in 2023-2024, it was apparently faster and easier for big tech to throw money at training models than to work around hardware limitations first. There is still plenty of room for parallelizing training between CPU and GPU, and these techniques have only started making noise with the general audience this year. You might have heard that Meta waited until Llama-4 to release its first MoE models, a bit behind other companies. Mixture-of-Experts can be understood as an approximation and parallelization technique: the transformer’s feed-forward layer is split into many smaller expert networks that can be distributed across GPUs, and a router sends each token to only a few of them, as in the toy sketch below.
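To make the idea concrete, here is a toy top-1 MoE layer in PyTorch. This is purely a teaching sketch with made-up dimensions, not how Llama-4 or any production MoE is actually implemented:

```python
# Toy Mixture-of-Experts layer: a router picks one expert per token,
# so each token only pays for a fraction of the total parameters.
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                               # x: (tokens, d_model)
        expert_idx = self.router(x).argmax(dim=-1)      # top-1 routing decision per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])             # only the chosen expert runs
        return out

tokens = torch.randn(8, 64)
print(ToyMoE()(tokens).shape)  # torch.Size([8, 64])
```

In a real system each expert can live on a different GPU, which is what turns this approximation into a parallelization strategy.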
Nowadays, there are many libraries for fine-tuning, all built on PyTorch, each adding a layer of abstraction on top of the last. In the past, when I fine-tuned BERT, I wrote the training loop myself; when I saw an “Out of memory” error, I would manually implement a loop that clears the cache and retries. Nowadays there are so many different models and tasks, each requiring different separator tokens and padding techniques. Luckily, the training mechanics are now handled inside open-source libraries, so we can experiment quickly and worry less about low-level implementation. Llama-Factory is built on libraries like DeepSpeed ZeRO and FSDP, which can parallelize efficiently at large scale. Hugging Face offers a user-friendly library and ecosystem that handles small models well. Axolotl provides a layer of abstraction on top of Hugging Face to simplify the coding further, at the expense of some flexibility. Unsloth is a powerful library written in Triton that optimizes for NVIDIA GPUs and is “too good to be true but still true”! A comparison between Hugging Face, Llama-Factory and Unsloth is shown below. The general rule of thumb is to use Llama-Factory for training large models that require multiple GPUs, and Unsloth for small and medium models that run on a single GPU.
c) Model inference on a CPU
After fine-tuning, there are techniques to save and serve the models with less memory as well. A 7B model saved in GPTQ or AWQ quantization can be loaded with only about 5 GB of RAM for inference! vLLM can be used for inference across multiple GPUs (see the sketch below). There are C++ libraries that help serve large models on a CPU as well: Llama.cpp runs a wide range of models on CPU, and Microsoft’s bitnet.cpp quantizes models to 1.58-bit ternary weights (e.g., the BitNet b1.58 2B4T model) and takes the inference game to a whole new level: a 100B model can now be served on a CPU with 64 GB of memory at 6.68 tokens per second, on par with human reading speed. Personally, I see that a 64GB CPU instance costs about the same as a 16GB GPU instance on AWS, so I think the BitNet approach still has a long way to go. How about you? Have you tried these techniques before? Comment below if you have any thoughts!
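For instance, here is a minimal sketch of serving an AWQ-quantized 7B model with vLLM. The checkpoint name is an assumption; any AWQ or GPTQ checkpoint from the Hugging Face Hub can be dropped in the same way:

```python
# Sketch: loading an AWQ-quantized 7B checkpoint with vLLM for low-memory inference.
# The repo name below is an assumed example of a community-quantized model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # assumed quantized checkpoint
    quantization="awq",
    gpu_memory_utilization=0.9,
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Summarise this invoice in one sentence: ..."], params)
print(outputs[0].outputs[0].text)
```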
Final words
To sum up, models nowadays are getting much smaller, faster, and easier to fine-tune. Enterprises should start thinking seriously about identifying use cases for small models; otherwise, you could lose out in the AI race! Please invest in your own people and upskill your team rather than chasing third-party solutions. In the end, as technology changes so fast, the most relevant skill set will still be the age-old “problem-solving skill,” which includes a strong basis in science and engineering, a product mindset, and an aptitude for lifelong learning!
How about you? What do you think about the balance between “buy vs build”? What types of use cases have you found helpful in a business context? Please leave a comment below to share your experience.
About the Author
Hung Do is a Senior Full-Stack Machine Learning Engineer with a PhD and experience in multimodal data sources and real-time processing. She has led and delivered projects that produced sales uplifts and automated operational processes worth millions of dollars for businesses. Among her projects, she has developed AI digital assistants for Zurich Insurance, analyzed clinical data for major hospitals in Australia, and developed machine learning models for the EuroLeague, Europe’s premier basketball competition. She brings a wealth of international experience to her teams, having worked and studied in Singapore, France, Germany, Switzerland, and Australia.
