Category: Tech

“This is the year of agents… and cost-effective fine-tuning!”

“2025 is the year of agents!”—this is what I heard repeatedly at cloud events. And as soon as I walked out of the conference rooms, I saw everybody rushing to build agents with “Einstein-level reasoning”. The models are getting smarter at an incredible speed, so why not simply make API calls to the cloud providers and let the big models upgrade themselves? But have you ever noticed that the more parameters these LLMs gain, the more they cost and the slower they become? Have you ever calculated the API cost and ended up with a MILLION-DOLLAR BUDGET???

Here is what cloud providers never tell you: this year has also seen fine-tuning techniques mature on an UNBELIEVABLY SMALL BUDGET (or even FREE for public use). It can be SUPER COST-EFFECTIVE to take a small pre-trained model and tune it further for a very prescriptive task, for example, classifying documents or extracting specific fields from a scanned invoice. These use cases produce responses in a preset format and are therefore not prone to harmful speech. The small size also means LESS MAINTENANCE EFFORT AND FASTER RESPONSE TIME. In this article, I will delve deeper into the cost analysis and then give examples of use cases and techniques for effective fine-tuning.

Figure: Cost-efficient model fine-tuning. (Generated by stabilityai/stable-diffusion-3.5-large)

The cost of bigger models is both time and money

The big LLM models are getting smarter at an unprecedented speed; everybody is tip-toeing to catch up all the time. As an enterprise user, the general advice nowadays is to focus on building applications that consume cloud-provided APIs and let the models evolve behind those APIs. There is no point competing with the big tech companies when it comes to training models. However, did you notice that LLM pricing has skyrocketed from $5/Mtoken to $15/Mtoken (for OpenAI’s o1)? If you have ever done a cost calculation for your company’s internal chatbot, that means many hundreds of thousands of dollars even on a low estimate. If the bot is used by “just” a few thousand internal employees who end up loving it so much that they keep bombarding it with questions, the cost will soon approach the low millions of dollars. Oh, and that is only for one application. If you have a few more AI products in the pipeline, be ready to pay millions more this year. So shall I say, “This is the year of million-dollar LLM bills”?
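As a back-of-the-envelope check (the headcount, usage pattern, and price below are illustrative assumptions, not a quote from any provider), the arithmetic looks roughly like this:

```python
# Rough annual API cost estimate for an internal chatbot.
# All numbers are illustrative assumptions.
employees = 3_000
queries_per_employee_per_day = 10
tokens_per_query = 4_000         # prompt + completion
price_per_million_tokens = 15.0  # USD, e.g. a premium reasoning model
working_days = 250

annual_tokens = employees * queries_per_employee_per_day * tokens_per_query * working_days
annual_cost = annual_tokens / 1_000_000 * price_per_million_tokens
print(f"~{annual_tokens / 1e9:.0f}B tokens/year, ~${annual_cost:,.0f}/year")
# -> ~30B tokens/year, ~$450,000/year -- and that is just one application.
```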

What cloud providers will never tell you is that bigger models mean more thinking and more parallelization overhead, which leads to even longer waiting times (oh, unless you can pay more!). Under normal conditions, a large model may take 10 seconds to read a long context and give an answer; but when there are hundreds of people using it at the same time… ollama! Before you realize it, your chatbot experience becomes a one-sided conversation, while your batch processing pipeline keeps failing overnight because of “Token limit exceeded.” A good tip for companies: before engaging with an external LLM provider, ask them how many tokens or requests per minute they can commit to at the listed cost, because, as we all know, cheaper is not always better!

Understand when to buy…

Last year, when GPT-3 was evolving into GPT-4 and then 4o/4o-mini, enterprises were generally advised to avoid fine-tuning, because hosting a fine-tuned model could cost 40,000 USD per year, plus the extra training time and resources. We focused instead on developing RAG chatbots. We tried to optimize chunking and ranking techniques to improve search accuracy. Retrieval mattered more because, as long as we could find the correct documents, the models were usually smart enough to provide near-accurate answers. In fact, the models are getting smarter at an incredible speed, so many companies nowadays intend to call LLMs through APIs and simply switch to the new model whenever one is released.

Another reason for not hosting our own models is guardrails and security. In general, if you have a Q&A application, you have to screen the models for all sorts of unethical answers, plus industry guidelines like “do not give tax advice” (or do not give out hints on hacking our company!). By outsourcing the model hosting, we can outsource that liability while waiting for the big tech companies to build better moral constraints into their big models.

And when to build

According to the OpenAI Agents SDK guidelines, we should try using small models for simpler tasks. Not every task requires the most intelligent model; for example, a simple retrieval or intent-classification task can be completed by a smaller, faster model, while a more complex task, such as deciding whether to issue a refund, might require a more capable model. To find out whether smaller models still produce acceptable results, test them. This way, we can determine where smaller models succeed or fail without pre-constraining the agent’s capabilities.
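As a toy illustration of this principle (the task names, model names, and routing table below are placeholders I made up; they are not part of the OpenAI Agents SDK):

```python
# Route simple tasks to a small, cheap model and escalate only when needed.
MODEL_BY_TASK = {
    "intent_classification": "small-fast-model",
    "document_retrieval":    "small-fast-model",
    "refund_decision":       "large-capable-model",
}

def pick_model(task: str) -> str:
    # Default to the small model; escalate only for tasks that tests show need more.
    return MODEL_BY_TASK.get(task, "small-fast-model")

print(pick_model("intent_classification"))  # small-fast-model
print(pick_model("refund_decision"))        # large-capable-model
```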


In addition, to alleviate the need for moral guardrails, we can choose use cases where the response follows a preset template or simple format. There are many techniques to force the model to generate tokens that satisfy certain constraints. Using structured-generation libraries such as Guidance or Outlines, we can easily limit the response to follow a regex pattern or a JSON schema with pre-determined field names. This way, the model does not have enough creative free text to produce harmful speech.
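For example, here is a minimal sketch of constrained generation with the Outlines library (assuming the pre-1.0 Outlines API; the model name and the invoice schema are illustrative assumptions):

```python
from pydantic import BaseModel
import outlines

# The response is forced to be JSON that validates against this schema,
# so there is no room for free-text (and potentially harmful) output.
class Invoice(BaseModel):
    vendor: str
    invoice_number: str
    total_amount: float

model = outlines.models.transformers("microsoft/Phi-3.5-mini-instruct")
generator = outlines.generate.json(model, Invoice)

result = generator("Extract vendor, invoice number and total from: ACME Corp, INV-1042, $513.20")
print(result)  # Invoice(vendor='ACME Corp', invoice_number='INV-1042', total_amount=513.2)
```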


Your toolkit for cost-efficient LLM tuning and hosting

a) Small models

Nowadays, even the small models are getting smarter as they are trained on trillions of tokens; some outperform the big models of the past. Microsoft’s Phi-3.5-mini, with roughly 4B parameters, outperforms Mistral-Nemo-12B, Llama-3.1-8B and Gemma-2-9B, models two to three times its size. Qwen2.5-VL-3B is ranked 6th on the leaderboard for Document Visual Question Answering, and Snowflake’s Arctic-TILT, at 0.8B parameters, sits around the 10th spot, amidst far bigger models.

b) Fine-tuning on a 16GB GPU

If we want to invest a bit more time in fine-tuning, small models of up to about 8B parameters can now easily be tuned on a single 16GB GPU, which costs less than $1/hour on AWS (roughly $200/month). A Llama-3.2-3B model being fine-tuned occupies only about 4 GB of memory using the QLoRA technique, and about 3 GB using the Unsloth library.
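For a sense of what that looks like in code, here is a minimal QLoRA setup with Hugging Face transformers, peft, and bitsandbytes (the model name and LoRA hyper-parameters are illustrative, not a tuned recipe):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit (NF4) weights -- this is what shrinks a 3B model
# down to a few GB of GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B", quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

# LoRA trains only small low-rank adapter matrices; the 4-bit base stays frozen.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```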

For perspective, in the past each parameter was represented by a 32-bit floating-point number. To load 1 billion parameters, it took 4 gigabytes of memory (= 1B parameters × 32 bits/parameter ÷ 8 bits/byte). To fine-tune a model, one needs additional memory for the gradients and optimizer states (see the figure below). All together, it takes about 24 GB of GPU memory just to train a model with 1B parameters! That is crazy! That was why, in the past, only deep-pocketed companies could afford to train or pre-train models.

Figure: In the past, without optimization techniques, a 1B-parameter model in full precision will cost 4GB of GPU memory for inference, and 24GB of memory for fine-tuning.

As LLMs boomed in 2023–2024, it was apparently faster and easier for big tech to throw money at training models before optimizing around hardware limitations. There is still plenty of room for parallelizing training between CPU and GPU, and these techniques have only started making noise with the general audience this year. You might have heard that Meta took until Llama 4 to release its first MoE models, a bit behind other companies. Mixture-of-Experts can be understood as an approximation and parallelization technique: the transformer’s feed-forward layer is split into several smaller “expert” networks that can be distributed across different GPUs, and a router sends each token to only one or a few experts.
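To make the idea concrete, here is a toy top-1 Mixture-of-Experts layer in PyTorch (the sizes and routing are deliberately simplified; real MoE layers add load balancing and run experts on separate devices):

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """A feed-forward layer split into experts; a router picks one expert per token."""
    def __init__(self, d_model=64, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                            # x: (tokens, d_model)
        expert_idx = self.router(x).argmax(dim=-1)   # top-1 routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():                           # only the chosen expert runs
                out[mask] = expert(x[mask])
        return out

print(ToyMoE()(torch.randn(10, 64)).shape)           # torch.Size([10, 64])
```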

Nowadays, there are many libraries for fine-tuning, all based on PyTorch, each adding its own layer of abstraction. In the past, when I fine-tuned BERT, I used to write the training loop myself. When I saw an “Out of memory” error, I would manually implement a loop that cleared the cache and retried. Nowadays, there are so many different models and tasks, each requiring different separator tokens and padding techniques. Luckily, the training mechanics are now handled inside open-source libraries, so we can experiment quickly and worry less about low-level implementation. Llama-Factory is built on libraries like DeepSpeed ZeRO and FSDP, which can parallelize efficiently at large scale. Hugging Face offers a user-friendly library and ecosystem that handles small models well. Axolotl provides a layer of abstraction on top of Hugging Face to simplify the coding further, at the expense of some flexibility. Unsloth is a powerful library written in Triton that optimizes for NVIDIA GPUs and is “too good to be true but is still true”! A comparison between Hugging Face, Llama-Factory and Unsloth is shown below. The general rule of thumb is to use Llama-Factory for training large models that require multiple GPUs, or Unsloth for small and medium models that run on a single GPU.

Figure: A comparison between Hugging Face, LlamaFactory and Unsloth
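For comparison in code, the same 4-bit LoRA setup through Unsloth’s higher-level interface is only a few lines (a sketch assuming Unsloth’s FastLanguageModel API; the model name and LoRA rank are illustrative):

```python
from unsloth import FastLanguageModel

# Load a 4-bit base model and attach LoRA adapters in two calls.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# From here the model can be passed to trl's SFTTrainer like any other PEFT model.
```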

c) Model inference on a CPU

After fine-tuning, there are techniques to save and serve the models with less memory as well. A 7B model saved with GPTQ or AWQ quantization can be loaded with only about 5GB of RAM for inference! vLLM can be used for multi-GPU inference. There are C++ libraries that help serve large models on CPUs as well: llama.cpp runs a wide range of models on CPU, and Microsoft’s bitnet.cpp quantizes models to 1.58-bit ternary weights (the BitNet b1.58 2B4T model) and takes the inference game to a whole new level: a 100B model can now be served on a CPU with 64GB of memory at 6.68 tokens per second – on par with human reading speed. Personally, I see that a 64GB CPU instance costs about the same as a 16GB GPU instance on AWS, so I think the BitNet library still has a long way to go. How about you? Have you tried these techniques before? Comment below if you have any thoughts!
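As a small example of CPU serving, here is a sketch using llama-cpp-python with a GGUF-quantized model (the file path, quantization level, and thread count are illustrative assumptions):

```python
from llama_cpp import Llama

# A 4-bit GGUF file of a small model fits comfortably in a few GB of RAM.
llm = Llama(
    model_path="models/llama-3.2-3b-instruct-Q4_K_M.gguf",
    n_ctx=4096,    # context window
    n_threads=8,   # CPU threads
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Classify this document: ..."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```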

Final words

To sum up, models nowadays are getting so much smaller, faster, and easier to fine-tune. Enterprises should start seriously identifying use cases for small models; otherwise, you could lose out in the AI race! Please invest in your own human resources and upskill your team rather than chasing third-party solutions. In the end, as technology is changing so fast, the most relevant skill set will still be the age-old “problem-solving skill,” which includes a strong basis in science and engineering, a product mindset, and an aptitude for lifelong learning!

How about you? What do you think about the balance between “buy vs. build”? What types of use cases have you found helpful in a business context? Please leave a comment below to share your experience.

About the Author

Hung Do is a Senior Full-Stack Machine Learning Engineer (PhD) with experience in multimodal data sources and real-time processing. She has been leading and delivering projects that produce sales uplift and automate operational processes worth millions of dollars for businesses. Among her projects, she has developed AI digital assistants for Zurich Insurance, analyzed clinical data for major hospitals in Australia, and built machine learning models for the EuroLeague, Europe’s premier basketball competition. She brings a wealth of international experience to her teams, having worked and studied in Singapore, France, Germany, Switzerland, and Australia.

Chinese Language Learning app

The best way to learn a language is by imitation. So with PinyinTube, you can enjoy your favorite movie while learning any language. 

This instruction video shows how to add PinyinTube to your Chrome extensions.

Learn More

Apply Quantum Clustering to preprocess movies subtitles

Introduction

PinyinTube is a Chrome extension that allows users to enjoy their favorite movies while learning languages immersively at the same time. In addition to displaying dual subtitles, the app also allows users to pause the video and replay the conversation sentence by sentence to practice speaking alongside the actors.

During the development of this application, we encountered difficulties with subtitle alignment and with multiple actors speaking at the same time. In addition, there are unimportant subtitles that describe noise or actions, which we will henceforth refer to as “background subtitles”. Having noticed this issue, we took several measures to rectify it. To realign the subtitles, we can implement a highly efficient phoneme alignment model that uses a two-layer LSTM-RNN architecture [1]. Before that, however, we have to pre-process the subtitles to remove background subtitles on the fly while the user is watching the movie. To do this, we employed clustering methods capable of grouping similar subtitles in real time. Clustering is highly effective at eliminating outliers and accurately labeling subtitles that belong to the same cluster. Through this methodology, we were able to significantly improve the quality of subtitle alignment and enhance the overall user experience of our application.

Quantum Clustering

Clustering algorithms aim to partition a dataset into groups or clusters, where data points within the same cluster are more similar to each other than to those in other clusters. Quantum clustering algorithms are a type of algorithm that leverage the principles of quantum mechanics to perform clustering tasks. The idea behind these algorithms is to exploit quantum phenomena, such as superposition and entanglement, to enhance the efficiency and effectiveness of clustering.


Quantum clustering algorithms are an active area of research within the field of quantum computing and have the potential to offer advantages over classical clustering algorithms in terms of computational efficiency and accuracy, particularly for large and complex datasets. Since quantum hardware is still limited in processing power and requires better error-correction techniques, in this project we focus on quantum-inspired clustering algorithms that can run on classical computers.

A popular conventional approach is based on the Parzen window estimator, where every data point is associated with a Gaussian kernel to approximate the probability density function. There is only one parameter: the width (sigma) of the Gaussian function. In quantum clustering, by contrast, every data point is associated with a vector in a Hilbert space. By applying the Schrödinger equation, we can solve for the potential function. It has been shown that this quantum potential reveals the underlying structure of the data, where its minima indicate the centers of the clusters [2].
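As a rough illustration of the idea in [2] (a minimal sketch; sigma and the toy data are arbitrary choices, and the potential is computed only up to an additive constant):

```python
import numpy as np

def quantum_potential(points, x, sigma=0.5):
    """Quantum potential V(x) from the Parzen wave function psi(x) = sum_i exp(-|x-x_i|^2 / 2 sigma^2)."""
    sq = np.sum((points - x) ** 2, axis=1)     # squared distance to every data point
    weights = np.exp(-sq / (2 * sigma ** 2))   # Gaussian kernel contributions
    psi = weights.sum()                        # wave function value at x
    # Up to a constant, V(x) = (1 / (2 sigma^2 psi)) * sum_i |x - x_i|^2 exp(-|x - x_i|^2 / 2 sigma^2)
    return float((weights * sq).sum() / (2 * sigma ** 2 * psi))

# Toy 2-D data: two blobs. V is low near the blob centres and higher in between.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
print(quantum_potential(pts, np.array([0.0, 0.0])))  # small: near a cluster centre
print(quantum_potential(pts, np.array([1.5, 1.5])))  # larger: between the clusters
```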

In the experiments, the datasets used are either text documents or oral conversation sentences. To extract the feature vector for each data point, the χ² (chi-squared) score is first calculated, then Principal Component Analysis (PCA) is used to reduce the feature vector to just two dimensions. The F1 score is used to combine the precision and recall metrics. To estimate the parameter sigma, instead of the popular statistical approach based on k-nearest neighbors (KNN), a simple method called pattern search is deployed. In the experiments, quantum clustering shows a higher F1 score than the traditional clustering method in identifying the topics of the different text documents. Additionally, the model can be applied to identify clusters of different writers of literary documents [3].
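A rough sketch of that preprocessing with scikit-learn (the placeholder subtitle lines, labels, and number of selected features are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA

docs = ["what are you doing here", "leave this place now",
        "(door creaks)", "(ominous music playing)"]   # placeholder subtitle lines
labels = [0, 0, 1, 1]                                 # placeholder: dialogue vs background

counts = CountVectorizer().fit_transform(docs)        # bag-of-words counts
k = min(10, counts.shape[1])
selected = SelectKBest(chi2, k=k).fit_transform(counts, labels)   # chi-squared scoring
features_2d = PCA(n_components=2).fit_transform(selected.toarray())
print(features_2d.shape)                              # (n_subtitles, 2)
```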

Based on [3], we are confident that quantum clustering can be used to cluster movie subtitles by type and actor source. After clustering, we can eliminate the clusters with few members, which are probably the “background subtitles,” and keep only a few large clusters. The remaining clusters represent subtitles from the few main characters. The labels from clustering can then be carried forward to the next AI model, which extracts the actor’s voice from the background.

References

[1] K. Schulze-Forster et al., “Joint Phoneme Alignment and Text-Informed Speech Separation on Highly Corrupted Speech”, conference proceedings of ICASSP 2020.

[2] D. Horn and A. Gottlieb, “Algorithm for Data Clustering in Pattern Recognition Problems Based on Quantum Mechanics”, Physical Review Letters, 2002.

[3] Ding Liu et al., “Analyzing Documents with Quantum Clustering: A Novel Pattern Recognition Algorithm Based on Quantum Mechanics”, Pattern Recognition Letters, 2016.

Learn More

The AI models that power PinyinTube’s voice and subtitle extraction.

Introduction

PinyinTube is an exceptional tool that offers an unparalleled language learning experience through movies. This innovative Chrome extension promises to make language learning unforgettable by providing an immersive learning experience. The dual-subtitles feature ensures that learners of all levels can grasp the content easily. Moreover, the romanized Chinese pronunciation (Pinyin) makes it easy for users to learn how to speak Chinese words like a native. The interactive design allows users to pause and replay the content as much as they want, making it a perfect tool for practice sessions alongside the actors. If you’re passionate about taking your language learning up a notch, be sure to upgrade to the PRO version, which makes it possible to record your voice, compare your tone and pronunciation to those of the native actors, and track your progress. Learning a new language has never been this fun!

Figure: Misaligned audio in the app (screenshot).

While creating this extension, we had to overcome multiple hurdles that seemed daunting initially. However, we plan to surmount them thanks to our technical expertise and the remarkable capabilities of AI. One of the most significant issues we encountered was that the subtitles were often misaligned with the actual audio being played, making it exceedingly tough to replay individual sentences precisely. Additionally, we noticed that the actor’s voice was often lost in background noise and music, coupled with multiple actors speaking simultaneously or mumbling, which made it even trickier to extract their voice and match it with the user’s recorded voice. These challenges could have hindered our ability to deliver the best possible output; however, our team was undeterred and instead chose to deploy a series of cutting-edge AI models, arranged in a carefully drafted sequence:

AI roadmap

– First, the technique of Quantum Clustering is used to group together different types of speech in the subtitles, such as the dialogue of different characters, background noise, and general descriptions. This clustering allows filtering to be applied only to the speech of the main characters. In particular, we apply the method introduced by Ding Liu in 2016 [1].

– Secondly, the voices of the main characters are aligned with the corresponding subtitles through a phoneme method developed by Schulze-Forster in 2020 [2].

– Using the labelled clusters and subtitles, the voices of the main actors can be separated from the background noise by applying the text-informed sound separation method developed by Kevin Kilgour and others from Google Research in 2022 [3].

– However, this may often result in corrupted and unclear audio. To enhance the audio, generative AI techniques, developed by Pascual in 2017 [4], are suggested.

Due to the technical nature of the above topics, we will write separate blog posts to discuss the in-depth technical details. Please follow the hyperlink on each topic to go to the corresponding pages.

References

[1] Ding Liu et al., “Analyzing Documents with Quantum Clustering: A Novel Pattern Recognition Algorithm Based on Quantum Mechanics”, Pattern Recognition Letters, 2016.

[2] Kilian Schulze-Forster et al., “Joint Phoneme Alignment and Text-Informed Speech Separation on Highly Corrupted Speech”, conference proceedings of ICASSP 2020.

[3] Kevin Kilgour et al., “Text-Driven Separation of Arbitrary Sounds”, conference proceedings of Interspeech 2022.

[4] Santiago Pascual et al., “SEGAN: Speech Enhancement Generative Adversarial Network”, conference proceedings of Interspeech 2017.

Learn More

Application of Large Language Models (LLM) to subtitle alignment and actor’s voice isolation

PRICE SLASHED! New Year Promotion (15 Jan – 15 Feb): Premium Version now $5 (was $10)
DOWNLOAD FREE VERSION NOW

In real movies, the subtitles are usually misaligned with the actor’s actual voice. In addition, the actor’s voice is often masked by heavy background noise and music. It was a challenge for our application to replay the actor’s voice exactly at the subtitle of interest, or to score the user’s voice against the actor’s. In the PinyinTube PRO version, we apply deep machine learning models to align the subtitles and extract the voice from the background. This was done by applying the cutting-edge research paper “Joint Phoneme Alignment and Text-Informed Speech Separation on Highly Corrupted Speech” by Schulze-Forster et al. However, since the paper was written in 2020 and some of the code was outdated, we updated the code and made it available to the public in our GitLab repository here. Please feel free to clone our work and send us your feedback through any of these channels:
– Write a comment on this blog.
– Send a message using our contact form.
– Write a thread on our forum.
– Send an email to our address admin@swapbrain.com


Learn More

PinyinTube: Beyond a traditional translator

In the previous blog post, we introduced PinyinTube, its purpose, and how it helps SwapBrain users not only enjoy the best Chinese movies and videos but also use them as a Mandarin learning tool. In this post, we will go deeper into the technology behind PinyinTube and the roles it plays.

1. A good listener

First, let’s dive into how PinyinTube records the actors’ voices and other audio snippets. Unlike Google Live Caption, which requires an inherent transcript set for each video, PinyinTube can access the microphone of each user’s computer and record from the microphone input. When users press the microphone button, their voice is recorded and stored in the software’s storage. This way, users can play back their voice and compare it with the actor’s voice. This can be done multiple times, allowing users to improve their speaking skills. PinyinTube’s live caption is built on a JavaScript API called the Web Speech API. Once included in the extension’s JavaScript code, this API enables PinyinTube to collect live audio from the users.

Figure: The PinyinTube record button. Record your voice and replay it to compare with the actor's voice.

2. A good translator

Now we have partially understood how PinyinTube retrieves audio snippets and converts them into text. Yet how can those words be converted into other languages or scripts, such as Mandarin to Pinyin or Mandarin to English? Interestingly, this all starts with how human beings pick up a new language.

Learning a new language demands time and effort. The first few months prepare learners to take part in basic or intermediate conversations. With further learning and relentless practice, after a year or more, fluent daily communication with native speakers should not be a problem. This incredible progress is thanks to our neural system’s training and development. As we maintain frequent language exposure, our brain begins to “adapt” to new words, expressions, and phonetics. This process inspired the design of one of the most advanced technologies today: the Deep Neural Network.

A Deep Neural Network is a Machine Learning technique inspired by the biological operation of the human brain through its billions of neuron cells. Thousands of companies have recently applied Machine Learning to their tech products, and SwapBrain is no exception. As an AI/automation consulting company, we focus not only on AI services but also on deep neural network application development. PinyinTube will be our firstborn product using Machine Learning technology as a great translation tool. To do that, the extension processes the caption input, pushes it through the multiple layers of cells in the neural network, and computes the translated output before printing it to the screen. This is the whole process by which PinyinTube translates Chinese/English subtitles into the user’s preferred language.

Figure: Deep Neural Networks are the future of SwapBrain.

3. A good supporter

Besides being a live translator, PinyinTube has some other notable features that SwapBrain customers can take advantage of at no cost. Once installed and activated, the extension will automatically translate Chinese or English audio into your desired language. The caption is shown right above the video’s subtitles, along with multiple interactive popup buttons that serve your needs. As you may not know, PinyinTube is uniquely built to link each audio sentence with its live caption. Thus, you can both hear a sentence spoken by the actor and read the translation simultaneously. Additionally, you can click the forward/backward buttons to see the next/previous sentence; PinyinTube will jump to the corresponding fragment of whichever sentence you choose. For users who wish to use the extension as a learning tool, the playback speed is adjustable to match your reading and listening level.

PinyinTube will also be a good language mentor. Isn't that awesome?

However, more advanced features, such as “Anywhere Captions” and “Voice Comparison”, can only be accessed through a Pro subscription. In particular, there is an extra charge to use the extension on streaming websites other than YouTube and Netflix. With Voice Comparison, PinyinTube can also record your own audio to be tested and corrected against the actor’s voice. This helps you compare your phonetics with the actor’s standard voice in the video so that you can improve your language skills. Although PinyinTube is accessible to everyone, we believe spending a little extra would be largely beneficial for customers who are interested in learning Chinese or English to enrich their multilingual base.

4. What are your thoughts?

Now you have seen the set of stunning work we have been doing to bring you, our precious customers, the best version of a live translator and learning assistant. In the next few months, we are going to publish PinyinTube to the Chrome Web Store and serve you the first product that we have poured our best into. If you want to experience our MVP, please enter your email at the bottom of this page. If you have any feedback or suggestions about the extension’s features, please let us know by commenting below or sending us a message via our contact page. Any comment will be a great contribution to the improvement and growth of SwapBrain and PinyinTube.

Let’s keep blogging and cheers!

Learn More

PinyinTube: A promising rise of a translator

1. From daily hobby to a start-up project

SwapBrain’s CEO and founder, Ms. Hung Do, has long had an interest in learning natural languages. Apart from her mother tongue, Vietnamese, she speaks English, French, and German considerably well. She also picked up basic Chinese during her time in Singapore, and it is now her favorite foreign language.

Throughout the years, Ms. Do has developed her most vital skill: Machine Learning system design. She has built various real-time Machine Learning applications that are widely used across many industries. This learning and working experience gave her insight into the future of Machine Learning, which she combined with her hobby of learning Chinese. It dawned on her to build a startup that develops apps in-house, something that could be extremely helpful in the near future as AI dominates the market. When the idea became reality, PinyinTube was the company’s firstborn product.

2. Streaming experience refresher

PinyinTube is a promising Chrome extension that can translate Chinese subtitles into other languages for international viewers who have limited English reading skills, and vice versa. For those who know Chinese only in its Pinyin form, PinyinTube, as the name suggests, also runs a Pinyin caption right below the English subtitle. Hence, the application specifically targets both the Chinese- and English-speaking communities.

Figure: Pinyin can link a Mandarin character with its pronunciation and tone.

According to CEO Do, nowadays every single video on any streaming platform has English subtitles, regardless of its origin. This is because English is the most popular language in the world. Yet this hinders non-English speakers from enjoying their streaming time. Chinese viewers are heavily affected by this drawback, since only 1% of them speak English. Strangely, their native language, Mandarin, which remains the second most popular globally, receives far less attention from dedicated subtitling applications. It is surprisingly uncommon to see Mandarin on the subtitle option list, not to mention Pinyin. This is hard to explain, as it does not require much effort to include Pinyin in the subtitle list, and it is a friendlier romanized version of Mandarin for non-native speakers. Therefore, PinyinTube was created to solve the problem.

With PinyinTube, a movie lover like yourself can widen your video-watching options. Once installed in your web browser, it will bring you a whole new experience: you can understand Chinese videos with both English and Pinyin captions at the bottom, without having to pause and translate Chinese captions manually. Even for English-based videos, Pinyin still appears on your screen to help you follow along more easily.

3. What are your thoughts?

PinyinTube is a project that we, the SwapBrain team, have been investing time and resources in, hoping to break down the barrier between Chinese and English learners. Moreover, we are expanding the languages the project covers beyond Pinyin. Soon, we will add more languages such as French and German to the options list, so users can choose whichever they prefer for the best streaming experience. When that dream comes true, we may have to give PinyinTube a new name, since the current one will no longer fit ^^.

As a startup, your feedback is extremely important for us to shape our project to fit your needs. Do you have any suggestions on what this extension could do? Please leave a comment below and we will be right here to listen to you. Once more, please visit the landing page for more information and await our debut.

Cheers and let’s keep blogging!

Learn More
