1. Introduction
In recent years, Large Language Models (LLMs) have emerged as powerful tools for various natural language processing tasks, leading to increasing adoption in diverse applications, including customer service and online marketing.
One particularly promising application is in automating responses to user inquiries on company pages across multiple platforms. This task presents a significant challenge as it requires understanding diverse questions, potentially in multiple languages, and generating accurate and informative responses. Manually addressing these inquiries can be resource-intensive, highlighting the need for efficient and effective automated solutions.
However, the effectiveness of LLMs in this context can vary significantly depending on their ability to handle multilingualism and complex reasoning. Different LLMs may exhibit varying strengths and weaknesses in understanding user demands, generating appropriate responses, and subtly conveying the company’s information.
This blog aims to provide valuable insights into the capabilities of LLMs for automating responses to user inquiries in a multilingual context. Our findings will be beneficial for businesses seeking to leverage LLMs for efficient and effective customer interaction on their online platforms.
2. Related work
Recent research has demonstrated that Large Language Models (LLMs) exhibit remarkable performance across various natural language processing tasks, even rivaling or surpassing human capabilities in certain domains (Brown et al., 2020; Radford et al., 2019). However, the accuracy and robustness of LLMs in real-world scenarios, particularly in specialized domains like industry and medicine, remain an open question.
In the medical domain, Zhu et al. (2024) demonstrated a significant performance gap between Chinese and Western-developed LLMs when answering questions related to Traditional Chinese Medicine (TCM). Their study revealed that Chinese LLMs achieved an average accuracy of 78.4% on the TCM medical licensing examination, while Western LLMs only reached 35.9%. This result highlights the impact of language and cultural bias on LLM performance, especially in domains requiring specialized knowledge and cultural understanding. For instance, Qwen-max (Alibaba), a Chinese LLM, achieved the highest accuracy of 86.4% while all Western models failed to meet the passing threshold of 60% accuracy. This suggests that LLMs trained predominantly on English data may struggle to understand and apply domain-specific concepts from different cultures and languages. [1]
In the industrial domain, Li et al. (2024) conducted a comprehensive empirical study on the accuracy and robustness of LLMs in Chinese industrial scenarios, using 1,200 domain-specific problems from 8 different industrial sectors. They found that the accuracy of all evaluated LLMs remained insufficient (below 60%) for deployment in industrial applications, with the highest accuracy reaching only 59% (GPT-4) and the lowest 33%. This study also revealed that the robustness of LLMs, assessed using a metamorphic testing framework on 13,631 questions, varied significantly across different abilities. Global LLMs exhibited higher robustness (averaging 0.76 for GPT-4) under logic-related variants, while advanced local LLMs performed better on problems involving Chinese industrial terminology. [2]
Furthermore, specialized benchmarks have been developed to evaluate the reasoning capabilities of LLMs in specific domains. For example, the MGSM (Multilingual Grade School Math) benchmark from Papers With Code [3] provides a collection of grade school math problems translated into 10 languages. This benchmark allows for the assessment of LLMs’ ability to understand natural language and perform logical reasoning in a multilingual mathematical context.
These studies emphasize the importance of considering cultural and linguistic contexts when evaluating and developing LLMs. For multilingual reasoning tasks, it is crucial to assess LLM performance across different languages and domains. Furthermore, it is necessary to investigate various factors that can impact LLM robustness, such as model architecture, training data, and fine-tuning strategies.
3. Experiment Methodology
3.1 Input data
At our company, SwapBrain, we provide a Chrome extension that helps people learn Chinese and Spanish more easily through dual subtitles. For this experiment, we assume 8 posts in Chinese from a Chinese learning group and 2 posts in Spanish from a Spanish learning group. We consider three critical aspects that make a post highly promising for marketing engagement:
- Study resources: Posts asking for recommendations on language learning resources (textbooks, apps, or online platforms) are ideal for introducing our app as a helpful tool.
- Learning challenges: Posts discussing specific challenges that learners face in mastering Chinese or Spanish can be a great source of engagement.
- Reactions and post length: Posts must have high engagement and must be at least 50 words long.
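As a rough sketch, this screening can be expressed as a simple filter. The keyword lists and the engagement threshold below are illustrative assumptions for the sketch, not the exact rules we applied:

```python
# Illustrative filter for marketing-relevant posts. The keyword lists and
# the reaction threshold are assumptions, not our exact selection rules.
RESOURCE_KEYWORDS = {"app", "book", "textbook", "recommend", "resource", "platform"}
CHALLENGE_KEYWORDS = {"struggle", "difficult", "hard", "challenge", "can't"}

def is_candidate(post_text: str, reaction_count: int,
                 min_words: int = 50, min_reactions: int = 20) -> bool:
    """Return True if a post looks promising for marketing engagement."""
    words = post_text.lower().split()
    # Length and engagement gates first.
    if len(words) < min_words or reaction_count < min_reactions:
        return False
    text = " ".join(words)
    asks_for_resources = any(k in text for k in RESOURCE_KEYWORDS)
    mentions_challenge = any(k in text for k in CHALLENGE_KEYWORDS)
    return asks_for_resources or mentions_challenge
```

A real pipeline would also need language detection and deduplication, but this captures the three aspects above.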
The list of posts is presented in the table below.
No. | Chinese Posts |
1 | 你好!我是Mikaela!I am looking for recommendations on books/apps/programs on how to learn Mandarin Chinese.Our goal is to visit Taiwan in about 6 years and I would like to be able to speak relatively well and understand relatively well by then!Any and all suggestions would be helpful.I understand some now and I know how to speak some now (very butchered in pronouncing) but can |
2 | Hi! Do you have any app recommendations that I can use in studying Chinese? I also learned that I can’t seem to really learn the language if I’ll just focus on pinyin, so I planned to study both pinyin and the characters at the same time. I love reading Chinese novels (but i only read the translated ones) |
3 | Hello everyone, please recommend me an app like Duolingo where learning chinese is free. |
4 | What is the fastest way to learn Chinese?There’s no magic formula for instantaneous fluency in Chinese but there are strategies to expedite the learning process. Let’s try some tips that may help you learn Chinese more efficiently 1. Immerse Yourself in Chinese: The fastest way to become proficient in Chinese is to immerse yourself in an environment where only Chinese is spoken. If anyone here is a fan of Lisa Blackpink, you’ve probably heard about her journey learning Korean. Lisa was placed in an environment where everyone communicated solely in Korean. From there, she must quickly become accustomed to listening and mimicking speech patterns. Similarly, with Chinese, being exposed to an environment where only Chinese is spoken is the best way to rapidly improve your Chinese proficiency. However, not all of us have the opportunity to go to China to study Chinese. Therefore, we need to create a Chinese language environment for ourselves. Listening to Chinese music, watching Chinese movies are things you can do to immerse yourself in Chinese. Or if those aren’t your interests, try reading Chinese books, or finding a Chinese friend to chat with daily. You can find a Chinese friend right in this group or check out an app I often use called HelloTalk. This app helps connect people to support each other through conversation, such as a Chinese person wanting to learn your native language.2. Focusing on high-frequency vocabulary: High-frequency vocabulary consists of words and phrases that are commonly used in everyday communication. By prioritizing these words, you can quickly acquire language skills that are immediately applicable in real-life situations. This efficiency maximizes the return on investment for time spent studying. You can start with the simplest level of words and gradually increase the difficulty. I find vocabulary lists categorized by HSK levels, ranging from HSK 1 to HSK 6, to be a convenient resource for learning. 
The lower the level of a word, the more frequently we encounter and use it in daily life. To effectively memorize these vocabulary lists, I study them through flashcards integrated with illustrated images, audio, and interactive games that guide me in writing Chinese characters. These flashcards are readily available, totaling around 5000 flashcards corresponding to the vocabulary from HSK 1 to HSK 6 in my Chinese vocabulary learning app called Mochi Chinese. It also helps me memorize what I’ve learned effectively by analyzing my study history and notifying me of optimal review times through reminder notifications sent to my phone.3. Practice regularly to create a habit: Consistency is key. Set aside dedicated time each day to practice Chinese. Even short, daily practice sessions can yield significant progress over time. Thanks to Mochi Chinese’s regular review reminders, I’ve been able to maintain a streak of over 50 consecutive days of studying. Typically, after about 21 days of repeated actions, we begin to form a new habit. Now, studying Chinese vocabulary has become very natural for me. There’s no need to muster up much motivation to start each day; it’s become a habit to simply open my phone and see if there are any new words to review or to choose a new lesson to learn.Create image by Linguisticjourney. |
5 | I need to pass HSK 3. But my Chinese is zero. Is it necessary to start my preparation from HSK1 level Chinese and so on or I can directly start my journey from HSK3 level. Tell me any free online resources where I can prepare my exam and if someone willing to learn English in return he/she will teach me Chinese. |
6 | As a beginner learning Mandarin Chinese, there are three things that you need to know:Characters and Pinyin: Learning Chinese characters and Pinyin is the foundation of learning Mandarin. Characters are the writing system of Chinese, while Pinyin is a system for representing the pronunciation of Chinese characters. Understanding characters and Pinyin can help you accurately read and write Chinese words and sentences.Basic grammar: Learning basic grammar in Mandarin Chinese is essential. You need to master the basic sentence structures and common sentence patterns, such as affirmative sentences, negative sentences, and interrogative sentences. This can help you construct basic sentences and understand the basic language rules of Chinese.Common vocabulary and phrases: Learning some common vocabulary and phrases can help you quickly enter the Chinese language environment and start using Chinese for simple communication. For example, learning numbers, time, greetings, colors, body parts, family relationships, etc.By learning these three aspects, you can lay a solid foundation for learning Mandarin and gradually master more Chinese knowledge. |
7 | Practice makes perfect. Iam learning to write Chinese characters now. Iam so sorry for my bad handwriting. The book I used is for beginners Chinese level. From the simplest to the most difficult characters. This book (book 1) consits of 131 characters (you can see the table of content). The book was published in Indonesia and the language used is also Indonesian, but if you’re interested to have the pdf file, I will make it and finish the English translation for you as soon as possible . Well, I just can say “NEVER STOP LEARNING AND NEVER GIVE UP”. |
8 | What is the most difficult language to learn in the world? 世界上最难学的语言是什么? shì jiè shàng zuì nán xué de yǔ yán shì shén me? |
No. | Spanish Posts |
9 | Hello, I’m an intermediate learner.I found that most of the teaching apps like Doulingo and Memrise are teaching words and solid expressions and they were really helpful in the pronunciation and vocab. But I still struggle in listening and building phrases on my own and breakdown complex grammer phrases or advanced tenses structures.I’m wondering if there is any app that will help me tackle this issues. ( It’s ok if it’s paid ) |
10 | ¿Cómo puedo aprender el idioma español fácilmente cuando no tengo a nadie con quien hablar y practicar español |
3.2 Prompt Design
After assembling the dataset, we need to design a prompt that lets the models understand our requirements so that they produce high-quality responses. In our view, a good response should meet the following criteria:
- It must be relevant to the post.
- It should read as naturally as possible.
- It should be polite when mentioning the company’s website and Facebook page.
- The response should be limited to 100-150 words.
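A minimal prompt template reflecting these criteria could look like the following. The wording, the `{company}` and `{website}` placeholders, and the example URL are our illustrative choices, not the exact prompt we used:

```python
# Illustrative prompt template; the exact wording we used may differ.
# The company name and website below are placeholders.
PROMPT_TEMPLATE = """You are a friendly community member replying to a post
in a language-learning group.

Post:
{post}

Write a reply that:
- is directly relevant to the post,
- reads as naturally as possible,
- politely mentions {company}'s website ({website}) and Facebook page,
- is between 100 and 150 words.
"""

def build_prompt(post: str, company: str = "SwapBrain",
                 website: str = "https://example.com") -> str:
    """Fill the template with the post and company details."""
    return PROMPT_TEMPLATE.format(post=post, company=company, website=website)
```

The same template is reused for every post, so differences in output quality come from the models rather than from prompt variation.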
3.3 Access Method
To test the models programmatically, we need API keys for each of them. However, some providers charge for API access, so in those cases we instead run the tests directly in the browser interface.
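For the models accessed via API, the calls follow the familiar chat-completions pattern. This is a sketch only: the model name is a placeholder, and an actual run requires the provider's SDK, a valid API key (e.g. via an environment variable), and network access.

```python
# Sketch of preparing a request for an OpenAI-compatible chat API.
# The model name and temperature are placeholder choices for illustration.
def build_request(prompt: str, model: str = "gpt-4o") -> dict:
    """Build the payload for a single-turn chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

# Actual call (requires the `openai` package and a valid API key):
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(**build_request("Hello"))
# print(reply.choices[0].message.content)
```

Other providers (Anthropic, Google, Alibaba) expose similar chat endpoints, so the same payload structure carries over with minor adjustments.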
4. Experimental Result
Before diving into the evaluation, we want to highlight that this assessment is inherently subjective. We could have employed automatic metrics such as accuracy, BLEU, F1 score, or perplexity. However, we chose human evaluation to make the results more relatable and understandable for the reader.
We will be evaluating responses based on four key criteria:
- Fluency – Is the response grammatically correct and fluent?
- Relevance – Does the response address the question or request in the post?
- Helpfulness – Does the response subtly convey information about our company?
- Compliance – Does the response satisfy the requirements of the prompt?
Each response is scored from 1 to 5 on each criterion, with the final score being the average over the 10 responses.
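Computing a model's final score per criterion is then a simple mean over the evaluated responses. A short sketch (the score values in the test data are placeholders, not our actual ratings):

```python
# Average per-criterion scores over the evaluated responses.
# Each dict maps a criterion name to a 1-5 human rating for one response.
def average_scores(scores_per_response: list[dict]) -> dict:
    """Return the mean score per criterion, rounded to one decimal place."""
    criteria = scores_per_response[0].keys()
    n = len(scores_per_response)
    return {c: round(sum(s[c] for s in scores_per_response) / n, 1)
            for c in criteria}
```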
| Model | Fluency | Relevance | Helpfulness | Compliance |
| --- | --- | --- | --- | --- |
| GPT-4o | 5.0 | 5.0 | 3.8 | 5.0 |
| Llama 3-70B | 5.0 | 3.9 | 3.2 | 5.0 |
| Gemini 1.5 Pro | 5.0 | 4.4 | 4.0 | 5.0 |
| Claude 3.5 | 5.0 | 5.0 | 3.3 | 5.0 |
| Qwen 2.5-72B | 5.0 | 5.0 | 3.1 | 5.0 |
Fluency: All models scored perfectly (5.0), showing that the state of generative AI in 2024 has reached a high standard in producing grammatically correct, fluent text across languages. This is largely due to large-scale pre-training on diverse multilingual datasets.
Relevance: GPT-4o, Claude 3.5, and Qwen 2.5 excel in consistently addressing the context of the prompts with highly relevant answers. Gemini 1.5 Pro also performs well but slightly trails in this area. Llama 3-70B, while strong, occasionally offers less specific answers, leading to a slightly lower relevance score.
Helpfulness: This is where the differences become more apparent. Gemini 1.5 Pro leads with a strong score (4.0) in incorporating subtle, company-related information into its responses, making it ideal for tasks involving brand promotion or delivering company-specific messaging. GPT-4o and Claude 3.5 also perform decently, but their focus is more on general information, which slightly reduces their effectiveness in offering promotional or context-specific content. Llama 3-70B and Qwen 2.5 score lower in this category, as their answers tend to be less aligned with subtle company-related integration.
Compliance: All models achieve perfect scores (5.0) for compliance, demonstrating that they consistently adhere to the requirements of the prompt. This shows the maturity of these models in understanding and meeting task expectations without deviating from the core instructions.
5. Conclusion
In summary, GPT-4o and Claude 3.5 stand out as highly reliable for general multilingual reasoning tasks, excelling in fluency and relevance. Gemini 1.5 Pro is the best choice when subtle company-related information needs to be integrated, offering a balance of fluency, relevance, and helpfulness. Qwen 2.5 and Llama 3-70B are strong performers, but their lower helpfulness scores make them better suited for tasks requiring broader, more neutral responses. Each model’s strengths can be leveraged depending on the specific requirements of a multilingual AI application.
References
[1] Zhu, L., Mou, W., Lai, Y., Lin, J., & Luo, P. (2024). Language and cultural bias in AI: comparing the performance of large language models developed in different countries on Traditional Chinese Medicine highlights the need for localized models. Journal of Translational Medicine, 22(1), 319.
[2] Li, Z., Qiu, W., Ma, P., Li, Y., Li, Y., He, S., Jiang, B., Wang, S., & Gu, W. (2024). An Empirical Study on Large Language Models in Accuracy and Robustness under Chinese Industrial Scenarios. arXiv preprint arXiv:2401.07529.
[3] MGSM Dataset – Papers With Code: https://paperswithcode.com/dataset/mgsm
Additional resources:
https://integrail.ai/blog/best-open-source-llm
https://www.datacamp.com/blog/top-open-source-llms
https://www.cybrosys.com/blog/5-top-open-source-llms-for-2024
https://blogs.novita.ai/beginners-guide-claude-3-5-api-vs-llama-3-1-405b-api