Pre-training and fine-tuning of Large Language Models to obtain a foundation model for the Romanian language
February 2024 – February 2025
- Applied Data Science Center, University of Bucharest
- National University of Science and Technology Politehnica Bucharest
ILDS Management Team:
- Marius Popescu (co-Principal Investigator)
- Traian Rebedea (co-Principal Investigator)
- Mihai Mașala
- Ciprian Păduraru
- Horia Velicu
- Miruna Zăvelcă
This project is part of a larger effort to build a Large Language Model (LLM) for the Romanian language that can be adapted to a wide range of domains and use cases (i.e., a foundation model). As an approach for adapting to a specific domain, the larger project will focus on Retrieval-Augmented Generation (RAG), which combines information retrieval with text generation to produce more accurate and contextually relevant responses. As a use case, the focus will be on question-answering chat assistants. All of these require an LLM with strong capabilities.
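The RAG flow mentioned above can be sketched in a few lines: retrieve the passages most relevant to a query, then condition the generator on them. This is a minimal illustration only; the toy corpus, the word-overlap scoring (a stand-in for a real BM25 or dense retriever), and the prompt format are assumptions, not the project's actual pipeline.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Rank passages by simple word-overlap with the query
    # (illustrative stand-in for a real retriever).
    q_words = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda p: len(q_words & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    # Assemble the context-augmented prompt that would be passed to the LLM.
    context = "\n".join(f"- {p}" for p in passages)
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {query}\nAnswer:")

# Toy corpus (illustrative data only).
corpus = [
    "Bucharest is the capital of Romania.",
    "Romanian is a Romance language.",
    "The Danube flows into the Black Sea.",
]
query = "What is the capital of Romania?"
passages = retrieve(query, corpus)
prompt = build_prompt(query, passages)
```

Grounding generation in retrieved passages is what lets a single foundation model be adapted to many domains without retraining: only the document collection changes.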
Most LLM work focuses on English, and most open models (such as Llama 2) offer very limited coverage (and thus limited capabilities) for less popular languages such as Romanian. The challenging goal of this project is to create an effective LLM for Romanian. To this end, we will compile an extensive Romanian-language dataset combining web crawls, news, social media, and ebooks, and use it both for pretraining a model from scratch and for fine-tuning existing models (such as Llama 2) for Romanian. Another direction we will pursue is transfer learning: composing an anchor LLM with a smaller domain-specific augmenting model to enable new capabilities (for example, incorporating knowledge from Romanian jurBERT into a Romanian LLM). We will also try to improve the LLM tokenizer by extending its existing vocabulary with additional tokens/embeddings from Romanian BERT, thereby improving its encoding efficiency and semantic understanding.
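The tokenizer-extension idea can be made concrete with a small sketch: new Romanian tokens are added to the base vocabulary, and each new embedding is initialized from the mean of the embeddings of the subword pieces the base tokenizer would otherwise have produced (a common heuristic for initializing added tokens). The tiny vocabularies, the `subword_split` callback, and the mean-initialization choice below are illustrative assumptions, not the project's actual method or models.

```python
def extend_vocab(base_vocab: dict[str, int],
                 base_emb: dict[str, list[float]],
                 new_tokens: list[str],
                 subword_split) -> tuple[dict[str, int], dict[str, list[float]]]:
    # Add each new token to a copy of the vocabulary; initialize its
    # embedding as the mean of its base-tokenizer subword embeddings.
    vocab, emb = dict(base_vocab), dict(base_emb)
    for tok in new_tokens:
        if tok in vocab:
            continue  # already encodable as a single token
        pieces = subword_split(tok)   # how the base tokenizer splits it
        vecs = [emb[p] for p in pieces]
        mean = [sum(col) / len(vecs) for col in zip(*vecs)]
        vocab[tok] = len(vocab)
        emb[tok] = mean
    return vocab, emb

# Toy example: suppose the base tokenizer splits "mașină" (car) into
# two pieces, so adding it as one token saves a position per occurrence.
base_vocab = {"ma": 0, "șină": 1}
base_emb = {"ma": [1.0, 0.0], "șină": [0.0, 1.0]}
vocab, emb = extend_vocab(base_vocab, base_emb, ["mașină"],
                          subword_split=lambda t: ["ma", "șină"])
```

Fewer tokens per Romanian word means shorter sequences at both training and inference time, which is the encoding-efficiency gain the paragraph above refers to.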
For model evaluation and comparison we will introduce a Romanian evaluation benchmark. To compare the performance of a Romanian LLM with that of LLMs in other languages, we opted to translate a relevant English evaluation benchmark. One such benchmark is MT-Bench, a set of challenging multi-turn, open-ended questions for evaluating chat assistants. To automate the evaluation process, it prompts strong LLMs such as GPT-4 to act as judges and assess the quality of the models' responses. Other benchmarks we use to evaluate the quality of an LLM include ARC, HellaSwag, MMLU, TruthfulQA, WinoGrande, and GSM8k. As all these benchmarks are English-only, we translate (both manually and with automated methods) all questions. This will allow us to properly evaluate the capabilities of an LLM specialized for Romanian.
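The LLM-as-judge step used by MT-Bench-style evaluation can be sketched as two small pieces: building a grading prompt for the judge model, and parsing the numeric verdict from its reply. The `Rating: [[N]]` convention follows MT-Bench; the exact prompt wording is an illustrative assumption, and the API call to the judge model itself is deliberately left out.

```python
import re

def judge_prompt(question: str, answer: str) -> str:
    # Grading prompt sent to a strong judge model (e.g. GPT-4);
    # wording here is a simplified assumption, not MT-Bench's exact text.
    return (
        "Please act as an impartial judge and evaluate the quality of the "
        "response to the question below. Rate it on a scale of 1 to 10 and "
        'end with the verdict in the format "Rating: [[N]]".\n'
        f"Question: {question}\nResponse: {answer}"
    )

def parse_rating(judge_reply: str):
    # Extract the numeric verdict, e.g. 'Rating: [[8]]' -> 8;
    # return None when the judge produced no parseable verdict.
    m = re.search(r"Rating:\s*\[\[(\d+)\]\]", judge_reply)
    return int(m.group(1)) if m else None
```

Because the judge prompt is language-agnostic, the same scoring loop works unchanged once the benchmark questions have been translated into Romanian.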