LLM for Romanian

Pre-training and fine-tuning of Large Language Models to obtain a foundation model for the Romanian language

February 2024 – February 2025
Research Partners:
  • BRD
  • Applied Data Science Center, University of Bucharest
  • National University of Science and Technology Politehnica Bucharest
Research Team:
  • Marius Popescu (co-Principal Investigator)
  • Traian Rebedea (co-Principal Investigator)
  • Dragoș Corlătescu
  • Mihai Dascălu
  • Alexandru Dima
  • Denis Ilie-Ablachim
  • Mihai Mașala
  • Ciprian Păduraru
  • Horia Velicu
  • Miruna Zăvelcă
ILDS Management Team:
  • Laurențiu Leuștean
  • Alin Ștefănescu
Description:

This project is part of a larger effort that aims at building a Large Language Model (LLM) for the Romanian language that can be adapted to a wide range of domains and use cases (i.e., a foundation model). As the approach for adapting to a specific domain, the larger project will focus on Retrieval-Augmented Generation (RAG), which combines information retrieval with text generation in order to provide more accurate and contextually relevant responses. As a use case, the focus will be on question-answering chat assistants. All of these require an LLM with strong capabilities.
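
To illustrate the retrieve-then-generate pattern behind RAG, the sketch below embeds a toy Romanian corpus, retrieves the passage closest to the question, and conditions a generator on it. The corpus, the embedding model, and the generator identifier (OpenLLM-Ro/RoLlama2-7b-Instruct) are assumptions chosen for the example, not the project's fixed setup.

    # Minimal RAG sketch: retrieve the closest passages, then generate an answer
    # conditioned on them. Corpus and model ids are illustrative assumptions.
    from sentence_transformers import SentenceTransformer, util
    from transformers import pipeline

    corpus = [
        "Bucureștiul este capitala României.",   # "Bucharest is the capital of Romania."
        "Dunărea se varsă în Marea Neagră.",     # "The Danube flows into the Black Sea."
    ]

    # 1) Retrieval: embed the corpus and the question, keep the most similar passages.
    embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
    corpus_emb = embedder.encode(corpus, convert_to_tensor=True)

    def retrieve(question, k=1):
        q_emb = embedder.encode(question, convert_to_tensor=True)
        hits = util.semantic_search(q_emb, corpus_emb, top_k=k)[0]
        return [corpus[h["corpus_id"]] for h in hits]

    # 2) Generation: condition the Romanian LLM on the retrieved context.
    generator = pipeline("text-generation", model="OpenLLM-Ro/RoLlama2-7b-Instruct")

    def answer(question):
        context = "\n".join(retrieve(question))
        prompt = f"Context:\n{context}\n\nÎntrebare: {question}\nRăspuns:"
        return generator(prompt, max_new_tokens=64)[0]["generated_text"]

    print(answer("Care este capitala României?"))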

Most LLM work focuses on English, and most open models (such as Llama 2) have very limited coverage (and implicitly limited capabilities) of less widely spoken languages such as Romanian. The challenging goal of this project is to create an effective LLM for Romanian. To this end, we will compile an extensive Romanian-language dataset combining web crawls, news, social media, and ebooks, and use it both for pretraining a model and for fine-tuning existing models (such as Llama 2) for Romanian. Another direction we will pursue is transfer learning by composing an anchor LLM with a smaller domain-specific augmenting model to enable new capabilities (for example, incorporating knowledge from the Romanian jurBERT into a Romanian LLM). We will also try to improve the tokenizer by extending the LLM's existing vocabulary with additional tokens and embeddings from Romanian BERT, thereby improving its encoding efficiency and semantic understanding.
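
A hedged sketch of the tokenizer-extension idea, using the Hugging Face transformers API: tokens from a Romanian BERT vocabulary that are missing from the Llama 2 tokenizer are added, and the embedding matrix is resized so the new rows can be trained during continued pretraining. The model identifiers are examples, not necessarily the checkpoints used in the project.

    # Sketch: extend the Llama 2 vocabulary with Romanian BERT tokens (example model ids).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    ro_bert_tok = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

    # Whole-word tokens present in Romanian BERT but absent from the Llama 2 vocabulary.
    new_tokens = [t for t in ro_bert_tok.get_vocab()
                  if not t.startswith("##") and t not in llama_tok.get_vocab()]
    num_added = llama_tok.add_tokens(new_tokens)
    print(f"Added {num_added} Romanian tokens")

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    # New embedding rows are randomly initialised and learned in further pretraining.
    model.resize_token_embeddings(len(llama_tok))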

For model evaluation and comparison, we will introduce a Romanian evaluation benchmark. To compare the performance of a Romanian LLM with that of LLMs in other languages, we opted to translate a relevant English evaluation benchmark. One such benchmark is MT-Bench, a set of challenging multi-turn open-ended questions for evaluating chat assistants. To automate the evaluation process, it prompts strong LLMs such as GPT-4 to act as judges and assess the quality of the models' responses. Other benchmarks we use to evaluate the quality of an LLM include ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8k. As all of these benchmarks are English-only, we translate all questions, both manually and with automated methods. This will allow us to properly evaluate the capabilities of an LLM specialized for Romanian.
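
To illustrate the LLM-as-judge step used in MT-Bench-style evaluation, the sketch below asks GPT-4 to grade a candidate model's Romanian answer on a 1-10 scale. The judging prompt is a simplified stand-in, not the official MT-Bench judge template.

    # Simplified LLM-as-judge sketch (not the official MT-Bench judge prompt).
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    JUDGE_TEMPLATE = (
        "You are an impartial judge. Rate the assistant's answer to the user's question "
        "on a scale of 1 to 10 and briefly justify the score.\n\n"
        "Question: {question}\nAnswer: {answer}\n\nRating:"
    )

    def judge(question, answer):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": JUDGE_TEMPLATE.format(question=question, answer=answer)}],
            temperature=0,
        )
        return response.choices[0].message.content

    print(judge("Care este capitala României?", "Capitala României este București."))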

Results:

The technical report may be found here: https://arxiv.org/abs/2405.07703

The model may be downloaded here: https://huggingface.co/OpenLLM-Ro

The underlying code may be downloaded here: https://github.com/OpenLLM-Ro

Press release (Romanian) for the launch of the first models

“Vorbeşti Româneşte?” A Recipe to Train Powerful Romanian LLMs with English Instructions (research paper)

OpenLLM-Ro — community website

Workshop: LLMs for Romanian (28 September 2024)
