Negar Foroutan

About Me

I am a PhD student at the School of Computer and Communication Sciences (IC), EPFL, Switzerland. I am a doctoral research assistant in the NLP and LSIR labs under the supervision of Prof. Antoine Bosselut and Prof. Karl Aberer. Currently, I am doing a 4-month research internship at Google Research, Zurich.

NEW: I’m on the job market, looking for a position starting in the Fall of 2025.

Research Interests

My research interests broadly encompass natural language processing (NLP) and machine learning, with a particular focus on improving the multilingual capabilities of large language models (LLMs), especially in low-resource settings. I work across the entire pipeline of training multilingual LLMs, including:

Pretraining Data Construction: Language identification, data filtering, and preprocessing to ensure high-quality datasets.
Multilingual Data Mixtures: Designing effective data strategies for balanced language representation.
Language-Aware Tokenization & Architectures: Developing multilingual tokenizers and LLMs that better handle low-resource languages.
Robust Multilingual Evaluation: Curating robust benchmarks to assess model performance across languages.

Education

PhD in Computer & Communication Sciences | EPFL, Switzerland, 2019 - 2025 (expected)

Thesis: Scaling Multilinguality: Addressing Low-Resource Language Limitations in Large Language Models
Advisors: Prof. Antoine Bosselut and Prof. Karl Aberer

MSc in Computer Engineering (Artificial Intelligence) | Shiraz University, Iran, 2013 - 2016

Thesis: Discovering the Hidden Structure of a Social Network: A Semi Supervised Approach
Advisor: Prof. Ali Hamzeh

BSc in Computer Engineering (Software Engineering) | Shiraz University, Iran, 2009 - 2013

Work Experience

Research Intern | Google Research, Zurich, Switzerland [January - April 2025]

Working on a project to optimize long-context inference, improving LLMs' efficiency in processing and understanding extended inputs.

Doctoral Research Assistant | EPFL, Lausanne, Switzerland [2019 - Present]

Contributed to multiple research projects on multilingual LLMs, covering the entire training pipeline. Led the multilingual effort within the SwissAI initiative. Collaborated on projects with Google, Cohere, and HuggingFace. Supervised junior researchers and summer interns. Served as a teaching assistant for several courses.

Scientific Assistant | Machine Learning and Optimization Laboratory, EPFL, Lausanne, Switzerland [Sept. 2018 - Sept. 2019]

I was involved in the mlbench project, a benchmark framework for distributed machine learning.

Research Intern | Data Analytics Laboratory, ETH, Zurich, Switzerland [May - July 2017]

As an intern in Thomas Hofmann's lab working under the supervision of Carsten Eickhoff, I worked on a modular, patient-centric information retrieval system designed for precision oncology applications. The result of the project was a submission to the TREC 2017 Precision Medicine track.

Research Intern | Max Planck Institute for Software Systems, Kaiserslautern, Germany [February - April 2017]

As an intern under the supervision of Manuel Gomez Rodriguez, I worked on a project analyzing the dynamics of citation networks: quantifying the value of a set of published papers and modeling knowledge diffusion across a citation network.

R&D Engineer | Center of Intelligent Vision & Image Processing, Shiraz University, Shiraz, Iran [2016 - 2017]

I worked on projects focused on object detection, facial expression analysis, and real-time face recognition and tracking.

Publications

ACL'25

WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts

Negar Foroutan, Angelika Romanou, Matin Ansaripour, Julian Martin Eisenschlos, Karl Aberer, Rémi Lebret

Annual Meeting of the Association for Computational Linguistics (ACL), 2025 - Findings.

PDF Code

arXiv

ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

Negar Foroutan, Jakhongir Saydaliev, Ye Eun Kim, Antoine Bosselut

arXiv preprint, 2025.

PDF Code

arXiv

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, Thomas Wolf

arXiv preprint, 2025.

PDF Code

ICLR'25

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

Angelika Romanou, Negar Foroutan, Anna Sotnikova, et al.

International Conference on Learning Representations (ICLR), 2025.

PDF Project Page Spotlight

ACL'25

How Do Multilingual Models Remember? Investigating Multilingual Factual Recall Mechanisms

Constanza Fierro, Negar Foroutan, Desmond Elliott, Anders Søgaard

Annual Meeting of the Association for Computational Linguistics (ACL), 2025 - Findings.

PDF

PNAS'24

Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants

Beatriz Borges*, Negar Foroutan*, Deniz Bayazit*, Anna Sotnikova*, et al.

Proceedings of the National Academy of Sciences (PNAS), 2024.

PDF Code

EMNLP'24

Discovering Knowledge-Critical Subnetworks in Pretrained Language Models

Deniz Bayazit, Negar Foroutan, Zeming Chen, Gail Weiss, Antoine Bosselut

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.

PDF

EMNLP'23

Breaking the Language Barrier: Improving Cross-Lingual Reasoning with Structured Self-Attention

Negar Foroutan, Mohammadreza Banaei, Karl Aberer, Antoine Bosselut

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.

PDF Code

ACL'23

Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages

Yasmine Karoui, Rémi Lebret, Negar Foroutan, Karl Aberer

Annual Meeting of the Association for Computational Linguistics (ACL), 2023.

PDF Code

EMNLP'22

Discovering Language-neutral Sub-networks in Multilingual Language Models

Negar Foroutan, Mohammadreza Banaei, Rémi Lebret, Antoine Bosselut, Karl Aberer

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.

PDF Code

Multilingual Text Summarization on Financial Documents

Negar Foroutan, Angelika Romanou, Stéphane Massonnet, Rémi Lebret, Karl Aberer

Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022.

PDF

Sparse Communication for Training Deep Networks

Negar Foroutan, Martin Jaggi

Workshop on "Beyond first-order methods in ML systems" at ICML, 2020.

PDF

TCSS'17

Discovering the Hidden Structure of a Social Network: A Semi Supervised Approach

Negar Foroutan, Ali Hamzeh

IEEE Transactions on Computational Social Systems.

PDF

ETH Zurich at TREC Precision Medicine 2017

Negar Foroutan, Jannick Griner, Nicolas Mesot, Leandro von Werra, Carsten Eickhoff

TREC Precision Medicine 2017.

PDF

ICCKE'15

A lightweight method to investigate unknown social network structure

Negar Foroutan, Ardavan Afshar, Bahareh Ashenagar, Ali Hamzeh

5th International Conference on Computer and Knowledge Engineering (ICCKE).

PDF