The face and voice of a person have unique characteristics, and they are widely used as biometric measures for person authentication, either in unimodal or multimodal systems. A strong correlation has been found between the face and voice of a person, which has attracted significant research interest. Although previous works have established the association between faces and voices, none of these approaches investigated the effect of multiple languages on this task. Since half of the world's population is bilingual and we increasingly communicate in multilingual scenarios, it is essential to investigate the effect of language on associating faces with voices. Thus, the goal of the Face-voice Association in Multilingual Environments (FAME) 2026 challenge is to analyze the impact of multiple languages on the face-voice association task. For more information on the challenge, please see the evaluation plan.

DATASET

Our dataset, MAV-Celeb, comprises three versions, each containing distinct (non-overlapping) speaker identities: v1 includes audiovisual data of speakers in English and Urdu, v2 in English and Hindi, and v3 extends the dataset with English and German speakers. The FAME 2026 challenge is based on v1 and v3.

The dataset is available at the following links:

To view the meta-data for the dataset, see the files below:

The file structure is as follows:

Dataset Structure




|                                   | English-Urdu         | English-Hindi        | English-German     |
|-----------------------------------|----------------------|----------------------|--------------------|
| Languages                         | E/U/EU               | E/H/EH               | E/G/EG             |
| # of celebrities                  | 70                   | 84                   | 58                 |
| # of male celebrities             | 43                   | 56                   | 40                 |
| # of female celebrities           | 27                   | 28                   | 18                 |
| # of videos                       | 560 / 406 / 966      | 546 / 668 / 1214     | 212 / 216 / 428    |
| # of hours                        | 59 / 32 / 91         | 48 / 60 / 109        | 8.2 / 6.6 / 14.8   |
| # of utterances                   | 11835 / 6550 / 18385 | 9974 / 13313 / 23287 | 2043 / 1769 / 3812 |
| Avg # of videos per celebrity     | 8 / 6 / 14           | 6 / 8 / 14           | 3.7 / 3 / 7.7      |
| Avg # of utterances per celebrity | 169 / 94 / 263       | 119 / 158 / 277      | 35 / 31 / 66       |
| Avg length of utterances (in s)   | 17.9 / 17.8 / 17.8   | 17.4 / 16.5 / 16.9   | 14.5 / 13.5 / 14.0 |

BASELINE MODEL

We provide a baseline model trained on extracted facial and audio features (VGGFace features for images and utterance-level aggregated embeddings for voices). To learn a discriminative joint face-voice embedding for F-V association tasks, we develop a new framework for cross-modal face-voice association (see Fig. 1). It is fundamentally a two-stream pipeline and features a lightweight module that exploits complementary cues from both face and voice embeddings and facilitates discriminative identity mapping via orthogonality constraints. A minimal illustrative sketch of this design is given after Fig. 1 below.

Link to the paper: Fusion and Orthogonal Projection for Improved Face-Voice Association
Link to the Paper's code: https://github.com/msaadsaeed/FOP
Link to the Baseline code: https://github.com/mavceleb/mavceleb_baseline

Figure 1: Diagram showing our methodology.
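
The following is a minimal PyTorch-style sketch of the two-stream idea: both modalities are projected into a joint space, fused by a lightweight gating module, and trained with an orthogonality-based constraint. It is only an illustration under simplified assumptions; the module structure, feature dimensions, gating scheme, and loss formulation here are placeholders, not the authors' actual implementation. Please refer to the FOP and baseline repositories linked above for the real code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamFaceVoice(nn.Module):
    """Illustrative two-stream face-voice embedding network.

    face_dim / voice_dim are the dimensionalities of the pre-extracted face
    and voice features; the values below are placeholders, not the baseline's
    actual configuration.
    """
    def __init__(self, face_dim=4096, voice_dim=512, embed_dim=128, num_ids=64):
        super().__init__()
        self.face_proj = nn.Sequential(nn.Linear(face_dim, embed_dim), nn.ReLU())
        self.voice_proj = nn.Sequential(nn.Linear(voice_dim, embed_dim), nn.ReLU())
        # Lightweight gated fusion of the two streams (a simple stand-in for
        # the fusion module described in the FOP paper).
        self.gate = nn.Linear(2 * embed_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_ids, bias=False)

    def forward(self, face_feat, voice_feat):
        f = F.normalize(self.face_proj(face_feat), dim=-1)
        v = F.normalize(self.voice_proj(voice_feat), dim=-1)
        g = torch.sigmoid(self.gate(torch.cat([f, v], dim=-1)))
        fused = g * f + (1.0 - g) * v        # complementary cues from both embeddings
        return f, v, self.classifier(fused)  # embeddings + identity logits


def orthogonality_loss(f, v, same_identity):
    """Toy orthogonality constraint: same-identity face/voice embeddings are
    pulled together, different-identity pairs are pushed toward orthogonality
    (zero cosine similarity). Embeddings are assumed to be L2-normalised."""
    cos = (f * v).sum(dim=-1)
    same = same_identity.float()
    return (same * (1.0 - cos) + (1.0 - same) * cos.clamp(min=0.0)).mean()
```

In training, the identity logits would typically be supervised with a cross-entropy loss on speaker identity, with the orthogonality term added as an auxiliary objective; the weighting between the two is a design choice not specified here.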

TASK

Cross-modal Verification

Face-voice association is established through a cross-modal verification task. The goal of this task is to verify whether, given a face and a voice sample, both belong to the same identity. In addition, we analyze the impact of multiple languages on the cross-modal verification task.
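
In practice, a verification system reduces to assigning a score to each face-voice test pair. The snippet below is a minimal illustration of such scoring using the negative L2 distance between unit-normalised face and voice embeddings; the embedding inputs are hypothetical, and participants may use whatever scoring function their model provides.

```python
import numpy as np

def verification_score(face_emb, voice_emb):
    """Higher score = more confident that the face and voice share an identity.
    The score is the negative L2 distance between unit-normalised embeddings;
    this is only an illustrative choice, not the required scoring rule."""
    f = face_emb / np.linalg.norm(face_emb)
    v = voice_emb / np.linalg.norm(voice_emb)
    return -float(np.linalg.norm(f - v))
```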

Figure 2: (Left) F-V association is established with a cross-modal verification task. (Right) The FAME 2026 Challenge extends the task to analyze the impact of multiple languages.

EVALUATION METRICS

We use the Equal Error Rate (EER) as the metric for evaluating challenge performance. Participants are expected to submit an output score file with one score per test pair, indicating how confident the system is that the face and voice match, i.e., that they belong to the same person. The higher the score, the greater the confidence that the face and voice come from the same person. In real-world applications, a threshold is set on this score to obtain a binary same/different decision. As the threshold increases, the false acceptance rate (FAR) decreases while the false rejection rate (FRR) increases. The EER is the operating point at which FAR and FRR are equal. EER is therefore better suited than conventional accuracy for evaluating such systems, since it is independent of the threshold. Finally, the lower the EER, the better the system. For more information, please see the evaluation plan.
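
As an illustration, the following is a minimal sketch of how EER can be computed from a set of pair scores, assuming binary ground-truth labels (1 = same identity, 0 = different identity). The labels and scores used here are made-up examples; the official scoring is performed on the organizers' side.

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """labels: 1 if the face and voice belong to the same person, else 0.
    scores: higher value = more confident that the pair matches."""
    far, tpr, _ = roc_curve(labels, scores, pos_label=1)  # FAR = false positive rate
    frr = 1.0 - tpr                                       # FRR = false negative rate
    idx = np.nanargmin(np.abs(frr - far))                 # operating point where FAR ~ FRR
    return (far[idx] + frr[idx]) / 2.0

# Toy example: a lower EER indicates a better system.
labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.4, 0.3, 0.8, 0.7, 0.1])
print(f"EER = {compute_eer(labels, scores):.3f}")
```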

SUBMISSION

From within the directory containing the submission files, run zip archive.zip *.txt; do not zip the folder itself. Files should be named as:

Files are submitted through Codalab; during the evaluation phase, up to 3 submissions per day are allowed.

We provide both train and test splits for v3 of the MAV-Celeb dataset. Participants can use this split for fine-tuning their method. However, for v1, the test files are in the format shown below:

We have withheld the ground truth to ensure fair evaluation during the FAME challenge. Participants are expected to compute and submit a text file containing the pair id and the L2 scores in the following format: The overall score will be computed as:

Overall Score = (Sum of all EERs) / 4
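
For illustration, the overall score is the average of the four per-condition EERs. The condition names and values below are hypothetical placeholders (assuming the four conditions correspond to the per-language splits of v1 and v3); they are not official results or identifiers.

```python
# Hypothetical per-condition EERs; names and values are placeholders only.
eers = {"v1_english": 0.31, "v1_urdu": 0.34, "v3_english": 0.28, "v3_german": 0.30}
overall_score = sum(eers.values()) / len(eers)  # (sum of all EERs) / 4
print(f"Overall Score = {overall_score:.3f}")
```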


Link to Codalab: Codalab/Codabench

REGISTRATION

We welcome participants to apply for the “FAME Challenge 2026” by expressing their interest via Google Forms at this link.

For any queries, please contact us at mavceleb@gmail.com.

TIMELINE

ORGANIZERS

Marta Moscati - Institute of Computational Perception, Johannes Kepler University Linz, Austria
Ahmed Abdullah - National University of Computer and Emerging Sciences, Pakistan
Muhammad Saad Saeed - University of Michigan, USA
Shah Nawaz - Institute of Computational Perception, Johannes Kepler University Linz, Austria
Rohan Kumar Das - Fortemedia Singapore, Singapore
Muhammad Zaigham Zaheer - Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates
Junaid Mir - University of Engineering and Technology Taxila, Pakistan
Muhammad Haroon Yousaf - University of Engineering and Technology Taxila, Pakistan
Khalid Malik - University of Michigan, USA
Markus Schedl - Institute of Computational Perception, Johannes Kepler University Linz, Austria | Human-centered AI Group, AI Lab, Linz Institute of Technology, Austria