The face and voice of a person have unique characteristics, and both are widely used as biometric measures for person authentication, either in unimodal or multimodal systems. A strong correlation exists between a person's face and voice, which has attracted significant research interest. Although previous works have established the association between faces and voices, none of these approaches investigated the effect of multiple languages on this task. Since half of the world's population is bilingual and communication increasingly takes place in multilingual scenarios, it is essential to investigate the effect of language on associating faces with voices. The goal of the Face-voice Association in Multilingual Environments (FAME) 2024 challenge is therefore to analyze the impact of multiple languages on the face-voice association task. For more information on the challenge, please see the evaluation plan.

CHALLENGE WINNERS

The results have now been announced. Congratulations to the winners!

| Rank | Team Name | Primary Contact | Affiliation | Score (EER) | System Description Report |
|------|-----------|-----------------|-------------|-------------|----------------------------|
| 1 | HLT | Tao Ruijie | National University of Singapore | 19.91 | Click Here! |
| 2 | Audio_Visual | Wuyang Chen | National University of Defense Technology | 20.51 | Click Here! |
| 3 | Xaiofei | Tang Jie Hui | Hefei University of Technology | 21.76 | Click Here! |

DATASET

Our dataset comprises two versions, MAV-CELEB v1 and MAV-CELEB v2, each containing disjoint speaker identities. v1 contains audio-visual data of speakers in Urdu and English, while v2 contains speakers in Hindi and English.

The dataset is available at the following links:

To view the meta-data for the dataset, see the PDFs attached below:

The dataset statistics are summarized below; slash-separated values are reported per language split (English / Urdu / combined for v1, and English / Hindi / combined for v2):

| Statistic | MAV-CELEB v1 | MAV-CELEB v2 |
|-----------|--------------|--------------|
| Languages | E / U / EU | E / H / EH |
| # of celebs | 70 | 84 |
| # of male celebs | 43 | 56 |
| # of female celebs | 27 | 28 |
| # of videos | 402 / 555 / 957 | 646 / 484 / 1130 |
| # of hours | 30 / 54 / 84 | 51 / 33 / 84 |
| # of utterances | 6850 / 12706 / 19556 | 12579 / 8136 / 20715 |
| Avg # of videos per celebrity | 6 / 8 / 14 | 8 / 6 / 14 |
| Avg # of utterances per celebrity | 98 / 182 / 280 | 150 / 97 / 247 |
| Avg length of utterance (s) | 15.8 / 15.3 / 15.6 | 14.6 / 14.6 / 14.6 |

BASELINE MODEL

We provide a baseline model trained on features extracted from facial and audio data (VGGFace features for images and an utterance-level aggregator for voices). To learn a discriminative joint face-voice embedding for the F-V association task, we developed a framework for cross-modal face-voice association (see Fig. 1). It is fundamentally a two-stream pipeline featuring a lightweight fusion module that exploits complementary cues from the face and voice embeddings and facilitates discriminative identity mapping via orthogonality constraints. A minimal illustrative sketch is given after Fig. 1.

Link to the paper: Fusion and Orthogonal Projection for Improved Face-Voice Association
Link to the Paper's code: https://github.com/msaadsaeed/FOP
Link to the Baseline code: https://github.com/mavceleb/mavceleb_baseline

Figure 1: Diagram showing our methodology.
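To make the two-stream design concrete, below is a minimal PyTorch sketch of a fusion model with an orthogonality-style identity loss. It is an illustrative approximation under assumed feature dimensions, gating design, and loss weighting; the actual implementation is in the baseline repository linked above.

```python
# Minimal sketch of a two-stream face-voice fusion model with an
# orthogonality-based identity loss. All dimensions, the gated fusion,
# and the 0.1 loss weight are assumptions; see the baseline repo for
# the official implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamFusion(nn.Module):
    def __init__(self, face_dim=4096, voice_dim=512, embed_dim=128, num_ids=64):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, embed_dim)    # face stream
        self.voice_proj = nn.Linear(voice_dim, embed_dim)  # voice stream
        self.gate = nn.Linear(2 * embed_dim, embed_dim)    # lightweight fusion
        self.classifier = nn.Linear(embed_dim, num_ids, bias=False)

    def forward(self, face_feat, voice_feat):
        f = F.normalize(self.face_proj(face_feat), dim=-1)
        v = F.normalize(self.voice_proj(voice_feat), dim=-1)
        g = torch.sigmoid(self.gate(torch.cat([f, v], dim=-1)))
        joint = g * f + (1 - g) * v  # fused joint embedding
        return f, v, joint

def identity_loss(model, joint, labels):
    """Cross-entropy on the fused embedding plus a penalty pushing the
    classifier's identity prototypes toward mutual orthogonality."""
    ce = F.cross_entropy(model.classifier(joint), labels)
    w = F.normalize(model.classifier.weight, dim=-1)
    off_diag = w @ w.t() - torch.eye(w.size(0), device=w.device)
    return ce + 0.1 * (off_diag ** 2).mean()
```

At test time, the per-modality embeddings f and v can be compared directly (see the scoring sketch in the Task section below).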

TASK

Cross-modal Verification

Face-voice association is established through a cross-modal verification task. The goal of this task is to verify whether, given a sample containing both a face and a voice, the two belong to the same identity. In addition, we analyze the impact of multiple languages on the cross-modal verification task (a scoring sketch follows Fig. 2).

Figure 2: Diagram explaining cross-modal verification and matching task in face-voice association.
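As an illustration of how a verification score can be produced, the sketch below compares the per-modality embeddings from the baseline sketch above using cosine similarity. This scoring choice is an assumption, not the official protocol (the submission format refers to L2 scores, so an L2 distance converted to a similarity would work equally well):

```python
import torch
import torch.nn.functional as F

def verification_score(face_emb: torch.Tensor, voice_emb: torch.Tensor) -> float:
    """Return a confidence that the face and voice belong to the same person.
    Cosine similarity on L2-normalized embeddings: higher = more confident.
    This scoring choice is an assumption, not the official protocol."""
    f = F.normalize(face_emb, dim=-1)
    v = F.normalize(voice_emb, dim=-1)
    return float((f * v).sum())
```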

EVALUATION METRICS

We use the Equal Error Rate (EER) as the metric for evaluating challenge performance. Participants are expected to submit an output score file for every test pair, indicating how confident the system is that the face and voice match, i.e., that they belong to the same person. The higher the score, the greater the confidence that the face and voice come from the same person. In real-world applications, a threshold is set to convert scores into a binary same/different decision. As the threshold increases, the false acceptance rate (FAR) decreases while the false rejection rate (FRR) increases. The EER is the operating point at which FAR and FRR are equal. EER is therefore better suited than conventional accuracy for evaluating such systems, since it is independent of the threshold. The lower the EER, the better the system. For more information, please see the evaluation plan.
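For reference, the sketch below shows a common way to estimate EER from submitted scores and ground-truth labels via the ROC curve. This is a standard recipe using scikit-learn, not the official scoring script:

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """labels: 1 if the face-voice pair shares an identity, else 0.
    scores: higher = more confident of a match.
    Returns the EER (%) at the ROC operating point where FAR = FRR."""
    far, tpr, _ = roc_curve(labels, scores)  # FAR is the false positive rate
    frr = 1.0 - tpr                          # FRR is the miss rate
    idx = np.nanargmin(np.abs(far - frr))    # point closest to FAR == FRR
    return 100.0 * (far[idx] + frr[idx]) / 2.0
```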

SUBMISSION

Within the directory containing the submission files, run zip archive.zip *.txt; do not zip the folder itself. Files should be named as:

Files are submitted through Codalab; during the evaluation phase, up to 3 submissions are allowed per day.

We provide both train and test splits for v2 of the MAV-Celeb dataset. Participants can use this split for fine-tuning their methods. However, for v1 the test files are in the format below:

We have withheld the ground truth to ensure fair evaluation during the FAME challenge. Participants are expected to compute and submit a text file including the id and L2 scores in the following format:

The overall score will be computed as:

Overall Score = (Sum of all EERs) / 4
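A minimal sketch of this aggregation, assuming the four EERs correspond to the four test configurations described in the evaluation plan (the labels and values below are hypothetical placeholders, not official file names or results):

```python
# Average the four per-configuration EERs into the overall challenge score.
# Keys and values are hypothetical placeholders for illustration only.
eers = {"test_config_1": 20.3, "test_config_2": 22.1,
        "test_config_3": 19.7, "test_config_4": 21.4}
overall_score = sum(eers.values()) / 4
print(f"Overall Score (EER, %): {overall_score:.2f}")
```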


Link to Codalab: Codalab

REGISTRATION

We welcome participants to apply for the “FAME Challenge 2024” by expressing their interest via Google Forms at this link.

For any queries please contact us at our email mavceleb@gmail.com.

TIMELINE

ORGANIZERS

Muhammad Saad Saeed - Swarm Robotics Lab (SRL)-NCRA, University of Engineering and Technology Taxila
Shah Nawaz - Institute of Computational Perception, Johannes Kepler University Linz, Austria
Muhammad Salman Tahir - Swarm Robotics Lab (SRL)-NCRA, University of Engineering and Technology Taxila
Rohan Kumar Das - Fortemedia Singapore, Singapore
Muhammad Zaigham Zaheer - Mohamed bin Zayed University of Artificial Intelligence
Marta Moscati - Institute of Computational Perception, Johannes Kepler University Linz, Austria
Markus Schedl - Institute of Computational Perception, Johannes Kepler University Linz, Austria | Human-centered AI Group, AI Lab, Linz Institute of Technology, Austria
Muhammad Haris Khan - Mohamed bin Zayed University of Artificial Intelligence
Karthik Nandakumar - Mohamed bin Zayed University of Artificial Intelligence
Muhammad Haroon Yousaf - Swarm Robotics Lab (SRL)-NCRA, University of Engineering and Technology Taxila