DeTOX: A French-funded project on deepfake detection targeting important civilian and military personalities in France
Deepfake technology is AI software that can create a computer-generated version of a subject's face, based on a source picture of the subject and a driving video. The technology has advanced rapidly in recent years, shifting the potential of these manipulated videos far from their original entertainment purposes. Although deepfakes could be used to revolutionise the media and film industries, they can be misused for misinformation, mass manipulation and the targeted defamation of individuals. The threat of AI-generated footage of people appearing to say or do something they never did is deeply disturbing, especially since it can be used for information warfare, as has already happened during the war in Ukraine.
In this context, EURECOM professor Jean-Luc Dugelay, an expert in face recognition and biometrics, is collaborating with IRCAM, experts in speech and music processing, to provide a new deepfake detection tool dedicated to the most important civilian and military personalities in France. In other words, their solution will be able to answer a question such as "Is this video a deepfake of the President of France?" precisely and quickly.
Q. What was the funding call you have responded to for the DeTOX project?
JLD. The funding call is a French ANR call named ASTRID [1] (Accompagnement Spécifique des Travaux de Recherche et d'Innovation Défense), a specific call associated with the French Directorate General of Armaments. In our project proposal, we had to show that our results would be of interest for both military and civilian applications. This is why we chose the field of deepfakes, originally related to media but increasingly affecting politicians and even military conflicts. The project has two partners, EURECOM, expert in video, and IRCAM, expert in audio, and the funding will last two years, starting in early 2023.
Q. What is the context and motivation of the project?
JLD. Deepfakes are very difficult to detect automatically, especially if we don't know the technology in advance, namely the model used by the deepfake generator. Some generators are already known through existing deepfake videos, but new ones appear constantly, every week or month. In our project, the novel approach we propose is the following. Instead of trying to classify generically whether a certain video is fake, we will address the question of whether a certain video is a deepfake of a specific civilian or military personality of importance. In a nutshell, we propose a deepfake detector that is speaker-dependent, dedicated to important personalities of France, such as politicians, military leaders and influential figures. Restricting the problem to a pool of predefined people makes it narrower, but the question is much more precise, leading to better results than universal deepfake detectors.
Q. As you mentioned, deepfake adversary technologies are evolving extremely fast. What are the expected challenges regarding deepfake detection?
JLD. The novelty in our approach is to take into account the temporal aspect, such as facial expressions and emotions. For example, lip synchronisation is very specific to each individual speaker and is therefore very difficult for deepfake generators to manipulate.
The important point is to obtain a very large database of real videos of the targeted speaker, so that the model can be trained very accurately. The idea is to learn the temporal characteristics by feeding the neural network a large volume of data from the targeted speaker.
In the current state of the art, most approaches analyse only a few static frames to detect deepfake videos. We will use classical detection tools for images and audio, and on top of these a module for the temporal synchronisation of the two.
The amount of data needed to achieve the desired detection precision is not yet clear and will be determined during the project.
We will also need to define which neural network architecture is most appropriate, and to quantify how much more effective our speaker-dependent detector is compared to a generic one.
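As a toy illustration of the fusion described above, classical per-frame image scores combined with an audio-visual synchronisation cue, the sketch below is purely hypothetical: the signals, the correlation-based sync measure and the fusion weighting are illustrative assumptions, not the project's actual method.

```python
# Hypothetical sketch: fuse a static per-frame deepfake score with an
# audio-visual synchronisation cue. Low correlation between the lip-opening
# signal and the audio energy envelope suggests poor lip sync, which is
# harder for generators to fake convincingly.

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def fused_score(frame_scores, lip_opening, audio_energy, w_sync=0.5):
    """Combine mean per-frame fakeness with a desynchronisation term.

    frame_scores : per-frame fake probabilities from a static detector.
    lip_opening / audio_energy : per-frame temporal signals; a sync value
    near 1 means the mouth movement tracks the audio closely.
    """
    static = sum(frame_scores) / len(frame_scores)
    sync = max(0.0, pearson(lip_opening, audio_energy))
    return (1 - w_sync) * static + w_sync * (1 - sync)

# Illustrative values: a well-synced real clip vs. a desynchronised fake.
lips_real  = [0.1, 0.5, 0.9, 0.4, 0.2, 0.6]
audio      = [0.2, 0.6, 1.0, 0.5, 0.1, 0.7]   # tracks lips_real closely
lips_fake  = [0.9, 0.1, 0.2, 0.8, 0.9, 0.1]   # unrelated to the audio

real = fused_score([0.2] * 6, lips_real, audio)
fake = fused_score([0.2] * 6, lips_fake, audio)
```

The desynchronised clip receives a clearly higher fused score than the well-synced one, showing how a temporal module can flag a fake even when the static per-frame scores are identical.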
Q. How could your proposed deepfake detector be used in real life?
JLD. Since our proposed deepfake detector is speaker-dependent, we would need to collect a large volume of video and audio of the targeted personality. If the existing data are not sufficient, we could also produce the necessary data in a dedicated studio at IRCAM in Paris. For example, the CEO of a large company might be interested in a protective deepfake detection tool. In that case, during our AI model's training phase, we could record the person of interest saying specific texts, making specific facial expressions, speaking different languages and so on. Once the model is trained, the tool could be made available online, and deepfake verification could be completed within seconds.
Right now, our proposed deepfake detector cannot prevent the creation of deepfake videos, but it could help the media detect them very rapidly, before publishing their content.
Q. What is the future message for deepfake detection?
JLD. With the DeTOX project, we aim to move towards the temporal dimension in deepfake detection, helping to regain the advantage over deepfake generators. We believe that a single generic deepfake detector is not an optimal solution. New deepfake generators are appearing constantly, but if we focus on a specific speaker, we will have a training sample that remains stable over time.
Our AI tool would be transferable to any speaker of interest, as long as a large training dataset is available for them. However, for this specific call we will focus on very important civilian and military personalities in France.
As a novelty, we are taking the step of approaching deepfake detection by jointly considering audio and video synchronisation, which is currently an open question.
- The DeTOX project starts at the beginning of 2023, and two postdoc positions will open (one at EURECOM and one at IRCAM), ideally for candidates with a background in deepfakes, face recognition or audio signal processing.
REFERENCES