Speech synthesis, sound separation, automatic recognition of instruments or voices… Gaël Richard’s research at Télécom Paris has always focused on audio signal processing. The researcher has developed numerous acoustic signal analysis methods that have enabled important contributions to his discipline and are now used in applications for the automotive and music industries. His contributions to the academic community and to technology transfer have earned him the 2020 IMT-Académie des sciences Grand Prix.
Your early research work in the 1990s focused on speech synthesis: why did you choose this discipline?
Gaël Richard: I didn’t initially intend to become a researcher; I wanted to be a professional musician. After my baccalaureate I focused on classical music before finally returning to scientific studies. I then oriented my studies toward applied mathematics, particularly audio signal processing. During my Master’s internship and then my PhD, I began working on speech and singing voice synthesis. In the early 1990s, the first perfectly intelligible text-to-speech systems had just been developed. The aim at the time was to achieve better sound quality and naturalness, and to produce synthetic voices with more character and greater variability.
What research have you done on speech synthesis?
GR: To start with, I worked on synthesis based on signal processing approaches. The voice is modeled as being produced by a source – the vocal cords – whose signal passes through a filter – the throat and the nose. The aim is to represent the vocal signal with the parameters of this model, either to modify a recorded signal or to generate a new one by synthesis. I also explored physical modeling synthesis for a short while. This approach consists in representing voice production through a physical model: the vocal cords are modeled as springs on which the air pressure acts. Principles of fluid mechanics are then used to model the air flow through the vocal tract up to the lips.
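To make the source–filter picture concrete, here is a minimal sketch in Python of the general idea (an impulse-train excitation passed through a resonant filter). The sampling rate, pitch and formant values are illustrative assumptions, not taken from Gaël Richard’s actual systems.

```python
import numpy as np
from scipy.signal import lfilter

# Toy source-filter synthesis: an impulse train (the "vocal cords") is passed
# through an all-pole resonator (the "vocal tract"). All values are illustrative.
fs = 16000                       # sampling rate (Hz), assumed
f0 = 120                         # fundamental frequency of the excitation (Hz), assumed
n = fs // 2                      # half a second of signal

source = np.zeros(n)             # periodic glottal-like impulse train
source[::fs // f0] = 1.0

def resonator(freq, bandwidth, fs):
    """Denominator coefficients of a two-pole resonance (one formant)."""
    r = np.exp(-np.pi * bandwidth / fs)
    theta = 2 * np.pi * freq / fs
    return np.array([1.0, -2.0 * r * np.cos(theta), r ** 2])

# Two made-up formants roughly shaping a vowel-like spectrum.
a = np.convolve(resonator(700, 100, fs), resonator(1200, 120, fs))
voice = lfilter([1.0], a, source)   # excitation filtered by the "vocal tract"
```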
What challenges are you working on in speech synthesis research today?
GR: I have gradually extended the scope of my research to subjects other than speech synthesis, although I continue to do some work on it. For example, I am currently supervising a PhD student who is trying to understand how to adapt a voice to make it more intelligible in a noisy environment. We naturally adjust our voice in order to be better understood when surrounded by noise. The aim of his thesis, carried out with the PSA Group, is to transform the voice of a radio, a navigation assistant (GPS) or a telephone, originally produced in a quiet environment, so that it is more intelligible in a moving car, but without amplifying it.
As part of your work on audio signal analysis, you developed different approaches to signal decomposition, in particular those based on “non-negative matrix factorization”, one of the greatest achievements of your research career. Could you tell us what’s behind this complex term?
GR: The additive approach, which consists in gradually adding up the elementary components of an audio signal, is a time-honored method. In the case of speech synthesis, it means adding simple waveforms – sinusoids – to create complex, rich signals. To decompose a signal that we want to study, such as a natural singing voice, we can logically proceed the opposite way: take the starting signal and describe it as a sum of elementary components. We then have to specify which component is activated at which moment in order to recreate the signal over time.
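As a quick illustration of the additive idea, here is a minimal Python sketch that builds a richer tone as a sum of sinusoids; the fundamental frequency and harmonic weights are illustrative assumptions.

```python
import numpy as np

# Additive synthesis in miniature: a tone built by summing a few harmonics.
fs = 16000                                   # sampling rate (Hz), assumed
t = np.arange(0, 1.0, 1.0 / fs)              # one second of time samples
f0 = 220.0                                   # fundamental frequency (Hz), assumed
amplitudes = [1.0, 0.5, 0.3, 0.2]            # made-up weights of the first four harmonics

tone = sum(a * np.sin(2 * np.pi * (k + 1) * f0 * t)
           for k, a in enumerate(amplitudes))
```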
The non-negative matrix factorization method allows us to obtain such a decomposition in the form of a product of two matrices: one matrix is a dictionary of the elementary components of the signal, and the other describes the activation of the dictionary elements over time. Multiplied together, these two matrices describe the audio signal in mathematical form. “Non-negative” simply means that every element of these matrices is zero or positive, i.e. each source or component contributes positively (or not at all) to the signal.
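In code, the idea can be sketched as follows, with made-up dimensions and random data standing in for a magnitude spectrogram; this is the classical multiplicative-update scheme of Lee and Seung, not the specific variants developed at Télécom Paris.

```python
import numpy as np

# Toy NMF of a "magnitude spectrogram" V (frequency x time): V ≈ W @ H, where
# W holds the dictionary of elementary spectra and H their activations in time.
rng = np.random.default_rng(0)
V = rng.random((257, 400))           # stand-in for a real spectrogram (assumed sizes)
K = 10                               # number of elementary components (assumed)

W = rng.random((257, K)) + 1e-3      # dictionary: one spectral "atom" per column
H = rng.random((K, 400)) + 1e-3      # activations: when each atom is used

# Multiplicative updates minimizing the Euclidean reconstruction error;
# all factors stay non-negative by construction.
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-12)

print("reconstruction error:", np.linalg.norm(V - W @ H))
```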
Why is this signal decomposition approach so interesting?
GR: This decomposition is very convenient for introducing prior knowledge. For example, if we know that a violin is present, we can introduce this knowledge into the dictionary by specifying that some of the elementary atoms of the signal will be characteristic of the violin. This makes it possible to refine the description of the rest of the signal. It is a clever description because it is simple to formulate and to handle, while making it possible to work efficiently on the decomposed signal.
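One simple way to express this in code is to pin a few dictionary columns to known violin spectra and learn only the remaining columns. The sketch below continues the toy example above, and the “violin atoms” are random placeholders rather than real templates.

```python
import numpy as np

# Informed NMF sketch: the first columns of W are fixed to known violin atoms.
rng = np.random.default_rng(1)
V = rng.random((257, 400))                    # stand-in spectrogram (assumed sizes)
violin_atoms = rng.random((257, 4))           # would come from real violin recordings
W = np.hstack([violin_atoms, rng.random((257, 6))]) + 1e-3
H = rng.random((W.shape[1], 400)) + 1e-3
fixed = np.arange(4)                          # indices of the pinned violin columns

for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
    W_new = W * (V @ H.T) / (W @ H @ H.T + 1e-12)
    W_new[:, fixed] = W[:, fixed]             # keep the known violin atoms unchanged
    W = W_new
```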
This non-negative matrix factorization method has led you to subjects other than speech synthesis. What are its applications?
GR: One of the major applications of this technique is source separation. One of our first approaches was to extract the singing voice from polyphonic music recordings. The principle is that, for a given source, all the elementary components are activated at the same time – for example, all the harmonics of a note played by an instrument. To simplify, we can say that non-negative matrix factorization allows us to isolate each note played by a given instrument by representing it as a sum of elementary components (certain columns of the “dictionary” matrix) which are activated over time (certain rows of the “activation” matrix). At the end of the process, we obtain a mathematical description in which each source has its own dictionary of elementary sound atoms. We can then replay only the sequence of notes played by a specific instrument: we reconstruct the signal by multiplying the non-negative matrices after setting to zero all note activations that do not correspond to the instrument we want to isolate.
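A minimal sketch of that last step, assuming an NMF V ≈ W @ H has already been computed and that we know which components belong to the target source (both assumptions, for illustration only):

```python
import numpy as np

def isolate_source(W, H, source_components):
    """Spectrogram of one source: keep only its activations, zero the others."""
    H_src = np.zeros_like(H)
    H_src[source_components, :] = H[source_components, :]
    return W @ H_src

def soft_mask(V, W, H, source_components, eps=1e-12):
    """Wiener-like mask mapping the isolated source back onto the mixture V."""
    V_src = isolate_source(W, H, source_components)
    return V * V_src / (W @ H + eps)

# Example with hypothetical indices: components 0-3 were identified as the voice.
# V_voice = soft_mask(V, W, H, source_components=[0, 1, 2, 3])
```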
What new prospects can be considered thanks to the precision of this description?
GR: Today, we are working on “informed” source separation, which incorporates additional prior knowledge about the sources into the separation process. I currently co-supervise a PhD student who is using knowledge of the lyrics to help isolate the singing voice. There are multiple applications: from automatic karaoke generation by removing the detected voice, to remastering or transforming music and movie soundtracks. I have another PhD student whose thesis is on isolating a singing voice using a simultaneously recorded electroencephalogram (EEG) signal. The idea is to ask a person to wear an EEG cap and focus their attention on one of the sound sources. We can then obtain information from the recorded brain activity and use it to improve the source separation.
Your work allows you to identify specific sound sources through audio signal processing… to the point of automatic recognition?
GR: We have indeed worked on automatic sound classification, first through tests on recognizing emotion, particularly fear or panic. The project was carried out with Thales to anticipate crowd movements. Beyond detecting the emotion itself, we wanted to measure the rise or fall of panic. However, there are very few sound datasets on this subject, which turned out to be a real challenge for this work. On another subject, we are currently working with Deezer on the automatic detection of content that is offensive or unsuitable for children, in order to offer a sort of parental filter service, for example. In another project, on advertising videos with Creaminal, we detect the key or emotionally culminating moments in a video in order to automatically propose the most appropriate music at the right time.
On the subject of music, is your work used for automatic song detection, like the Shazam application?
GR: Shazam uses an algorithm based on a fingerprinting principle. When you activate it, the app records an audio fingerprint over a certain time, then compares this fingerprint with the content of its database. Although very efficient, the system is limited to recognizing strictly identical recordings. Our aim is to go further, by recognizing different versions of a song, such as live recordings or covers by other singers, when only the studio version is stored in the database. We have filed a patent on a technology that goes beyond the initial fingerprint algorithm, which is too limited for this kind of application. In particular, we use a stage of automatic estimation of the harmonic content, or more precisely of the sequences of musical chords. This patent is at the center of a start-up project.
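By way of illustration only (this is not the patented technology), comparing two versions of a song through their harmonic content can be sketched with chroma features and dynamic time warping, for instance with the librosa library; the file names below are hypothetical.

```python
import librosa

# Compare two recordings through their harmonic content (chroma sequences),
# aligning them with DTW so that tempo differences between a studio version
# and a live cover do not dominate the score. Purely illustrative.
def harmonic_distance(path_a, path_b):
    y_a, sr_a = librosa.load(path_a)
    y_b, sr_b = librosa.load(path_b)
    chroma_a = librosa.feature.chroma_cqt(y=y_a, sr=sr_a)   # rough proxy for chord content
    chroma_b = librosa.feature.chroma_cqt(y=y_b, sr=sr_b)
    D, wp = librosa.sequence.dtw(X=chroma_a, Y=chroma_b, metric="cosine")
    return D[-1, -1] / len(wp)        # alignment cost, normalized by path length

# score = harmonic_distance("studio_version.wav", "live_cover.wav")  # hypothetical files
```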
Your research is closely linked to the industrial sector and has led to multiple technology transfers. But you have also made several free software contributions for the wider community.
GR: One of the team’s biggest contributions in this field is the audio feature extraction software YAAFE. The associated paper is one of my most cited articles, and the tool is still regularly downloaded, even though it dates from 2010. In general, I am in favor of the reproducibility of research, and I publish the algorithms from our work as often as possible. In any case, reproducibility is a major topic in AI and data science, fields that are clearly on the rise. We also make a point of publishing the datasets created through our work. That is essential too, and it is always satisfying to see that our databases have an important impact on the community.