Audio and machine learning: Gaël Richard’s award-winning project

Gaël Richard, a researcher in Information Processing at Télécom Paris, has been awarded an Advanced Grant from the European Research Council (ERC) for his project entitled HI-Audio. This initiative aims to develop hybrid approaches that combine signal processing with deep machine learning for the purpose of understanding and analyzing sound.

“Artificial intelligence now relies heavily on deep neural networks, which have a major shortcoming: they require very large databases for learning,” says Gaël Richard, a researcher in Information Processing at Télécom Paris. He believes that “using signal models, or physical sound propagation models, in a deep learning algorithm would reduce the amount of data needed for learning while still allowing for the high controllability of the algorithm.” Gaël Richard plans to pursue this breakthrough via his HI-Audio* project, which won an ERC Advanced Grant on April 26, 2022.

For example, the integration of physical sound propagation models can improve the characterization and configuration of the types of sound analyzed and help to develop an automatic sound recognition system. “The applications for the methods developed in this project focus on the analysis of music signals and the recognition of sound scenes, which is the identification of the recording’s sound environment (outside, inside, airport) and all the sound sources present,” Gaël Richard explains.

Industrial applications

Learning sound scenes could help autonomous cars identify their surroundings. Using microphones, the algorithm would be able to identify surrounding sounds: the vehicle could recognize the sound of a siren and its variations in sound intensity. Autonomous cars would then be able to change lanes to let an ambulance or fire engine pass, without having to “see” it on the detection cameras. The processes developed in the HI-Audio project could be applied to many other areas. The algorithms could be used in predictive maintenance to check the quality of parts on a production line. A car part, such as a bumper, is typically inspected based on the sound resonance generated when a non-destructive impact is applied.

The other key applications for the HI-Audio project are in the field of AI for music, particularly to assist musical creation by developing new interpretable methods for sound synthesis and transformation.

Machine learning and music

“One of the goals of this project is to build a database of music recordings from a wide variety of styles and different cultures,” Gaël Richard explains. “This database, which will be automatically annotated (with precise semantic information), will expand the research to include less studied or less distributed music, especially from audio streaming platforms,” he says. One of the challenges of this project is that of developing algorithms capable of recognizing the words and phrases spoken by the performers, transcribing the music regardless of its recording location, and contributing new musical transformation capabilities (style transfer, rhythmic transformation, word changes).

“One important aspect of the project will also be the separation of sound sources,” Gaël Richard says. In an audio file, the separation of sources, which in the case of music are each linked to a different instrument, is generally achieved via filtering or “masking”. The idea is to hide all other sources until only the target source remains. One less common approach is to isolate the instrument via sound synthesis. This involves analyzing the music to characterize the sound source to be extracted in order to reproduce it. For Gaël Richard, “the advantage is that, in principle, artifacts from other sources are entirely absent. In addition, the synthesized source can be controlled by a few interpretable parameters, such as the fundamental frequency, which is directly related to the sound’s perceived pitch.” He adds: “This type of approach opens up tremendous opportunities for sound manipulation and transformation, with real potential for developing new tools to assist music creation.”
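
To make the “masking” idea concrete, here is a minimal sketch of one common variant, a ratio (or “soft”) mask applied to a time-frequency representation. The article does not say which masking scheme the project uses; the function names, the toy magnitudes and the assumption that magnitudes simply add up in the mixture are all illustrative.

```python
def soft_mask(target_mag, other_mag, eps=1e-12):
    """Ratio mask: the share of each time-frequency bin attributed to the target.

    Both inputs are lists of rows (frequency bins) of non-negative magnitudes.
    `eps` only guards against division by zero in empty bins.
    """
    return [[t / (t + o + eps) for t, o in zip(t_row, o_row)]
            for t_row, o_row in zip(target_mag, other_mag)]


def apply_mask(mixture_mag, mask):
    """Attenuate every bin of the mixture by the corresponding mask value."""
    return [[m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, mixture_mag)]


# Hypothetical magnitude estimates (2 frequency bins x 2 time frames).
target_mag = [[4.0, 0.0], [1.0, 3.0]]   # the source we want to keep
other_mag = [[0.0, 2.0], [1.0, 1.0]]    # everything else
# Simplifying assumption: magnitudes add up in the mixture.
mixture_mag = [[t + o for t, o in zip(t_row, o_row)]
               for t_row, o_row in zip(target_mag, other_mag)]

mask = soft_mask(target_mag, other_mag)
estimate = apply_mask(mixture_mag, mask)
```

Bins dominated by the other sources receive a mask value close to 0 and are hidden, while bins dominated by the target pass through almost unchanged.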

*HI-Audio will start on October 1st, 2022 and will be funded by the ERC Advanced Grant for five years for a total amount of €2.48 million.

Rémy Fauvel

Gaël Richard, IMT-Académie des sciences Grand Prix

Speech synthesis, sound separation, automatic recognition of instruments or voices… Gaël Richard’s research at Télécom Paris has always focused on audio signal processing. The researcher has created numerous acoustic signal analysis methods, thanks to which he has made important contributions to his discipline. These methods are currently used in various applications for the automotive and music industries. His contributions to the academic community and technology transfer have earned him the 2020 IMT-Académie des sciences Grand Prix.

Your early research work in the 1990s focused on speech synthesis: why did you choose this discipline?

Gaël Richard: I didn’t initially intend to become a researcher; I wanted to be a professional musician. After my baccalaureate I focused on classical music before finally returning to scientific study. I then oriented my studies toward applied mathematics, particularly audio signal processing. During my Master’s internship and then my PhD, I began to work on speech and singing voice synthesis. In the early 1990s, the first perfectly intelligible text-to-speech systems had just been developed. The aim at the time was to achieve a better sound quality and naturalness and to produce synthetic voices with more character and greater variability.

What research have you done on speech synthesis?

GR: To start with, I worked on synthesis based on signal processing approaches. The voice is considered as being produced by a source – the vocal cords – which passes through a filter – the throat and the nose. The aim is to represent the vocal signal using the parameters of this model to either modify a recorded signal or generate a new one by synthesis. I also explored physical modeling synthesis for a short while. This approach consists in representing voice production through a physical model: vocal cords are springs that the air pressure acts on. We then use fluid mechanics principles to model the air flow through the vocal tract to the lips.
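
The source-filter idea described above can be sketched in a few lines. This is only an illustration under strong simplifying assumptions, not the model used in the research: the source is reduced to a train of impulses at the fundamental frequency, the filter to a single two-pole resonator, and all names and numeric values are hypothetical.

```python
import math


def pulse_train(f0_hz, n_samples, sample_rate):
    """Crude stand-in for the vocal cords: one impulse per pitch period."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(x, center_hz, bandwidth_hz, sample_rate):
    """Two-pole filter standing in for one resonance of the vocal tract.

    Implements y[n] = x[n] + 2 r cos(theta) y[n-1] - r^2 y[n-2].
    """
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2.0 * math.pi * center_hz / sample_rate
    b1, b2 = 2.0 * r * math.cos(theta), -r * r
    y = []
    for n, xn in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(xn + b1 * y1 + b2 * y2)
    return y


sample_rate = 8000
source = pulse_train(f0_hz=100.0, n_samples=800, sample_rate=sample_rate)
voiced = resonator(source, center_hz=700.0, bandwidth_hz=100.0,
                   sample_rate=sample_rate)
```

Changing `f0_hz` alters the pitch without touching the filter, and changing `center_hz` alters the timbre without touching the source, which is what makes the parameters of such a model convenient for modifying or generating a signal.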

What challenges are you working on in speech synthesis research today?

GR: I have gradually extended the scope of my research to include subjects other than speech synthesis, although I continue to do some work on it. For example, I am currently supervising a PhD student who is trying to understand how to adapt a voice to make it more intelligible in a noisy environment. We are naturally able to adjust our voice in order to be better understood when surrounded by noise. The aim of his thesis, which he is carrying out with the PSA Group, is to change the voice of a radio, navigation assistant (GPS) or telephone, initially pronounced in a silent environment, so that it is more intelligible in a moving car, but without amplifying it.

As part of your work on audio signal analysis, you developed different approaches to signal decomposition, in particular those based on “non-negative matrix factorization”. It was one of the greatest achievements of your research career. Could you tell us what’s behind this complex term?

GR: The additive approach, which consists in gradually adding the elementary components of the audio signal, is a time-honored method. In the case of speech synthesis, it means adding simple waveforms – sinusoids – to create complex or rich signals. To decompose a signal that we want to study, such as a natural singing voice, we can logically proceed the opposite way, by taking the starting signal and describing it as a sum of elementary components. We then have to say which component is activated and at what moment to recreate the signal in time.
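
As a small illustration of the additive approach, the sketch below builds a richer tone by summing sinusoids. The function name, the sample rate and the list of partials are hypothetical values chosen for the example.

```python
import math


def additive_synthesis(partials, duration_s, sample_rate=8000):
    """Sum sinusoids; `partials` is a list of (frequency_hz, amplitude) pairs."""
    n_samples = round(duration_s * sample_rate)
    signal = []
    for n in range(n_samples):
        t = n / sample_rate
        signal.append(sum(amp * math.sin(2.0 * math.pi * freq * t)
                          for freq, amp in partials))
    return signal


# A 220 Hz fundamental with two weaker harmonics (illustrative values).
partials = [(220.0, 1.0), (440.0, 0.5), (660.0, 0.25)]
tone = additive_synthesis(partials, duration_s=0.01)
```

Decomposition runs in the opposite direction: starting from `tone`, the task is to recover which components are present and when they are active.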

The method of non-negative matrix factorization allows us to obtain such a decomposition in the form of the multiplication of two matrices: one matrix represents a dictionary of the elementary components of the signal, and the other matrix represents the activation of the dictionary elements over time. When combined, these two matrices make it possible to describe the audio signal in mathematical form. “Non-negative” simply means that each element in these matrices is positive or zero, so that each source or component can only add to the signal, never subtract from it.
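
A minimal sketch of such a factorization is given below. It uses the classic multiplicative update rules for the squared (Frobenius) error, which are one standard way to compute a non-negative factorization and not necessarily the variant used in the work described here. The toy matrix, the function names and the parameters (`rank`, `n_iter`, `eps`, `seed`) are assumptions made for the example.

```python
import random


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Distance between v and its approximation w @ h."""
    approx = matmul(w, h)
    return sum((x - y) ** 2
               for v_row, a_row in zip(v, approx)
               for x, y in zip(v_row, a_row)) ** 0.5


def nmf(v, rank, n_iter=300, eps=1e-9, seed=0):
    """Factor a non-negative matrix v into dictionary w and activations h."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    # Strictly positive start so the multiplicative updates can move every entry.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
             for k in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n_rows)]
    return w, h


# Toy "spectrogram": 4 frequency bins x 5 frames built from 2 hidden components.
true_w = [[1.0, 0.0], [0.5, 0.0], [0.0, 1.0], [0.0, 0.5]]
true_h = [[1.0, 1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0, 0.5]]
v = matmul(true_w, true_h)

w, h = nmf(v, rank=2)
```

Each column of `w` plays the role of a dictionary element and each row of `h` says how strongly that element is active in each frame. Because the updates only multiply non-negative numbers, both matrices stay non-negative throughout.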

Why is this signal decomposition approach so interesting?

GR: This decomposition is very efficient for introducing initial knowledge into the decomposition. For example, if we know that there is a violin, we can introduce this knowledge into the dictionary by specifying that some of the elementary atoms of the signal will be characteristic of the violin. This makes it possible to refine the description of the rest of the signal. It is a clever description because it is simple in its approach and handling as well as being useful for working efficiently on the decomposed signal.

This non-negative matrix factorization method has led you to subjects other than speech synthesis. What are its applications?

GR: One of the major applications of this technique is source separation. One of our first approaches was to extract the singing voice from polyphonic music recordings. The principle consists in saying that, for a given source, all the elementary components are activated at the same time, such as all the harmonics of a note played by an instrument, for example. To simplify, we can say that non-negative matrix factorization allows us to isolate each note played by a given instrument by representing them as a sum of elementary components (certain columns of the “dictionary” matrix) which are activated over time (certain rows of the “activation” matrix). At the end of the process, we obtain a mathematical description in which each source has its own dictionary of elementary sound atoms. We can then replay only the sequence of notes played by a specific instrument by reconstructing the signal by multiplying the non-negative matrices and setting to zero all note activations that do not correspond to the instrument we want to isolate.
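
The last step, rebuilding one source by zeroing the other activations, can be sketched as follows. The dictionary and activation matrices are written by hand here rather than learned, and the assignment of atoms to a “voice” and an “accompaniment” is a hypothetical example.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def isolate_source(dictionary, activations, keep_atoms):
    """Rebuild the signal using only the atoms whose indices are in `keep_atoms`.

    All other rows of the activation matrix are set to zero before multiplying.
    """
    kept = [row if k in keep_atoms else [0.0] * len(row)
            for k, row in enumerate(activations)]
    return matmul(dictionary, kept)


# Hand-written factorization: 4 frequency bins x 3 atoms, 3 atoms x 3 frames.
dictionary = [[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [0.0, 0.0, 0.8]]
activations = [[1.0, 0.0, 0.0],   # atom 0, assumed to belong to the voice
               [0.0, 1.0, 0.0],   # atom 1, assumed to belong to the voice
               [1.0, 1.0, 1.0]]   # atom 2, assumed to be the accompaniment

mixture = matmul(dictionary, activations)
voice = isolate_source(dictionary, activations, keep_atoms={0, 1})
accompaniment = isolate_source(dictionary, activations, keep_atoms={2})
```

Since every atom belongs to exactly one source in this toy case, `voice` and `accompaniment` add up to `mixture` bin by bin.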

What new prospects can be considered thanks to the precision of this description?

GR: Today, we are working on “informed” source separation, which incorporates additional prior knowledge about the sources in the source separation process. I currently co-supervise a PhD student who is using knowledge of the lyrics to help isolate the singing voice. There are multiple applications: from automatic karaoke generation by removing the detected voice, to music and movie soundtrack remastering or transformation. I have another PhD student whose thesis is on isolating a singing voice using the simultaneously recorded electroencephalogram (EEG) signal. The idea is to ask a person to wear an EEG cap and focus their attention on one of the sound sources. We can then obtain information via the recorded brain activity and use it to improve the source separation.

Your work allows you to identify specific sound sources through audio signal processing… to the point of automatic recognition?

GR: We have indeed worked on automatic sound classification, first of all through tests on recognizing emotion, particularly fear or panic. The project was carried out with Thales to anticipate crowd movements. Besides detecting emotion, we wanted to measure the rise or fall in panic. However, there are very few sound datasets on this subject, which turned out to be a real challenge for this work. On another subject, we are currently working with Deezer on the automatic detection of content that is offensive or unsuitable for children, in order to propose a sort of parental filter service, for example. In another project on advertising videos with Creaminal, we are detecting key or culminating elements in terms of emotion in videos in order to automatically propose the most appropriate music at the right time.

On the subject of music, is your work used for automatic song detection, like the Shazam application?

GR: Shazam uses an algorithm based on a fingerprinting principle. When you activate it, the app records the audio fingerprint over a certain time. It then compares this fingerprint with the content of its database. Although very efficient, the system is limited to recognizing completely identical recordings. Our aim is to go further, by recognizing different versions of a song, such as live recordings or covers by other singers, when only the studio version is saved in the memory. We have filed a patent on a technology that allows us to go beyond the initial fingerprint algorithm, which is too limited for this kind of application. In particular, we are using a stage of automatic estimation of the harmonic content, or more precisely the sequences of musical chords. This patent is at the center of a start-up project.

Your research is closely linked to the industrial sector and has led to multiple technology transfers. But you have also made several freeware contributions for the wider community.

GR: One of the team’s biggest contributions in this field is the audio feature extraction software YAAFE. It’s one of my most cited articles and a tool that is regularly downloaded, despite the fact that it dates from 2010. In general, I am in favor of the reproducibility of research and I publish the algorithms of work carried out as often as possible. In any case, reproducibility is a major topic in the fields of AI and data science, and its importance is clearly growing along with these disciplines. We also make a point of publishing the databases created by our work. That is essential too, and it’s always satisfying to see that our databases have an important impact on the community.