Mon 12 May
Seminars and Conferences

A Multistream Multimodal Foundation Model for Real-Time Voice-Based Application

A new Ellis Turin Talk, organized in collaboration with the Artificial Intelligence Hub of the Politecnico di Torino and the Vandal Research Group of the Department of Control and Computer Engineering (DAUIN), titled 'A multistream multimodal foundation model for real-time voice-based application', will be held online on 12 May with speaker Patrick Pérez.

Abstract
A unique way for humans to seamlessly exchange information and emotion, speech should be a key means for us to communicate with and through machines. This is not yet the case. In an effort to progress toward this goal, we introduce a versatile speech-text decoder-only model that can serve a number of voice-based applications. It has in particular allowed us to build Moshi, the first-ever full-duplex spoken-dialogue system (with no latency and no imposed speaker turns), as well as Hibiki, the first simultaneous voice-to-voice translation model with voice preservation to run on a mobile phone. This multistream multimodal model can also be turned into a visual-speech model (VSM) via cross-attention with visual information, which allows Moshi to freely discuss an image while maintaining its natural conversation style and low latency. This talk will provide an illustrated tour of this research.

Speaker: Patrick Pérez

Biography
Patrick Pérez is CEO at Kyutai, a non-profit open-science AI lab based in Paris. Prior to this, Patrick was at Valeo as VP of AI and Scientific Director of valeo.ai (2018-2023), and was a research scientist with Technicolor (2009-2018), Inria (1993-2000, 2004-2009) and Microsoft Research Cambridge (2000-2004). His research interests lie in reliable multimodal AI for the benefit of all.

Moderators: Raffaello Camoriano and Gabriele Rosi of the Department of Control and Computer Engineering (DAUIN).

The event will be held on the Zoom platform at this link.