Meta AI researchers have achieved a breakthrough in generative AI for speech. They developed Voicebox, the first model that generalizes, with state-of-the-art performance, to speech-generation tasks it was not specifically trained to accomplish.
Like generative systems for images and text, Voicebox creates outputs in a wide variety of styles, either from scratch or by modifying a sample it’s given. But instead of a picture or a passage of text, Voicebox produces high-quality audio clips. The model can synthesize speech across six languages, and it can also perform noise removal, content editing, style conversion, and diverse sample generation.
Prior to Voicebox, generative AI for speech required specific training for each task using carefully prepared training data. Voicebox uses a new approach that learns from nothing more than raw audio and an accompanying transcription. Unlike autoregressive models for audio generation, which can only continue from the end of a clip, Voicebox can modify any part of a given sample.
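The difference between editing any part of a clip and only continuing from its end can be pictured as a mask over audio frames. The sketch below is an illustration under stated assumptions, not Voicebox's implementation: `span_mask`, `infill`, and `toy_generator` are hypothetical names, audio is represented as a plain array of frames, and the generator is a placeholder that stands in for a trained model.

```python
import numpy as np

def span_mask(num_frames, start, end):
    """Boolean mask that is True for the frames to be regenerated."""
    mask = np.zeros(num_frames, dtype=bool)
    mask[start:end] = True
    return mask

def infill(frames, mask, generate_fn):
    """Replace masked frames with generated ones and keep the rest untouched."""
    # The model only sees the unmasked frames as context (masked frames zeroed).
    context = np.where(mask[:, None], 0.0, frames)
    generated = generate_fn(context, mask)
    return np.where(mask[:, None], generated, frames)

def toy_generator(context, mask):
    # Hypothetical placeholder for a trained model: fills every frame with ones.
    return np.ones_like(context)

frames = np.zeros((10, 4))        # 10 frames, 4 features each (toy values)
mask = span_mask(10, 3, 6)        # edit a span in the middle of the clip
edited = infill(frames, mask, toy_generator)

# An autoregressive continuation corresponds to a mask that covers only a suffix.
suffix_only = span_mask(10, 6, 10)
```

In this picture, a mask in the middle of the clip corresponds to content editing or noise removal on a segment, while a suffix-only mask corresponds to the continuation that autoregressive models are limited to.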
Voicebox is based on a method called Flow Matching, which has been shown to improve upon diffusion models.
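For readers who want a feel for the idea, here is a minimal sketch of a flow matching training objective. It assumes the straight-line probability path between noise and data described in the Flow Matching literature. The function names, the `sigma_min` value, and the toy velocity model are illustrative assumptions and do not reflect Voicebox's actual architecture or settings.

```python
import numpy as np

def ot_path(x0, x1, t, sigma_min=1e-5):
    """Point at time t on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1

def target_velocity(x0, x1, sigma_min=1e-5):
    """Time derivative of ot_path; it does not depend on t for this path."""
    return x1 - (1.0 - sigma_min) * x0

def flow_matching_loss(velocity_model, x1, rng, sigma_min=1e-5):
    """Mean squared error between the predicted and the target velocity."""
    x0 = rng.standard_normal(x1.shape)          # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))      # one random time per example
    xt = ot_path(x0, x1, t, sigma_min)
    pred = velocity_model(xt, t)
    return float(np.mean((pred - target_velocity(x0, x1, sigma_min)) ** 2))

def toy_velocity_model(xt, t):
    # Hypothetical stand-in for a neural network: always predicts zero velocity.
    return np.zeros_like(xt)

rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 4))                # toy "data" batch
loss = flow_matching_loss(toy_velocity_model, x1, rng)
```

The model is trained to predict the velocity that carries a noise sample toward a data sample. At generation time, one would integrate the learned velocity field from noise to obtain a sample, a step this sketch leaves out.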