PHEME: Efficient and Conversational Speech Generation.
- Abstract. In recent years, speech generation has seen remarkable progress, now achieving one-shot generation capability that is often virtually indistinguishable from real human voice. Integrating such advancements in speech generation with large language models might revolutionize a wide range of applications. However, certain applications, such as assistive conversational systems, require natural and conversational speech generation that also operates efficiently in real time. Current state-of-the-art models like VALL-E and SoundStorm, powered by hierarchical neural audio codecs, require large neural components and extensive training data to work well. In contrast, MQTTS aims to build more compact conversational TTS models while capitalizing on smaller-scale real-life conversational speech data. However, its autoregressive nature yields high inference latency and thus limits its real-time usage. To mitigate the current limitations of state-of-the-art TTS models while capitalizing on their strengths, in this work we propose the PHEME model series that 1) offers compact yet high-performing models, 2) allows for parallel speech generation, 3) produces natural conversational speech, and 4) can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x while still matching the quality of autoregressive TTS models. We also show that simple teacher-student distillation on top of pretrained PHEME checkpoints yields significant improvements in voice quality for single-speaker setups, relying solely on synthetic speech generated by much larger teacher models.
- Code
- Demo
- Paper
GigaSpeech One-shot TTS Examples
- One-shot: the inference setup for voices unseen at training time, in which prompts and speaker embeddings are provided as additional model inputs.
Prompt audio | Reference audio | PHEME (100M) | PHEME (300M) no speaker embeddings | PHEME (300M) | Prompt text | Reference text |
---|---|---|---|---|---|---|
| | | | | | let’s just say in her own words, once i sat down and watched it i never moved, i was enthralled by it. | and she told me the next time she went back she would take me with her. and i waited, of course, like i said, thirteen years. |
| | | | | | in early twenty-twenty, blue apron put the word out that it was interested in possibly getting scooped up. maybe by a big grocery chain. or someone else with deep pockets who wanted to own a meal kit delivery business. | at the same time, garcia says, the company acted like it was in turnaround mode. it decided to streamline operations, including shutting down its fulfillment center in texas |
| | | | | | aside from influencing basically everyone who matters he was one of the first if not, in fact the first artist to bring an electric guitar player with him on to the grand oleopry stage. | if you want to call it a honky tonk, and it happened after ernest tubb. it was influenced by ernest tubb. before i get to the story and episode, i’d like to address one other thing. |
| | | | | | so it’s ah i think there’s a range of risks, but generally speaking ah there’s going to be a study increase in the floor of the skill level as these ah a i technologies diffuse. | that is, there will be more and more ah capabilities available to people at the bottom of the scale, that is individuals as well as people with more access to computing power, ah money, and data at the higher end. |
| | | | | | so after they put in their name, phone number, email address onto your landing page. where would you like to send them? would you like to send them to your facebook page your website? | book an appointment to a buyer on facebook messenger bot, a seller messenger bot. where would you like to send them? so for this example i’m just gonna say book an appointment. |
Artificial Voice TTS Examples
Prompt audio | Reference audio | PHEME (300M) no training on artificial voice | PHEME (300M) | Prompt text | Reference text |
---|---|---|---|---|---|
| | | | | Our garden terrace is a lovely spot for afternoon tea. | The city’s ghost walk is a spooky and fascinating evening adventure. |
| | | | | If you need a quiet place to work, our library is just perfect. | Our hotel’s evening bonfires are a great place to socialize. |
| | | | | There’s a delightful chocolate factory tour, great for families. | Our rooftop jazz nights feature some of the best local talent. |
| | | | | The rooftop bar hosts a live DJ on Friday nights. | Our in-house sommelier leads an exquisite wine and cheese pairing event. |
| | | | | The comedy club in town is known for its hilarious acts. | The annual food fair showcases the best of local cuisine. |
Inference speed with Triton-LLM, reported as real-time factors (RTFs, lower is better), for short and long sentences
Model | short | long | GPU |
---|---|---|---|
MQTTS (100M) | 1.930 | 1.842 | A100 |
PHEME-SMALL (100M) | 0.133 | 0.133 | A100 |
PHEME-LARGE (300M) | 0.143 | 0.143 | A100 |
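
The page does not define RTF, so the sketch below assumes the usual definition: the time spent synthesizing divided by the duration of the audio produced, with values below 1.0 meaning faster than real time. The numbers are copied from the table above; the function names (`real_time_factor`, `is_faster_than_real_time`, `speedup_over`) are illustrative helpers, not part of the PHEME codebase.

```python
# RTF values copied from the table above (all measured on an A100).
RTFS = {
    "MQTTS (100M)": {"short": 1.930, "long": 1.842},
    "PHEME-SMALL (100M)": {"short": 0.133, "long": 0.133},
    "PHEME-LARGE (300M)": {"short": 0.143, "long": 0.143},
}


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: synthesis time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def is_faster_than_real_time(rtf: float) -> bool:
    """An RTF below 1.0 means audio is generated faster than it plays back."""
    return rtf < 1.0


def speedup_over(baseline: str, model: str, length: str) -> float:
    """Ratio of the baseline RTF to the model RTF for 'short' or 'long' input."""
    return RTFS[baseline][length] / RTFS[model][length]


if __name__ == "__main__":
    for model, by_length in RTFS.items():
        for length, rtf in by_length.items():
            status = "faster" if is_faster_than_real_time(rtf) else "slower"
            print(f"{model:20s} {length:5s} RTF={rtf:.3f} ({status} than real time)")
    ratio = speedup_over("MQTTS (100M)", "PHEME-SMALL (100M)", "short")
    print(f"PHEME-SMALL vs MQTTS on short sentences: {ratio:.1f}x lower RTF")
```

Under this reading, both PHEME models run well below the real-time threshold on the listed hardware, while the autoregressive MQTTS baseline takes longer to synthesize an utterance than the utterance lasts.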