PHEME: Efficient and Conversational Speech Generation.

GigaSpeech One-shot¹ TTS Examples

  1. One-shot: the inference setup for voices unseen at training time, where prompts and speaker embeddings are provided as additional model inputs.
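Audio columns: Prompt audio, Reference audio, PHEME (100M), PHEME (300M) no speaker embeddings, PHEME (300M)
Text columns: Prompt text, Reference text (each row below gives the prompt text followed by the reference text)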
let’s just say in her own words, once i sat down and watched it i never moved, i was enthralled by it. and she told me the next time she went back she would take me with her. and i waited, of course, like i said, thirteen years.
in early twenty-twenty, blue apron put the word out that it was interested in possibly getting scooped up. maybe by a big grocery chain. or someone else with deep pockets who wanted to own a meal kit delivery business. at the same time, garcia says, the company acted like it was in turnaround mode. it decided to streamline operations, including shutting down its fulfillment center in texas
aside from influencing basically everyone who matters he was one of the first if not, in fact the first artist to bring an electric guitar player with him on to the grand oleopry stage. if you want to call it a honky tonk, and it happened after ernest tubb. it was influenced by ernest tubb. before i get to the story and episode, i’d like to address one other thing.
so it’s ah i think there’s a range of risks, but generally speaking ah there’s going to be a study increase in the floor of the skill level as these ah a i technologies diffuse. that is, there will be more and more ah capabilities available to people at the bottom of the scale, that is individuals as well as people with more access to computing power, ah money, and data at the higher end.
so after they put in their name, phone number, email address onto your landing page. where would you like to send them? would you like to send them to your facebook page your website? book an appointment to a buyer on facebook messenger bot, a seller messenger bot. where would you like to send them? so for this example i’m just gonna say book an appointment.

Artificial Voice TTS Examples

Audio columns: Prompt audio, Reference audio, PHEME (300M) no training on artificial voice, PHEME (300M)

Prompt text | Reference text
Our garden terrace is a lovely spot for afternoon tea. | The city’s ghost walk is a spooky and fascinating evening adventure.
If you need a quiet place to work, our library is just perfect. | Our hotel’s evening bonfires are a great place to socialize.
There’s a delightful chocolate factory tour, great for families. | Our rooftop jazz nights feature some of the best local talent.
The rooftop bar hosts a live DJ on Friday nights. | Our in-house sommelier leads an exquisite wine and cheese pairing event.
The comedy club in town is known for its hilarious acts. | The annual food fair showcases the best of local cuisine.

Inference speed with Triton-LLM (RTFs, lower is better) for short and long sentences

Model | RTF (short) | RTF (long) | GPU
MQTTS (100M) | 1.930 | 1.842 | A100
PHEME-SMALL (100M) | 0.133 | 0.133 | A100
PHEME-LARGE (300M) | 0.143 | 0.143 | A100
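
For readers unfamiliar with the metric, the sketch below shows how a real-time factor (RTF) is commonly computed: synthesis wall-clock time divided by the duration of the generated audio. An RTF below 1 means faster than real time. It assumes the page uses this common definition; the source does not spell it out. The names `real_time_factor`, `measure_rtf`, `speedup`, and the `synthesize` callable are hypothetical and not part of PHEME or Triton. Only the numbers in `RTF_TABLE` come from the table above.

```python
import time
from typing import Callable, Sequence

# RTF values copied from the table above (all measured on an A100).
RTF_TABLE = {
    "MQTTS (100M)": {"short": 1.930, "long": 1.842},
    "PHEME-SMALL (100M)": {"short": 0.133, "long": 0.133},
    "PHEME-LARGE (300M)": {"short": 0.143, "long": 0.143},
}


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Common RTF definition (an assumption here): processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(
    synthesize: Callable[[str], Sequence[float]],  # hypothetical TTS function
    text: str,
    sample_rate: int,
) -> float:
    """Time one call to `synthesize` and relate it to the length of its output."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


def speedup(baseline: str, model: str, length: str) -> float:
    """How many times faster `model` is than `baseline`, using the table's RTFs."""
    return RTF_TABLE[baseline][length] / RTF_TABLE[model][length]


if __name__ == "__main__":
    for length in ("short", "long"):
        ratio = speedup("MQTTS (100M)", "PHEME-SMALL (100M)", length)
        print(f"PHEME-SMALL vs MQTTS ({length}): {ratio:.1f}x faster")
```

Read this way, the table puts both PHEME models at roughly 13 to 14 times faster than MQTTS on the same GPU, and only the PHEME models run faster than real time.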