In 2022, we saw the emergence of Artificial Intelligence (AI) tools that can create images, artwork, or even video from a text prompt. These tools have caused a lot of excitement, as well as fear, about the future of AI in the creative field. OpenAI’s ChatGPT in particular caused a stir with its ability to generate human-like text.
As 2023 begins, another powerful use case for AI has come to the forefront: a text-to-speech tool that can closely mimic a person’s voice. Developed by Microsoft, VALL-E can take a three-second recording of someone’s voice and replicate it, turning written words into speech with realistic intonation and emotion.
According to a paper posted on arXiv, the preprint server hosted by Cornell University, the VALL-E team trained their model on 60,000 hours of English speech recordings from over 7,000 unique speakers. This allows the model to be used in a “zero-shot” setting, meaning it can synthesize speech in the voice of a speaker it never encountered during training, given only a short sample of that voice. The team says that their Text-to-Speech (TTS) system was trained on hundreds of times more data than existing TTS systems, which is what makes zero-shot synthesis possible.
Currently, VALL-E is not available for public use, but it raises many questions about safety, given that it could be used to generate any text in any voice. Its creators have, however, published a demo that pairs a number of three-second speaker prompts with the resulting synthesized speech, in which the voice is closely mimicked. In the demo, you can compare each result with the “ground truth” (the actual speaker reading the same text) and with the “baseline” result from current TTS technology.
Microsoft has invested heavily in AI and is one of the backers of OpenAI, the company behind ChatGPT and the text-to-image tool DALL-E. In 2019, Microsoft invested $1 billion in OpenAI, and according to a recent report, it is looking to invest another $10 billion in the company.
Overall, VALL-E is a powerful tool that showcases the potential of AI in the field of text-to-speech. While it is not currently available for public use, it raises important questions about safety and the implications of such technology in the future. As with any new technology, it is important to consider the potential risks as well as the benefits and to ensure that appropriate measures are in place to protect against misuse.