Meta Platforms, the company behind Facebook and Instagram, recently took a significant step in artificial intelligence with the launch of Audiobox, an advanced tool for voice cloning and sound-effect generation.
I have closely followed this innovation, and I would like to share with you my analysis and perspective on this tool that is beginning to attract attention in the technology sector.
Voice cloning is not a new concept in the technological world. Companies like ElevenLabs have already explored this field. However, what makes Audiobox stand out is its ability to accurately replicate the unique characteristics of the human voice: pitch, timbre, rhythms and particular pronunciations. This opens up a range of possibilities, from the creation of personalized content to more advanced artificial intelligence applications.
As soon as you enter audiobox.metademolab.com, you find this:
Clicking on Try demos takes you to everything it can do:
The first option asks me to read a text aloud, in English, so that it can learn what my voice sounds like. Once it has recognized the voice, you can generate any phrase with it. You can either record the text in your own voice or use one of the predefined voices, but to clone a voice you have to select the first option.
Once it knows your voice, you tell it to read the phrase you see in the image above, and this is the result:
You can also ask it to speak a phrase from text while describing an effect, such as “do it as if you were in a cathedral” or, as I put it, “do it as if you were inside the bathroom.”
What I find really interesting is the underlying technology of Audiobox. It uses self-supervised learning (SSL), a machine learning approach in which the training labels are derived from the unlabeled data itself rather than supplied by human annotators. This is crucial because it lets the model train on a much larger and more varied body of data, something essential in a field as diverse as audio.
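To make the idea of a model generating its own labels more concrete, here is a minimal sketch of how a single unlabeled sequence can be turned into a training pair by hiding part of it. This is only an illustration of the general principle, not Meta's code: the function name make_infilling_example, the span_len parameter, and the toy list of numbers standing in for audio frames are all my own assumptions.

```python
import random

def make_infilling_example(frames, span_len=3, seed=0):
    """Turn one unlabeled sequence into an (input, targets) pair.

    A contiguous span is hidden from the input; the hidden values
    become the labels, so no human annotation is needed.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Pick where the hidden span starts, keeping it inside the sequence.
    start = rng.randrange(0, len(frames) - span_len + 1)
    masked_positions = range(start, start + span_len)

    model_input = list(frames)
    for i in masked_positions:
        model_input[i] = None  # None marks a frame the model cannot see

    # The "labels" are simply the original values at the hidden positions.
    targets = {i: frames[i] for i in masked_positions}
    return model_input, targets

# Toy stand-in for audio frames (real systems use spectrogram-like features).
frames = [0.1, 0.4, 0.35, 0.8, 0.6, 0.2, 0.05, 0.9, 0.7, 0.3]
model_input, targets = make_infilling_example(frames)
print(model_input)
print(targets)
```

A model trained this way learns to predict the hidden frames from the visible ones, which is why any recording, labeled or not, can serve as training material.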
To train Audiobox, Meta has used a vast collection of data: 160,000 hours of speech, 20,000 hours of music and 6,000 hours of sound samples. This data set is notable not only for its size but also for its diversity, including voices and sounds from more than 150 countries and in more than 200 languages.
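To put those figures in proportion, a quick calculation using only the three numbers quoted above shows how heavily the training data leans toward speech. The variable names are mine, and the percentages are rounded to one decimal place.

```python
# Hours of training audio per category, as quoted in this article.
hours = {"speech": 160_000, "music": 20_000, "sound samples": 6_000}

total = sum(hours.values())

# Share of each category as a percentage of the total, rounded to 0.1.
shares = {name: round(100 * h / total, 1) for name, h in hours.items()}

print(total)   # 186000
print(shares)  # {'speech': 86.0, 'music': 10.8, 'sound samples': 3.2}
```

In other words, roughly 86% of the 186,000 hours is speech, with music and general sound samples making up the remaining 14%.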
As you can see, Meta has made interactive demos available to the public so that users can try the technology for themselves. This is an excellent opportunity to better understand the potential of Audiobox, allowing users to clone their own voice or generate new sounds from textual descriptions.
It is important to mention that, despite its potential, Audiobox comes with certain restrictions. At this time, its use is limited to non-commercial purposes and is restricted in certain US states due to local laws. Additionally, unlike other Meta projects, Audiobox is not open source, raising questions about its future accessibility and control over the technology.
This Meta project raises important questions about ethics in the use of AI, especially regarding data provenance and copyright. The ability to replicate human voices with such precision carries with it great responsibility and legal challenges that have yet to be fully explored.