Appsfactory, RTL and Microsoft create text-to-speech services for automated audio content
Cologne, 20 April 2022 – Together with Appsfactory, RTL Deutschland is paving the way for the next milestone in digital, automated content generation in the media sector. Among other things, the partners are advancing in-app voice synthesis services, known as “text-to-speech services”. These use artificial intelligence to generate natural-sounding speech modelled on human voices. The technology makes it possible to create extensive audio content without having to record well-known RTL speakers in a sound studio each time. This accelerates production considerably, so that breaking news, for example, can be published far more quickly.
The project is part of a comprehensive partnership between Bertelsmann and Microsoft for technological innovations in the fields of media and education. It was developed in cooperation with Appsfactory and is funded by the Journalism Lab of the Media Authority of North Rhine-Westphalia (Landesanstalt für Medien NRW).
The challenge of making a program sound like a human being
The project team’s first objective was to synthesize the voices of RTL presenter Maik Meuser and podcast speaker Inken Wriedt. This required four hours of spoken text from each of the two voice talents. The result is natural-sounding synthetic versions that are almost indistinguishable from the original voices in terms of intonation, expression and speed. This means that users’ favourite voices can be employed in many areas.
Key technology that changes the way newsrooms work
“Audio in all its forms is the current driver of innovation in the media industry. Thanks to the ability to synthesize voices, the field of content creation can be improved sustainably and made more efficient,” explains Jan Nowak, VP Media & AI at Appsfactory.
Well-known voices are an enormous success factor in the marketing of audio content. However, classic voice-overs become impractical or unprofitable when large amounts of content are involved. This is where the technology comes in, creating significant economic leverage.
“In addition to the quality of the final result, it was particularly important for us as Appsfactory experts to design the application and the associated text-to-speech applications with the users’ needs firmly in mind. We are convinced that today’s developments will improve the media sector and the way the world’s newsrooms work in the long term,” Nowak continues.
Isabella Thissen, Senior Vice President Editorial Products & Innovation at RTL Germany, says: “Appsfactory’s user-friendly application enables us to convert texts into audio files very swiftly. The features including voice selection, text editing and speech rate adjustment are intuitive in terms of usability and provide an essential support for our products such as the read-aloud function for the text-based online news offerings of RTL and ntv.”
The pilot project follows strict ethical guidelines, as laid down in the Microsoft guidelines for the responsible use of artificial intelligence. Accordingly, synthetic voices will be clearly identified as such to users in future applications. Specific content-related guidelines also apply: all content that could exert undue influence on the shaping of user opinions is excluded from AI voice synthesis. The project team maintains a dialogue with the Media Authority of North Rhine-Westphalia on these questions.