How is AI Revolutionizing Audiobook Production? Creating Thousands of High-Quality Audiobooks from E-books with Neural Text-to-Speech Technology


Nowadays, many people listen to audiobooks instead of reading books or consuming other media. Audiobooks not only let existing readers enjoy content while on the road, but they can also make that content accessible to groups such as children, the visually impaired, and anyone learning a new language. Traditional production methods, whether professional human narration or volunteer-driven initiatives like LibriVox, take time and money and can yield inconsistent recording quality. Because of these constraints, keeping pace with the growing number of published books is difficult.

Automatic audiobook creation, meanwhile, has historically suffered from the robotic quality of text-to-speech systems and from the difficulty of deciding which text should not be read aloud (such as tables of contents, page numbers, figures, and footnotes). The researchers address these problems with a method for creating high-quality audiobooks from large online e-book collections. Their approach combines recent advances in neural text-to-speech, expressive reading, scalable computation, and automatic detection of relevant content to produce thousands of natural-sounding audiobooks.

They contribute over 5,000 audiobooks, totaling more than 35,000 hours of speech, to the open-source community. They also provide demonstration software that lets conference participants create an audiobook of any book in the library, read aloud in their own voice, from only a brief sample of recorded speech. The work introduces a scalable method for converting HTML-based e-books into high-quality audiobooks. Their pipeline is built on SynapseML, a scalable machine learning platform that enables distributed orchestration of the entire audiobook generation process. The pipeline starts from thousands of free e-books provided by Project Gutenberg. They focus on the HTML format of these e-books because, of all the available formats, it lends itself best to automated parsing.
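To illustrate the kind of parsing such a pipeline depends on, here is a minimal sketch, using only Python's standard library, of extracting readable prose from an HTML e-book while skipping elements that should not be narrated. The tag and class names in `SKIP_TAGS` and `SKIP_CLASSES` are illustrative assumptions, not the paper's actual rule set.

```python
from html.parser import HTMLParser

# Assumed markers for non-narratable content; the real normalizer's
# rules are not published, so these sets are illustrative only.
SKIP_TAGS = {"table", "script", "style", "sup"}   # e.g. footnote markers
SKIP_CLASSES = {"toc", "pagenum"}                 # assumed class names
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class ProseExtractor(HTMLParser):
    """Collect text that lies outside of non-narratable elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_stack = []  # one flag per open element: True if skipped

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:  # void elements never get a matching end tag
            return
        classes = set((dict(attrs).get("class") or "").split())
        self.skip_stack.append(tag in SKIP_TAGS or bool(classes & SKIP_CLASSES))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_stack:
            self.skip_stack.pop()

    def handle_data(self, data):
        # Keep text only when no enclosing element is being skipped.
        if not any(self.skip_stack) and data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return " ".join(self.chunks)
```

This sketch assumes well-balanced tags; a production normalizer would also have to cope with the malformed HTML common in older Gutenberg files.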

As a result, they were able to organize and visualize the entire collection of Project Gutenberg HTML pages and identify many sizable groups of similarly structured files. Using these collections of HTML files, they built a rule-based HTML normalizer that transforms the largest classes of e-books into a standard format that can be processed automatically. This approach yielded a system that can quickly and deterministically parse a huge number of books and, most importantly, let them concentrate on the files that would produce high-quality recordings when read aloud.
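The idea of grouping similarly structured files can be approximated, in a far simpler form than the learned representations clustered in Figure 1, by reducing each file to its sequence of tag names so that files produced from the same template collapse to one signature. The sketch below is a hedged stand-in, not the paper's method.

```python
import re
from collections import defaultdict


def structure_signature(html: str) -> tuple:
    """Reduce an HTML file to its sequence of tag names (content stripped),
    so files generated from the same template share one signature."""
    return tuple(re.findall(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)", html))


def group_by_structure(files: dict) -> dict:
    """Map each distinct structure signature to the filenames sharing it."""
    groups = defaultdict(list)
    for name, html in files.items():
        groups[structure_signature(html)].append(name)
    return dict(groups)
```

Exact-signature matching only finds identical templates; the paper's clustering also captures files that are merely similar, which is why a learned embedding plus t-SNE is the stronger tool there.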

Figure 1: t-SNE clustering of e-book representations. Colored regions mark clusters of books sharing the same format.

The results of this clustering approach are shown in Figure 1, which illustrates how distinct groups of similarly structured e-books naturally emerge within the Project Gutenberg collection. After processing, a plain-text stream can be extracted and fed into text-to-speech algorithms. Different audiobooks call for different reading styles: a clear, neutral voice suits nonfiction, while fiction with dialogue benefits from expressive reading and a bit of "acting." For most books, the authors use a clear, neutral neural text-to-speech voice; in their live demonstration, however, users can change the voice, speed, pitch, and intonation of the reading.

They use zero-shot text-to-speech techniques to efficiently transfer voice characteristics from a small number of enrollment recordings, allowing them to replicate a user's voice. A user can thus quickly produce an audiobook in their own voice from only a small amount of captured audio. To give the reading an emotional quality, they employ an automatic speaker and emotion inference system that dynamically adjusts the reading voice and tone based on context. This makes passages with multiple characters and dynamic interaction more lifelike and engaging.
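The paper does not publish its zero-shot enrollment recipe; a common scheme in zero-shot TTS is to average per-frame vectors from a pretrained speaker encoder into one fixed-size voice embedding that conditions the synthesizer. The sketch below assumes the frame vectors already exist and shows only that pooling step.

```python
def enrollment_embedding(frame_embeddings):
    """Average per-frame speaker-encoder vectors from a short enrollment
    clip into one voice vector, then unit-normalize it so voices can be
    compared by cosine similarity. The frame vectors themselves would
    come from a pretrained speaker encoder (assumed, not shown)."""
    dim = len(frame_embeddings[0])
    n = len(frame_embeddings)
    mean = [sum(f[i] for f in frame_embeddings) / n for i in range(dim)]
    norm = sum(x * x for x in mean) ** 0.5
    return [x / norm for x in mean]
```

The resulting vector would be passed to the TTS model as a conditioning input in place of a learned speaker ID, which is what makes the setup "zero-shot": no retraining per user.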

To do this, they first segment the text into narration and dialogue, assigning a distinct speaker to each line of dialogue. Next, they predict the emotional tone of each dialogue turn in a self-supervised manner. Finally, they use a multi-style, context-aware neural text-to-speech model to assign distinct voices and emotions to the narrator and to the characters' dialogue. The researchers believe this approach could significantly increase the supply and accessibility of audiobooks.
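The first step, splitting text into narration and dialogue, can be sketched with a simple quote-matching pass; the paper's actual segmentation is learned, so treat this regex version as a hedged stand-in for straight double-quoted text only.

```python
import re


def segment(text: str):
    """Split a passage into (kind, span) pairs: 'dialogue' for
    double-quoted spans, 'narration' for everything between them."""
    parts = []
    last = 0
    for m in re.finditer(r'"[^"]*"', text):
        narration = text[last:m.start()].strip()
        if narration:
            parts.append(("narration", narration))
        parts.append(("dialogue", m.group(0).strip('"')))
        last = m.end()
    tail = text[last:].strip()
    if tail:
        parts.append(("narration", tail))
    return parts
```

Downstream, each `dialogue` span would be routed to a speaker- and emotion-conditioned voice, while `narration` spans keep the narrator's neutral voice.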


Check out the Paper. All credit for this research goes to the researchers on this project. Also, don't forget to join our 30k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter.


Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.


