Building a Multilingual Single-Speaker Dataset via Cross-Lingual Voice Cloning from LJSpeech
Abstract
Speech synthesis models transform written text into natural-sounding speech. However, even multilingual systems often produce a different voice for each language, because robust cross-lingual datasets and benchmarks are lacking. In this work, we introduce the MLJSpeech corpus, a multilingual dataset created by machine-translating the widely used LJSpeech dataset into multiple languages and synthesizing the translations with voice cloning. To evaluate the quality of MLJSpeech, we conducted a Mean Opinion Score (MOS) assessment, which showed high perceptual quality across all target languages. The original LJSpeech received a MOS of 4.7 ± 0.65, and our synthesized dataset scored comparably, for example 4.41 ± 0.80 for French and 4.43 ± 0.75 for Italian. MLJSpeech represents a significant step toward advancing cross-lingual TTS systems and fostering inclusivity in multilingual speech synthesis research.
Table 1: Evaluation of correctness, coherence, and quality of synthesized and original en-US audio.

Figure 1: Evaluation of translated texts.

Original ID | Text | en-US | de-DE | es-ES | fr-FR | it-IT | nl-NL | pl-PL
LJ016-0341 | The sufferer was stolid and reticent to the last.
LJ016-0398 | As a special favor.
LJ017-0210 | The first case was that of the 'Flowery Land'.
LJ016-0335 | There was no interference with the crowd, which collected as usual, although not to the customary extent.
LJ027-0048 | On the other hand are found structures which are perfectly homologous and yet in no way analogous.
LJ028-0218 | Without a skirmish or a battle, he permitted them to enter Babylon, and, sparing the city, he delivered the King Nabonidus to him.
LJ042-0158 | Left a far mean streak of independence brought on by neglect.
LJ030-0141 | A subsequent bullet, which was lethal, shattered the right side of his skull.
LJSpeech is a widely used dataset in the Text-to-Speech (TTS) domain. It comprises approximately 24 hours of recordings from a single speaker reading passages from English nonfiction books. The audio was originally recorded by Linda Johnson as part of the LibriVox project. The corresponding texts were published between 1884 and 1964 and were aligned with the audio by Keith Ito. Both the audio and the texts have been released into the public domain. Since its release, LJSpeech has been widely used to demonstrate advances in TTS systems. Its high recording quality and clean alignment have made it a standard benchmark for training and evaluating neural TTS models.
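As a minimal sketch of how the aligned text can be accessed, the snippet below parses lines in the pipe-delimited layout used by the public LJSpeech release (`metadata.csv`, with a clip ID, a transcription, and a normalized transcription per line). That layout comes from the LJSpeech distribution rather than from this paper, and the names `Utterance` and `parse_metadata` are hypothetical helpers introduced only for illustration.

```python
import csv
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Utterance:
    clip_id: str     # e.g. "LJ016-0398"; audio is assumed to live at wavs/<clip_id>.wav
    text: str        # transcription as read by the speaker
    normalized: str  # transcription with numbers and abbreviations spelled out


def parse_metadata(lines: Iterable[str]) -> List[Utterance]:
    """Parse LJSpeech-style lines of the form 'id|text|normalized text'."""
    # QUOTE_NONE keeps quotation marks inside transcriptions as literal characters.
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    utterances = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        clip_id, text = row[0], row[1]
        # Fall back to the raw text when the normalized field is absent.
        normalized = row[2] if len(row) > 2 else text
        utterances.append(Utterance(clip_id, text, normalized))
    return utterances


# Illustrative input; the normalized field is assumed identical to the raw text here.
sample = [
    "LJ016-0398|As a special favor.|As a special favor.",
    "LJ017-0210|The first case was that of the 'Flowery Land'.|The first case was that of the 'Flowery Land'.",
]
for utt in parse_metadata(sample):
    print(utt.clip_id, "->", utt.text)
```

With a local copy of the dataset, the same function could be applied to an open file handle, for example `parse_metadata(open("metadata.csv", encoding="utf-8"))`, where the path is an assumption about where the archive was extracted.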