During my internship at Neosapience, one of the leading speech synthesis startups in Korea, I was assigned to develop a singing voice synthesis model.
In this project, I came up with the idea of adapting an existing end-to-end TTS architecture (Tacotron) into a singing voice synthesis system.
I took this approach because there was not enough singing voice data (especially Korean singing voice data) when I started the project, while a relatively large amount of speech data was accessible. I hoped to extend a speech synthesis model that could already produce correct pronunciation and vocal spectral characteristics so that it could also sing, which requires detailed control of both the pitch and the duration of each phoneme or syllable.
Ideally, singing voice synthesis should cover a huge variety of musical expressions. However, until I could get access to a singing voice dataset of a proper size, I decided to experiment with how the basic functionality of singing (manipulating the pitch and duration of the lyrics) could be developed.
Tacotron is a sophisticated architecture that learns to generate spectrogram-level speech data directly from grapheme-level text input. The traditional TTS pipeline, in contrast, is composed of a separate linguistic model (a grapheme-to-phoneme translator), an acoustic model (a phoneme-level duration and pitch estimator), and an audio synthesis module, arranged in a hierarchical probabilistic architecture. Recent deep learning techniques have also been applied to each stage of this pipeline; however, to my knowledge, Tacotron was the first end-to-end generative text-to-speech model that combines all of the stages and synthesizes speech directly from grapheme-level characters.
The Tacotron architecture is composed of three main components: a text encoder, a spectrogram decoder, and an attention module that bridges the two. During training, grapheme-level textual information is encoded into a sequence of embeddings, and frame-by-frame spectrogram data is generated autoregressively, referencing the proper part of the textual embedding sequence with the help of the attention module. The naturalness of pronunciation, pitch, and duration depends on how accurately the decoder locates the region of the text embedding sequence to render as each spectrogram frame.
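To make the encoder/attention/decoder loop concrete, here is a heavily simplified PyTorch sketch of one decoder step. All module sizes and names are illustrative only (the real Tacotron uses a CBHG encoder, prenets, and location-sensitive details omitted here); this just shows how a decoder state queries the text embeddings through attention and emits one spectrogram frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTacotron(nn.Module):
    """Toy sketch: text embeddings, dot-product attention,
    autoregressive mel-frame decoder. Sizes are illustrative."""
    def __init__(self, n_chars=80, dim=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_chars, dim)  # stands in for the full text encoder
        self.rnn = nn.GRUCell(n_mels, dim)       # decoder state from the previous frame
        self.out = nn.Linear(2 * dim, n_mels)    # projects [state; context] to a mel frame

    def decode_step(self, memory, prev_frame, state):
        state = self.rnn(prev_frame, state)                        # (B, dim)
        scores = torch.bmm(memory, state.unsqueeze(2)).squeeze(2)  # (B, T_text)
        align = F.softmax(scores, dim=1)                           # attention weights
        context = torch.bmm(align.unsqueeze(1), memory).squeeze(1) # attended text info
        frame = self.out(torch.cat([state, context], dim=1))       # next spectrogram frame
        return frame, state, align

model = ToyTacotron()
chars = torch.randint(0, 80, (2, 12))  # batch of 2 utterances, 12 characters
memory = model.embed(chars)            # encoder output sequence
frame = torch.zeros(2, 80)             # all-zero GO frame
state = torch.zeros(2, 128)
frame, state, align = model.decode_step(memory, frame, state)
```

At generation time this step is repeated, feeding each predicted frame back in; the `align` weights are what the later duration and masking experiments operate on.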
In recent research from Google and Baidu, extended versions of Tacotron have been explored that enable global conditioning of the model on desired prosody characteristics. However, my objective was more direct and simple: to manually control the pitch and duration of speech while maintaining natural pronunciation.
Pitch conditioned decoder module
To implement pitch controllability first, I came up with the idea of retraining the model conditioned on pitch sequences estimated from the training data itself. I used the DIO algorithm for pitch extraction because it was fast, with only a tolerable amount of inaccuracy.
With a supplementary CNN module, I extracted pitch embeddings along the frame-wise decoding process. By concatenating the pitch embedding to the middle layer of the decoder (right after it is referenced by the attention module), the model was trained to follow the pitch condition when generating spectrogram frames.
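A minimal sketch of this conditioning, under my own assumptions: a small 1-D CNN turns the frame-level F0 contour (e.g. extracted with DIO) into per-frame pitch embeddings, and at each decoder step the current frame's pitch embedding is concatenated with the attention context. The layer sizes and names here are placeholders, not the exact modules used in the project.

```python
import torch
import torch.nn as nn

class PitchEncoder(nn.Module):
    """Illustrative 1-D CNN over a frame-level F0 contour,
    producing one pitch embedding per decoder frame."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, f0):              # f0: (B, T_frames) in Hz, 0 = unvoiced
        h = self.conv(f0.unsqueeze(1))  # (B, dim, T_frames)
        return h.transpose(1, 2)        # (B, T_frames, dim)

pitch_enc = PitchEncoder()
f0 = torch.abs(torch.randn(2, 100)) * 100  # fake F0 contour for the sketch
pitch_emb = pitch_enc(f0)                  # (2, 100, 64)

# At decoder frame t, the pitch embedding joins the attention context
# before the layers that predict the spectrogram frame:
context_t = torch.randn(2, 256)            # pretend attention context at frame t
decoder_in = torch.cat([context_t, pitch_emb[:, 0]], dim=1)  # (2, 320)
```

Because the pitch embedding enters after the attention lookup, it can steer the spectral content of each frame without disturbing which characters the decoder attends to.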
I also experimented with conditioning the attention module itself, hoping that the duration of each character could be inferred from the pitch sequence. That is, if pitch curves show characteristic behavior at syllable boundaries, the attention module could learn to estimate the proper duration (how long to stay on one syllable) from the pitch information. However, this did not work out.
Duration conditioning attempts
Again, due to the lack of singing voice data, I did not have any paired set of voice data and syllable-level duration information. To condition the model the way I did with pitch information, I had to obtain the duration of each syllable (a certain chunk of grapheme-level characters) in the speech dataset.
To tackle this problem, I first considered using the Kaldi framework, a widely used speech recognition toolkit. However, even after estimating the duration of each phoneme, I found it too complicated to derive syllable-level durations from the phoneme-level duration information.
Therefore, I switched to using a pre-trained vanilla Tacotron model. After training the model sufficiently (to the point where it generated natural-sounding speech), I extracted the attention matrices of all the training data. Since each row vector represents the period during which the decoder's attention to the corresponding character is activated, I could get a rough duration for each character. With a character-wise sequence, it was much easier to assemble the syllable-level chunks.
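The extraction step above can be sketched in a few lines of NumPy. This assumes a (characters × frames) attention matrix and a 12.5 ms frame hop, and the syllable grouping is passed in as character index ranges; these inputs are illustrative, not the project's exact format.

```python
import numpy as np

def char_durations(attn, hop_s=0.0125):
    """attn: (T_text, T_frames) attention matrix from a trained
    vanilla Tacotron. Assign each decoder frame to its most-attended
    character; counting frames gives a rough per-character duration."""
    owner = attn.argmax(axis=0)                       # char index per frame
    frames = np.bincount(owner, minlength=attn.shape[0])
    return frames * hop_s                             # seconds per character

def syllable_durations(char_dur, syllable_slices):
    """Sum character durations over syllable chunks,
    e.g. syllable_slices = [(0, 2), (2, 4)]."""
    return [float(char_dur[a:b].sum()) for a, b in syllable_slices]

# Toy monotonic attention: 4 characters over 10 frames.
attn = np.zeros((4, 10))
for t in range(10):
    attn[min(t // 3, 3), t] = 1.0
dur = char_durations(attn)                     # per-character seconds
syl = syllable_durations(dur, [(0, 2), (2, 4)])
```

The argmax-per-frame heuristic is what makes these durations "rough": soft or briefly wandering attention gets snapped to a single character per frame.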
Similarly to the pitch conditioning process, I conditioned the text encoder on the duration information. With a simple attempt of feeding in the raw duration value itself, however, the model could not learn to make the attention activation conditional on it.
As I was running out of internship days, I set this approach aside for future attempts.
Forced masking on the attention module to manipulate duration
To directly force the duration of each syllable, I decided to manually manipulate the attention module. I first trained a pitch-conditioned Tacotron model. Then, in the generation phase, I masked the attention module with the given syllable-wise duration values. This masking forces the attention module to activate only on the characters (character embeddings, to be precise) within the current syllable. As a result, we can get generation output with roughly the desired durations.
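A minimal sketch of such a mask, under my own assumptions about the inputs: per-syllable frame counts plus the character range of each syllable yield a (frames × characters) boolean mask, and at each decoder step the disallowed characters are set to -inf before the softmax so they receive zero attention weight. The function names and input format are illustrative, not the project's exact code.

```python
import torch

def duration_mask(durations, n_frames, char_spans):
    """durations: frame count per syllable; char_spans: (start, end)
    character indices per syllable. Returns a (n_frames, n_text)
    boolean mask allowing attention only to the scheduled syllable."""
    n_text = char_spans[-1][1]
    mask = torch.zeros(n_frames, n_text, dtype=torch.bool)
    t = 0
    for frames, (a, b) in zip(durations, char_spans):
        mask[t:t + frames, a:b] = True
        t += frames
    return mask

def masked_attention(scores, mask_row):
    """Apply the mask before softmax so disallowed characters
    get zero attention weight."""
    scores = scores.masked_fill(~mask_row, float('-inf'))
    return torch.softmax(scores, dim=-1)

# Two syllables: chars 0-1 for 3 frames, then chars 2-3 for 5 frames.
mask = duration_mask([3, 5], n_frames=8, char_spans=[(0, 2), (2, 4)])
scores = torch.randn(4)                 # raw attention scores at frame 0
w = masked_attention(scores, mask[0])   # only chars 0-1 can be attended
```

Within the allowed span the attention weights are still computed by the trained model, which is why pronunciation stays natural while only the timing is overridden.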
I also hard-coded some more detail into this forced masking, dividing the syllable-level duration into three parts (onset/nucleus/coda) or two (onset-nucleus/coda).
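For the three-way case, the split might look like the following sketch. The 20/60/20 ratios here are placeholders of my own; the post does not state the values actually hard-coded in the project.

```python
def split_syllable(frames, ratios=(0.2, 0.6, 0.2)):
    """Divide one syllable's frame budget into onset/nucleus/coda
    sub-spans. The ratios are illustrative placeholders."""
    onset = round(frames * ratios[0])
    coda = round(frames * ratios[2])
    nucleus = frames - onset - coda  # nucleus absorbs rounding leftovers
    return onset, nucleus, coda

parts = split_syllable(10)  # (2, 6, 2)
```

Each sub-span then gets its own slice of the attention mask, so the onset characters are attended first, the vowel holds the nucleus, and the coda closes the syllable.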
Here are some examples of singing voice synthesized with the model.
Interim summary and ongoing work
Well, I can't say the outcome is satisfactory yet, but I learned a lot from all of the attempts I made.
For future attempts, I'd like to try different auxiliary modules or data-feeding policies for the duration values obtained from the pre-trained model's attention matrices.
Or, once I get access to a sufficient amount of singing voice data, I could completely redesign the architecture from the ground up to model this particular task more efficiently.