Singing Voice Synthesis based on Tacotron Architecture

During my internship at Neosapience, one of the leading tech startups in the speech synthesis field in Korea, I was assigned to develop a singing voice synthesis model.

In this project, I came up with the idea of adapting an existing end-to-end TTS architecture (Tacotron) and turning it into a singing voice synthesis system.

I took this approach because there was not enough singing voice data (especially Korean singing voice data) when I first started the project, while a relatively large amount of speech data was accessible. I hoped to extend a speech synthesis model that can already produce correct pronunciation and vocal spectral characteristics so that it could also sing, which requires detailed control of both the pitch and the duration of each phoneme or syllable.

Ideally, singing voice synthesis should cover a huge variety of musical expressions. However, until I could get access to a singing voice dataset of adequate size, I decided to experiment with how the basic functionality of singing (manipulating the pitch and duration of the lyrics) could be developed.

Tacotron architecture

Tacotron is a sophisticated architecture that learns to generate spectrogram-level speech data directly from grapheme-level text input. The traditional TTS pipeline, in contrast, is composed of a separate linguistic model (grapheme-to-phoneme translator), an acoustic model (phoneme-level duration and pitch estimator), and an audio synthesis module, using a hierarchical probabilistic architecture. Recent deep learning techniques have also been adapted to each stage of this pipeline. However, to my knowledge, Tacotron was the first attempt at an end-to-end generative text-to-speech model that combines all of the stages and synthesizes speech directly from grapheme-level characters.

<An example of a TTS architecture based on the traditional pipeline, with deep learning approaches adapted to it>

The Tacotron architecture is composed of three main components: a text encoder, a spectrogram decoder, and an attention module that bridges the two. During training, grapheme-level textual information is encoded into a sequence of embeddings, and spectrogram frames are generated auto-regressively, one at a time, each referencing the proper part of the text embedding sequence with the help of the attention module. The naturalness of pronunciation, pitch, and duration depends on how accurately the decoder locates the part of the text embedding sequence that each spectrogram frame should be generated from.
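To make the role of the attention module concrete, here is a minimal sketch of a single decoder step. It is a simplification under stated assumptions: it uses plain dot-product scoring rather than the learned additive attention that Tacotron actually uses, and the function names (`softmax`, `attend`) and toy vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_outputs):
    """One attention step: score every text embedding against the decoder
    query, normalize the scores, and return (weights, context vector).
    Dot-product scoring is an assumption made for brevity."""
    scores = [sum(q * k for q, k in zip(query, enc)) for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [sum(w * enc[d] for w, enc in zip(weights, encoder_outputs))
               for d in range(dim)]
    return weights, context

# Three toy character embeddings and a decoder query closest to the second one.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [0.0, 4.0]
weights, context = attend(query, encoder_outputs)
```

The weights form one column of the alignment between text and audio: the decoder frame produced at this step is mostly driven by the second character, because its embedding matches the query best.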

In recent research from Google and Baidu, extended versions of Tacotron have been explored that enable global conditioning of the Tacotron model on desired prosody characteristics. However, my objective was more direct and simple: to manually control the pitch and duration of speech while maintaining natural pronunciation.

Pitch conditioned decoder module

To implement pitch controllability first, I came up with the idea of retraining the model conditioned on pitch sequences estimated from the training data itself. I used the DIO algorithm for pitch extraction because it was fast and its inaccuracy was tolerable.

With a supplementary CNN module, I extracted pitch embeddings along the frame-wise decoding process. By concatenating the pitch embedding to the middle layer of the decoder (right after the attention module is consulted), the model was finally trained to follow the pitch condition when generating spectrogram frames.

I also experimented with conditioning the attention module, hoping that the duration of each character could be inferred from the pitch sequence as well. That is, if pitch curves show certain characteristics or behaviors for separate syllables, the attention module could learn to estimate the proper duration (how long to stay on one syllable) from the pitch information. However, it did not work out.
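The frame-wise conditioning can be sketched roughly as follows. This is only an illustration, not the actual module: the real CNN had learned weights, while the two fixed kernels here, the helper names, and the stand-in decoder states are all hypothetical.

```python
def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; output length equals input length.
    Assumes an odd kernel length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]

def pitch_embeddings(f0, kernels):
    """Apply each kernel to the pitch contour and stack the results,
    giving one small embedding vector per frame."""
    channels = [conv1d_same(f0, k) for k in kernels]
    return [list(frame) for frame in zip(*channels)]

def condition(decoder_states, pitch_emb):
    """Concatenate each frame's pitch embedding to its decoder state."""
    return [state + emb for state, emb in zip(decoder_states, pitch_emb)]

# Normalized pitch per frame (0.0 = unvoiced); the values are made up.
f0 = [0.0, 0.5, 0.5, 0.6, 0.0]
kernels = [[0.0, 1.0, 0.0],    # identity: the pitch value itself
           [-0.5, 0.0, 0.5]]   # central difference: local pitch slope
decoder_states = [[0.1, 0.2, 0.3] for _ in f0]  # stand-in hidden vectors
conditioned = condition(decoder_states, pitch_embeddings(f0, kernels))
```

Each conditioned vector now carries both the decoder's own state and a local summary of the pitch contour around that frame, which is what lets the following layers learn to follow the pitch.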

<Conditioning with frame-wise pitch information in training phase>

Duration conditioning attempts

Again, due to the lack of singing voice data, I did not have any paired set of voice data and syllable-level duration information. To condition the model the way I did with pitch information, I had to obtain the duration of each syllable (a chunk of grapheme-level characters) in the speech dataset.

To tackle this problem, I first tried the Kaldi framework, a widely used speech recognition toolkit. However, even after estimating the duration of each phoneme, I found it too complicated to derive ‘syllable-level’ durations from the phoneme-level duration information.

Therefore, I switched to using a pre-trained vanilla Tacotron model. After training the model sufficiently (until it generated natural-sounding speech), I extracted the attention matrices of all the training data. Since each row vector represents the period during which the decoder’s attention to the corresponding character is activated, I could get the ‘rough’ duration of each syllable. With a character-wise sequence, it was much easier to form the syllable-level chunks.

As with the pitch embedding, I conditioned the text encoder on the duration information. However, with a simple attempt that fed in the raw duration value itself, the model could not learn to make its attention activation depend on the condition.

<Conditioning the model with syllable-level duration information estimated with Kaldi (speech recognition framework) or another pre-trained vanilla Tacotron>

As I was running out of days in my internship, I set this approach aside for future attempts.
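One simple way to read durations off an attention matrix is sketched below. It assumes a hard assignment of every decoder frame to its most-attended character, which may differ from what I actually did; the function names and the toy matrix are illustrative only.

```python
def character_durations(attention):
    """attention[c][t] is the weight on character c at decoder frame t.
    Assign every frame to its most-attended character and count frames."""
    n_chars = len(attention)
    n_frames = len(attention[0])
    durations = [0] * n_chars
    for t in range(n_frames):
        best = max(range(n_chars), key=lambda c: attention[c][t])
        durations[best] += 1
    return durations

def syllable_durations(char_durations, syllable_lengths):
    """Sum character durations over consecutive chunks; syllable_lengths
    gives the number of characters in each syllable."""
    out, start = [], 0
    for n in syllable_lengths:
        out.append(sum(char_durations[start:start + n]))
        start += n
    return out

# Toy alignment: 4 characters (rows) by 6 decoder frames (columns).
attention = [
    [0.9, 0.6, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.3, 0.7, 0.2, 0.0, 0.0],
    [0.0, 0.1, 0.2, 0.7, 0.6, 0.1],
    [0.0, 0.0, 0.0, 0.1, 0.4, 0.9],
]
char_dur = character_durations(attention)      # frames per character
syl_dur = syllable_durations(char_dur, [2, 2])  # frames per syllable
```

The durations come out in decoder frames; multiplying by the frame shift of the spectrogram converts them to seconds.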

Forced masking on the attention module to manipulate duration

To directly force the duration of each syllable, I decided to manipulate the attention module manually. I first trained a Tacotron model conditioned on pitch only. Then, in the generation phase, I masked the attention module according to the given syllable-wise duration values. This masking forces the attention module to activate only the characters (character embeddings, to be precise) within the current syllable. As a result, the generated output roughly follows the desired durations.
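The masking idea can be sketched like this. It is a minimal illustration that assumes the mask is applied to the raw attention scores before normalization; where exactly the mask enters the real attention module, as well as the names `build_mask` and `masked_softmax`, are assumptions.

```python
import math

def build_mask(syllable_lengths, syllable_frames):
    """mask[t][c] is True when character c belongs to the syllable that is
    scheduled at decoder frame t."""
    n_chars = sum(syllable_lengths)
    mask, start = [], 0
    for n_char, n_frame in zip(syllable_lengths, syllable_frames):
        row = [start <= c < start + n_char for c in range(n_chars)]
        mask.extend(list(row) for _ in range(n_frame))
        start += n_char
    return mask

def masked_softmax(scores, allowed):
    """Softmax restricted to the allowed positions; all others get weight 0.
    Assumes at least one position is allowed."""
    m = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Two syllables of 2 characters each, held for 3 and 2 frames.
mask = build_mask([2, 2], [3, 2])
# Raw scores that would prefer the last character at frame 0.
scores = [0.2, 0.1, 0.0, 3.0]
weights = masked_softmax(scores, mask[0])
```

Even though the unmasked scores favor the fourth character, the weights at frame 0 are confined to the first syllable, so the decoder cannot move on until that syllable's frame budget is used up.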

<Forced attention mechanism in generation phase>

<Forced masking on every syllable-level character chunk>

I also hard-coded some further details into this forced masking, dividing the syllable-level duration into three parts (onset/nucleus/coda) or two parts (onset-nucleus/coda).
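As a rough sketch of such a division, the snippet below decomposes a precomposed Hangul syllable with the standard Unicode arithmetic and splits a frame budget among its parts. The split ratios are hypothetical placeholders, not the values I hard-coded, and it assumes the input characters are Hangul jamo.

```python
def decompose_hangul(syllable):
    """Return (onset, nucleus, coda) indices of a precomposed Hangul
    syllable; coda is 0 when the syllable has no final consonant."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 11172:
        raise ValueError("not a precomposed Hangul syllable")
    return index // 588, (index % 588) // 28, index % 28

def split_frames(total_frames, has_coda, ratios=(0.2, 0.6, 0.2)):
    """Split a syllable's frame budget into onset/nucleus/coda parts.
    The ratios are hypothetical placeholders, not values from the project."""
    onset_r, nucleus_r, coda_r = ratios
    if not has_coda:
        nucleus_r, coda_r = nucleus_r + coda_r, 0.0
    onset = round(total_frames * onset_r)
    coda = round(total_frames * coda_r)
    nucleus = total_frames - onset - coda  # remainder keeps the sum exact
    return onset, nucleus, coda

onset, nucleus, coda = decompose_hangul("한")
with_coda = split_frames(10, has_coda=coda != 0)
without_coda = split_frames(10, has_coda=decompose_hangul("가")[2] != 0)
```

The per-part frame counts can then be used to build a finer mask, one block per jamo instead of one block per syllable.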

Here are some examples of singing voice synthesized with the model.

<Synthesized singing phrase with the voice of one of my colleagues (above) / its attention activation matrix (below)>

<Synthesized singing phrase with a voice from an open-source Korean speech dataset (above) / its attention activation matrix (below)>

Interim summary and ongoing work

Well, I can’t say that the outcome is satisfactory yet, but I learned a lot from all of the attempts I made.

For future attempts, I’d like to try different additional modules or data feeding schemes for the duration values I got from the pre-trained model’s attention matrices.

Or, once I can get access to a sufficiently large singing voice dataset, I could completely redesign the architecture from scratch to model this particular task more efficiently.

2 thoughts on “Singing Voice Synthesis based on Tacotron Architecture”

hfcbc

November 6, 2019 at 11:13 am

this work is excellent！
can you give more detail about this tacotron-sing model?
my email:313514820@qq.com

1. Jeong Choi
  
  November 16, 2019 at 2:44 pm
  
  Hi! thanx for your comment 🙂 Actually, I stopped working on this project last year, but a paper with a similar approach came out from NVIDIA this year: https://nv-adlr.github.io/Mellotron
  Hope this would be informative 🙂
  