
Models

Listen, Attend and Spell (LAS)

[Figure: LAS architecture]

  • Attention-based Encoder Decoder (AED)
  • The input is a sequence of $t$ acoustic feature vectors $F = f_1, f_2, ..., f_t$, one vector per 10 ms frame
  • The output can be letters or wordpieces
  • The encoder-decoder architecture is appropriate when input and output sequences have stark length differences: very long acoustic feature sequences mapping to much shorter sequences of letters or words
  • Encoder-decoder architectures for speech need a compression stage that shortens the acoustic feature sequence before the encoder stage; alternatively, use the CTC loss function
    • e.g. the low frame rate algorithm: for time $i$, concatenate the acoustic feature vector $f_i$ with the prior two vectors $f_{i-1}, f_{i-2}$, then keep only every third stacked vector, reducing the frame rate from one vector per 10 ms to one per 30 ms
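
A minimal NumPy sketch of this compression step, assuming 10 ms frames stored as rows of an array; the stacking factor of 3 and the keep-every-third downsampling follow the low frame rate idea above and are illustrative defaults.

```python
import numpy as np

def low_frame_rate(features: np.ndarray, stack: int = 3, stride: int = 3) -> np.ndarray:
    """Concatenate each frame with the prior `stack - 1` frames, then keep every
    `stride`-th stacked frame, shortening the sequence (e.g. 10 ms -> 30 ms frames)."""
    t, d = features.shape
    # Left-pad with zeros so the first frames still have "prior" vectors to stack.
    padded = np.concatenate([np.zeros((stack - 1, d)), features], axis=0)
    stacked = np.concatenate([padded[i:i + t] for i in range(stack)], axis=1)
    return stacked[::stride]

# 100 frames of 40-dim log-mel features -> 34 frames of 120-dim stacked features
frames = np.random.randn(100, 40)
print(low_frame_rate(frames).shape)   # (34, 120)
```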

Adding an LM

  • Since an encoder-decoder model is essentially a conditional language model, it implicitly learns a language model for the output domain of letters from its training data
  • Speech-text datasets are much smaller than pure text datasets
  • We can usually improve a model at least slightly by incorporating a very large language model

$$\text{score}(Y|X) = \frac{1}{|Y|_c} \log P(Y|X) + \lambda \log P_{LM}(Y)$$

  • Beam search produces a final beam of hypothesized sentences, a.k.a. an n-best list
  • Each hypothesis is rescored by interpolating the LM score with the encoder-decoder score, with a weight $\lambda$ tuned on a held-out set
  • Since most models prefer shorter sentences, the encoder-decoder log probability is normalized by the number of characters in the hypothesis, $|Y|_c$
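
A minimal sketch of this rescoring, assuming a hypothetical `lm_log_prob` function and an n-best list of (text, encoder-decoder log-probability) pairs; neither name comes from a specific library.

```python
def rescore(hypotheses, lm_log_prob, lam=0.3):
    """Pick the best hypothesis from an n-best list by interpolating the
    length-normalized encoder-decoder score with an external LM score."""
    def score(hyp):
        text, enc_dec_logprob = hyp
        num_chars = max(len(text), 1)                 # |Y|_c, avoid dividing by zero
        return enc_dec_logprob / num_chars + lam * lm_log_prob(text)
    return max(hypotheses, key=score)

# Toy usage with a placeholder "LM" that mildly prefers shorter word sequences.
best = rescore(
    [("it's hard to recognize speech", -42.0),
     ("it's hard to wreck a nice beach", -40.5)],
    lm_log_prob=lambda s: -0.1 * len(s.split()),
)
print(best)
```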

Connectionist Temporal Classification (CTC)

[Figure: CTC architecture]

  • The intuition of CTC is to output a single character for every frame of the input, so that the output is the same length as the input
  • And then to apply a collapsing function that combines sequences of identical letters, resulting in a shorter sequence
  • The CTC collapsing function is many-to-one; lots of different alignments map to the same output string
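
A minimal sketch of the collapsing function $B$: merge runs of identical characters, then delete the blank; the underscore as blank symbol is just an illustrative choice.

```python
from itertools import groupby

BLANK = "_"   # illustrative blank symbol

def collapse(alignment: str) -> str:
    """CTC collapsing function B: merge runs of identical characters, drop blanks."""
    return "".join(ch for ch, _ in groupby(alignment) if ch != BLANK)

# Many different alignments map to the same output string (many-to-one):
print(collapse("dd_ii__nnn_eer"))   # diner
print(collapse("d_iii_n_ee_rr_"))   # diner
```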

Inference

  • The most probable output sequence $Y$ is the one that has, not the single best CTC alignment, but the highest sum over the probabilities of all its possible alignments (sketched below):

$$P_{CTC}(Y|X) = \sum_{A \in B^{-1}(Y)} P(A|X) = \sum_{A \in B^{-1}(Y)} \prod_{t=1}^{T} p(a_t|h_t)$$

$$\hat{Y} = \argmax_{Y} P_{CTC}(Y|X)$$

  • Because of the strong conditional independence assumption (given the input, the output at time $t$ is independent of the output at time $t-1$), CTC does not implicitly learn a language model
  • It is therefore essential when using CTC to interpolate in a language model
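
A brute-force sketch of this sum (exponential in the number of frames, so only for tiny examples); `probs[t][c]` standing in for $p(a_t = c \mid h_t)$ is an assumption about how the per-frame distributions are stored. In practice the sum is computed with dynamic programming (the CTC forward algorithm), not by enumeration.

```python
from itertools import groupby, product

BLANK = "_"   # illustrative blank symbol, as above

def collapse(alignment: str) -> str:
    return "".join(ch for ch, _ in groupby(alignment) if ch != BLANK)

def p_ctc(target: str, probs) -> float:
    """Sum P(A|X) over every frame-level alignment A that collapses to `target`,
    where probs[t][c] stands in for p(a_t = c | h_t); conditional independence
    lets each alignment's probability factor into a product over frames."""
    alphabet = list(probs[0].keys())
    total = 0.0
    for alignment in product(alphabet, repeat=len(probs)):
        if collapse("".join(alignment)) == target:
            p = 1.0
            for t, ch in enumerate(alignment):
                p *= probs[t][ch]
            total += p
    return total

# Toy example: 3 frames over the alphabet {a, b, blank}
probs = [{"a": 0.6, "b": 0.1, BLANK: 0.3}] * 3
print(p_ctc("a", probs))   # sums alignments like "aaa", "aa_", "_a_", ...
```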

Training

The loss for an entire dataset $D$ is the sum of the negative log-likelihoods of the correct output $Y$ for each input $X$:

$$L_{CTC} = \sum_{(X,Y) \in D} -\log P_{CTC}(Y|X)$$
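
A minimal training sketch using PyTorch's built-in `torch.nn.CTCLoss`, which performs the sum over alignments internally via dynamic programming; the tensor shapes and alphabet size here are illustrative.

```python
import torch
import torch.nn as nn

T, N, C = 50, 4, 28      # input frames, batch size, alphabet size (index 0 = blank)
S = 12                   # max target length in the batch

# Per-frame log-probabilities from some acoustic encoder (random here for the sketch).
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)
targets = torch.randint(1, C, (N, S), dtype=torch.long)       # labels 1..C-1
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(5, S + 1, (N,), dtype=torch.long)

# The loss is -log P_CTC(Y|X); the sum over alignments is handled internally.
ctc_loss = nn.CTCLoss(blank=0, reduction="mean")
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```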

RNN-Transducer (RNN-T)

[Figure: RNN-T architecture]

  • Because of the strong independence assumption in CTC, recognizers based on CTC don't achieve as high an accuracy as the attention-based encoder-decoder recognizers
  • CTC recognizers have the advantage of streaming, recognizing words on-line rather than waiting until the end of the sentence to recognize them
  • RNN-T removes the conditional independence assumption
  • Two main components: a CTC acoustic model, and a separate language model component called the predictor that conditions on the output token history
  • At each time step $t$, the CTC encoder outputs a hidden state $h_t^{\text{enc}}$ given the input $x_1 \dots x_t$
  • The language model predictor takes as input the previous output token, outputting a hidden state $h_u^{\text{pred}}$
  • The two hidden states are passed through the joint network, whose output is then passed through a softmax to predict the next character
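
A hedged PyTorch sketch of the joint network: project and combine $h_t^{\text{enc}}$ and $h_u^{\text{pred}}$, then softmax over the vocabulary. The layer sizes, the additive combination, and the tanh are assumptions for illustration, not a specific published configuration.

```python
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    """Combine encoder and predictor hidden states to predict the next token."""
    def __init__(self, enc_dim=512, pred_dim=512, joint_dim=640, vocab_size=1000):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size)

    def forward(self, h_enc, h_pred):
        # h_enc: (B, T, enc_dim) from the acoustic encoder,
        # h_pred: (B, U, pred_dim) from the label predictor.
        # Broadcasting combines every (frame t, output position u) pair of states.
        joint = torch.tanh(self.enc_proj(h_enc).unsqueeze(2) +
                           self.pred_proj(h_pred).unsqueeze(1))
        return self.out(joint).log_softmax(dim=-1)            # (B, T, U, vocab)

h_enc = torch.randn(2, 100, 512)    # encoder states for 100 frames
h_pred = torch.randn(2, 20, 512)    # predictor states for 20 output tokens
print(JointNetwork()(h_enc, h_pred).shape)    # torch.Size([2, 100, 20, 1000])
```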

Tacotron2

[Figure: Tacotron2 architecture]

  • Extends the earlier Tacotron architecture and the Wavenet vocoder
  • An encoder-decoder maps from graphemes to mel spectrograms, followed by a vocoder that maps to wavefiles
  • Location-based attention, in which the computation of the $\alpha$ values makes use of the $\alpha$ values from the prior time step
  • Autoregressively predict one 80-dimensional log-mel filterbank vector frame (50 ms, with a 12.5 ms stride) at each step
  • Stop token prediction: a decision at each step about whether to stop producing output (sketched below)
  • Trained on gold log-mel filterbank features, using teacher forcing
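
A hedged sketch of the inference loop with stop-token control; `decoder_step` and the 0.5 threshold are hypothetical placeholders for whatever the actual decoder provides.

```python
import torch

def generate_mel(decoder_step, encoder_outputs, max_frames=1000, stop_threshold=0.5):
    """Autoregressively predict 80-dim log-mel frames until the stop token fires.

    `decoder_step(prev_frame, encoder_outputs)` is a hypothetical callable that
    returns (next mel frame of shape (1, 80), stop-token probability).
    """
    frames = []
    prev = torch.zeros(1, 80)                    # all-zero "go" frame
    for _ in range(max_frames):
        prev, stop_prob = decoder_step(prev, encoder_outputs)
        frames.append(prev)
        if float(stop_prob) > stop_threshold:    # decoder decides to stop producing output
            break
    return torch.cat(frames, dim=0)              # (num_frames, 80) mel spectrogram
```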

WaveNet

  • An autoregressive network
  • Takes spectrograms as input and produces output represented as sequences of 8-bit $\mu$-law audio samples
  • This means that we can predict the value of each sample with a simple 256-way categorical classifier
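
A minimal sketch of 8-bit $\mu$-law companding, which maps each sample in $[-1, 1]$ to one of 256 classes; $\mu = 255$ is the standard choice.

```python
import numpy as np

def mu_law_encode(audio: np.ndarray, mu: int = 255) -> np.ndarray:
    """Compress samples in [-1, 1] with the mu-law curve and quantize to 256 bins,
    so each sample becomes a class label for a 256-way categorical classifier."""
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)   # labels in 0..255

samples = np.sin(np.linspace(0, 2 * np.pi, 16000))   # toy sine wave in [-1, 1]
labels = mu_law_encode(samples)
print(labels.min(), labels.max())                    # 0 255
```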

The probability of a waveform, a sequence of 8-bit $\mu$-law values $Y = y_1, \dots, y_T$, given an intermediate input mel spectrogram $h$ is computed as:

$$p(Y) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, h_1, \dots, h_t)$$

  • The probability distribution is modeled by dilated convolution, a subtype of causal convolutional layer
  • Dilated convolutions allow the vocoder to grow the receptive field exponentially with depth

[Figure: dilated convolutions]
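
A hedged PyTorch sketch of a stack of dilated causal 1-D convolutions; the kernel size of 2 and the doubling dilation schedule are the usual WaveNet-style choices, used here only to show the exponential receptive-field growth.

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    """Causal Conv1d layers with dilations 1, 2, 4, ..., so the receptive field
    grows exponentially with depth instead of linearly."""
    def __init__(self, channels=32, kernel_size=2, num_layers=10):
        super().__init__()
        self.layers = nn.ModuleList()
        self.pads = []
        for i in range(num_layers):
            dilation = 2 ** i
            self.pads.append((kernel_size - 1) * dilation)   # left padding -> causal
            self.layers.append(
                nn.Conv1d(channels, channels, kernel_size, dilation=dilation))
        # Receptive field of the stack: 1 + sum_i (kernel_size - 1) * 2^i
        self.receptive_field = 1 + sum(self.pads)

    def forward(self, x):                                    # x: (B, channels, T)
        for pad, conv in zip(self.pads, self.layers):
            x = conv(nn.functional.pad(x, (pad, 0)))         # pad only the left side
        return x

stack = DilatedCausalStack()
print(stack.receptive_field)        # 1024 samples for 10 layers with kernel size 2
print(stack(torch.randn(1, 32, 2000)).shape)   # torch.Size([1, 32, 2000])
```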