Updated 2023-10-27
The Encoder-Decoder Framework is commonly used for solving sequence-to-sequence tasks, where an input sequence is processed by an encoder and an output sequence is produced by a decoder. This framework consists of three main components: an encoder, a context vector, and a decoder, as illustrated in Figure 1:
An encoder processes an input sequence and creates a context vector that captures context from the entire sequence and serves as a summary of the input.
Figure 2 shows an encoder example that takes the input sequence, "I am a boy", appended with the end-of-sequence token "[EOS]":
A decoder is conditioned on the context vector, which allows it to generate an output sequence contextually relevant to the input, often one token at a time.
The decoder mentioned above does not guarantee the generation of the end-of-sequence token at any step. What potential issues can arise from this?
Sequence to Sequence Learning with Neural Networks, Sutskever et al., NeurIPS, 2014.
Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al., ICLR, 2015.
Let $X = \{x_1, \ldots, x_n, x_{n+1}\}$ be an input sequence, where $x_i$ is the $i$'th word in the sequence and $x_{n+1} = \text{[EOS]}$ is an artificial token appended to indicate the end of the sequence. The encoder utilizes two functions, $f$ and $g$, which are defined in the same way as in the RNN for text classification. Notice that the end-of-sequence token $x_{n+1}$ is used to create an additional hidden state $h_{n+1}$, which in turn creates the context vector $c$.
Is it possible to derive the context vector $c$ from $h_n$ instead of $h_{n+1}$? What is the purpose of appending an extra token to indicate the end of the sequence?
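As a concrete illustration, the following is a minimal NumPy sketch of the encoder, assuming $f$ is an Elman-style recurrence and $g$ is a simple projection of $h_{n+1}$; the dimensions, weight matrices (W_x, W_h, W_c), and random "embeddings" are illustrative assumptions rather than definitions from the text.

```python
import numpy as np

# A minimal NumPy sketch of the encoder, assuming f is an Elman-style recurrence
# h_i = tanh(W_x x_i + W_h h_{i-1}) and g is a projection c = tanh(W_c h_{n+1}).
# The sizes, weight matrices, and random embeddings are illustrative assumptions.

rng = np.random.default_rng(0)
d_x, d_h = 8, 16                               # embedding and hidden-state sizes (assumed)

W_x = rng.normal(scale=0.1, size=(d_h, d_x))   # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(d_h, d_h))   # hidden-to-hidden weights
W_c = rng.normal(scale=0.1, size=(d_h, d_h))   # projects h_{n+1} into the context vector

def f(x_i, h_prev):
    """Combines the current token embedding with the previous hidden state."""
    return np.tanh(W_x @ x_i + W_h @ h_prev)

def g(h_last):
    """Creates the context vector from the final hidden state h_{n+1}."""
    return np.tanh(W_c @ h_last)

# "I am a boy [EOS]" represented by 5 random embeddings; the last row stands for [EOS].
X = rng.normal(size=(5, d_x))

h = np.zeros(d_h)                              # h_0
for x_i in X:                                  # the loop also consumes [EOS],
    h = f(x_i, h)                              # producing the additional hidden state h_{n+1}

c = g(h)                                       # context vector summarizing the whole input
print(c.shape)                                 # (16,)
```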
Let $Y = \{y_1, \ldots, y_m, y_{m+1}\}$ be an output sequence, where $y_t$ is the $t$'th word in the sequence, and $y_{m+1} = \text{[EOS]}$ is an artificial token to indicate the end of the sequence. To generate the output sequence, the decoder defines two functions, $f'$ and $g'$:
$f'$ takes the previous output $y_{t-1}$ and its hidden state $s_{t-1}$, and returns a hidden state $s_t$ such that $s_t = f'(y_{t-1}, s_{t-1}) = \alpha(W^y y_{t-1} + W^s s_{t-1})$, where $W^y \in \mathbb{R}^{d_s \times d_y}$, $W^s \in \mathbb{R}^{d_s \times d_s}$, and $\alpha$ is an activation function.
$g'$ takes the hidden state $s_t$ and returns an output $y_t$ such that $y_t = g'(s_t) = \operatorname{argmax}(W^o s_t)$, where $W^o \in \mathbb{R}^{d_o \times d_s}$ and $d_o$ is the size of the output vocabulary.
Note that the initial hidden state $s_1$ is created by considering only the context vector $c$ such that the first output $y_1$ is solely predicted by the context in the input sequence. However, the prediction of every subsequent output $y_t$ is conditioned on both the previous output $y_{t-1}$ and its hidden state $s_{t-1}$. Finally, the decoder stops generating output when it predicts the end-of-sequence token $y_{m+1} = \text{[EOS]}$.
In some variations of the decoder, the initial hidden state is created by considering both the context vector $c$ and the last hidden state of the encoder $h_{n+1}$ [1].
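The following sketch mirrors the decoder described above, using the base variant in which the initial hidden state is created from the context vector $c$ alone; the toy Korean vocabulary, the embedding table, the tanh activation, and all weight shapes are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

# A minimal NumPy sketch of the decoder, using the base variant in which the initial
# hidden state is created from the context vector c alone. The toy vocabulary, the
# embedding table, the tanh activation, and all weight shapes are illustrative assumptions.

rng = np.random.default_rng(1)
vocab = ["나", "+는", "소년", "+이다", "[EOS]"]
d_y, d_s, d_h = 8, 16, 16                      # embedding, decoder-state, and context sizes

E   = rng.normal(scale=0.1, size=(len(vocab), d_y))   # embeddings of the output tokens
W_c = rng.normal(scale=0.1, size=(d_s, d_h))          # context vector -> initial hidden state
W_y = rng.normal(scale=0.1, size=(d_s, d_y))          # previous output -> hidden state
W_s = rng.normal(scale=0.1, size=(d_s, d_s))          # previous hidden -> hidden state
W_o = rng.normal(scale=0.1, size=(len(vocab), d_s))   # hidden state -> vocabulary scores

def f_dec(y_prev, s_prev):
    """s_t = alpha(W^y y_{t-1} + W^s s_{t-1}), with tanh as the assumed activation."""
    return np.tanh(W_y @ E[y_prev] + W_s @ s_prev)

def g_dec(s_t):
    """Returns the highest-scoring output token for the hidden state s_t."""
    return int(np.argmax(W_o @ s_t))

def decode(c, max_len=10):
    s = np.tanh(W_c @ c)                       # initial hidden state, created from c only
    y = g_dec(s)                               # first output, predicted solely from the context
    outputs = [vocab[y]]
    while vocab[y] != "[EOS]" and len(outputs) < max_len:
        s = f_dec(y, s)                        # condition on the previous output and its hidden state
        y = g_dec(s)
        outputs.append(vocab[y])
    return outputs

print(decode(rng.normal(size=d_h)))            # an (untrained, hence arbitrary) token sequence
```

Nothing forces the loop to ever predict "[EOS]", which is why a maximum length is enforced here; this relates directly to the question raised earlier about the end-of-sequence token.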
Figure 3 illustrates a decoder example that takes the context vector $c$ and generates the output sequence "나(I) +는(SBJ) 소년(boy) +이다(am)", terminated by the end-of-sequence token "[EOS]", translating the input sequence from English to Korean:
The likelihood of the current output $y_t$ can be calculated as:

$$P(y_t \mid y_1, \ldots, y_{t-1}, c) = q(y_{t-1}, s_{t-1}, c)$$

where $q$ is a function that takes the context vector $c$, the previous output $y_{t-1}$, and its hidden state $s_{t-1}$, and returns the probability of $y_t$. Then, the maximum likelihood of the output sequence can be estimated as follows ($y_0$ and $s_0$ are treated as null for the first output):

$$P(Y \mid X) = \prod_{t=1}^{m+1} P(y_t \mid y_1, \ldots, y_{t-1}, c) = \prod_{t=1}^{m+1} q(y_{t-1}, s_{t-1}, c)$$
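To make the product above concrete, the sketch below computes $\log P(Y \mid X)$ for a reference output sequence with teacher forcing, i.e., feeding the reference $y_{t-1}$ back at each step and accumulating the probability assigned to the reference $y_t$; as before, the vocabulary, embeddings, and weight shapes are illustrative assumptions, and the model is untrained.

```python
import numpy as np

# A self-contained sketch of the sequence likelihood P(Y | X) under the decoder above,
# computed with teacher forcing: the reference previous output y_{t-1} is fed back at
# each step and the probability assigned to the reference y_t is accumulated in log space.
# The vocabulary, embeddings, and weight shapes remain illustrative assumptions.

rng = np.random.default_rng(2)
vocab = {"나": 0, "+는": 1, "소년": 2, "+이다": 3, "[EOS]": 4}
d_y, d_s, d_h = 8, 16, 16

E   = rng.normal(scale=0.1, size=(len(vocab), d_y))
W_c = rng.normal(scale=0.1, size=(d_s, d_h))
W_y = rng.normal(scale=0.1, size=(d_s, d_y))
W_s = rng.normal(scale=0.1, size=(d_s, d_s))
W_o = rng.normal(scale=0.1, size=(len(vocab), d_s))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sequence_log_likelihood(c, reference):
    """log P(Y | X) = sum_t log q(y_{t-1}, s_{t-1}, c), where Y ends with [EOS]."""
    s = np.tanh(W_c @ c)                             # initial hidden state from c only
    log_p = 0.0
    for token in reference:                          # includes the final [EOS] token
        probs = softmax(W_o @ s)                     # q(.): distribution over the vocabulary
        log_p += np.log(probs[vocab[token]])         # probability assigned to the reference y_t
        s = np.tanh(W_y @ E[vocab[token]] + W_s @ s) # s_{t+1} = f'(y_t, s_t)
    return log_p

c = rng.normal(size=d_h)                             # context vector produced by the encoder
Y = ["나", "+는", "소년", "+이다", "[EOS]"]           # output sequence terminated by [EOS]
print(sequence_log_likelihood(c, Y))                 # log-probability under the untrained model
```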
The maximum likelihood estimation of the output sequence above accounts for the end-of-sequence token $y_{m+1}$. What are the benefits of incorporating this artificial token when estimating the sequence probability?