Encoder-Decoder Framework
The Encoder-Decoder Framework is commonly used to solve sequence-to-sequence tasks: it takes an input sequence, processes it through an encoder, and produces an output sequence through a decoder. This framework consists of three main components, an encoder, a context vector, and a decoder, as illustrated in Figure 1:

Encoder
An encoder processes an input sequence and creates a context vector that captures context from the entire sequence and serves as a summary of the input.
Let $X = [x_1, \ldots, x_n, x_c]$ be an input sequence, where $x_i \in \mathbb{R}^{d \times 1}$ is the $i$'th word in the sequence and $x_c$ is an artificial token appended to indicate the end of the sequence. The encoder utilizes two functions, $f$ and $g$, which are defined in the same way as in the RNN for text classification. Notice that the end-of-sequence token $x_c$ is used to create an additional hidden state $h_c$, which in turn creates the context vector $y_c$.
Figure 2 shows an encoder example that takes the input sequence, "I am a boy", appended with the end-of-sequence token "[EOS]":

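To make the recurrence concrete, here is a minimal NumPy sketch of the encoder above. Since the text defers the definitions of $f$ and $g$ to the RNN text-classification chapter, the forms $h_i = \alpha(W_x x_i + W_h h_{i-1})$ and $y_c = W_o h_c$ used below, as well as the dimensions, weight names, and the random vectors standing in for word embeddings, are illustrative assumptions rather than the chapter's exact setup.

```python
import numpy as np

# Dimensions (illustrative): d = word vector size, e = hidden state size, o = context vector size
d, e, o = 4, 5, 3
rng = np.random.default_rng(0)

W_x = rng.normal(size=(e, d))     # input-to-hidden weights
W_h = rng.normal(size=(e, e))     # hidden-to-hidden weights
W_o = rng.normal(size=(o, e))     # hidden-to-output weights

def f(x_i, h_prev):
    """Hidden-state update: h_i = alpha(W_x x_i + W_h h_{i-1}), with alpha = tanh."""
    return np.tanh(W_x @ x_i + W_h @ h_prev)

def g(h_i):
    """Output projection: W_o h_i."""
    return W_o @ h_i

# "I am a boy" plus the end-of-sequence token [EOS], each stood in for by a random d-dimensional vector.
X = [rng.normal(size=(d, 1)) for _ in range(5)]   # x_1, x_2, x_3, x_4, x_c

h = np.zeros((e, 1))              # initial hidden state
for x_i in X:
    h = f(x_i, h)                 # the last iteration consumes x_c and yields h_c

h_c = h                           # hidden state created from the end-of-sequence token
y_c = g(h_c)                      # context vector summarizing the entire input sequence
print(y_c.shape)                  # (3, 1)
```

The point of the sketch is that only the final hidden state $h_c$, produced after the end-of-sequence token, is projected into the context vector $y_c$ that the decoder will consume.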
Is it possible to derive the context vector from $x_n$ instead of $x_c$? What is the purpose of appending an extra token to indicate the end of the sequence?
Decoder
A decoder is conditioned on the context vector, which allows it to generate an output sequence contextually relevant to the input, often one token at a time.
Let $Y = [y_1, \ldots, y_m, y_t]$ be an output sequence, where $y_i \in \mathbb{R}^{o \times 1}$ is the $i$'th word in the sequence, and $y_t$ is an artificial token to indicate the end of the sequence. To generate the output sequence, the decoder defines two functions, $f$ and $g$:
$f$ takes the previous output $y_{i-1}$ and its hidden state $s_{i-1}$, and returns a hidden state $s_i \in \mathbb{R}^{e \times 1}$ such that $f(y_{i-1}, s_{i-1}) = \alpha(W_y y_{i-1} + W_s s_{i-1}) = s_i$, where $W_y \in \mathbb{R}^{e \times o}$, $W_s \in \mathbb{R}^{e \times e}$, and $\alpha$ is an activation function.
$g$ takes the hidden state $s_i$ and returns an output $y_i \in \mathbb{R}^{o \times 1}$ such that $g(s_i) = W_o s_i = y_i$, where $W_o \in \mathbb{R}^{o \times e}$.
Note that the initial hidden state $s_1$ is created by considering only the context vector $y_c$, so that the first output $y_1$ is predicted solely from the context of the input sequence. However, the prediction of every subsequent output $y_i$ is conditioned on both the previous output $y_{i-1}$ and its hidden state $s_{i-1}$. Finally, the decoder stops generating output when it predicts the end-of-sequence token $y_t$.
In some variations of the decoder, the initial hidden state $s_1$ is created by considering both $y_c$ and $h_c$ [1].
Figure 3 illustrates a decoder example that takes the context vector $y_c$ and generates the output sequence "나(I) +는(SBJ) 소년(boy) +이다(am)", terminated by the end-of-sequence token "[EOS]", which is the Korean translation of the English input sequence:

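A minimal NumPy sketch of this generation procedure is given below, assuming a greedy argmax readout over $g(s_i) = W_o s_i$, a one-hot feedback of the chosen word as $y_{i-1}$, and a projection $W_c$ that turns the context vector into the initial state $s_1$; none of these specifics are fixed by the text. A maximum-length cutoff is included because, as the question below notes, the end-of-sequence token is not guaranteed to appear.

```python
import numpy as np

# Dimensions (illustrative): o = output vocabulary size, e = hidden state size, c = context vector size
o, e, c = 6, 5, 3
EOS = o - 1                        # index reserved for the end-of-sequence token y_t (an assumption)
rng = np.random.default_rng(1)

W_y = rng.normal(size=(e, o))      # previous-output-to-hidden weights
W_s = rng.normal(size=(e, e))      # hidden-to-hidden weights
W_o = rng.normal(size=(o, e))      # hidden-to-output weights
W_c = rng.normal(size=(e, c))      # context-to-initial-hidden weights (assumed projection)

def f(y_prev, s_prev):
    """s_i = alpha(W_y y_{i-1} + W_s s_{i-1}), with alpha = tanh."""
    return np.tanh(W_y @ y_prev + W_s @ s_prev)

def g(s_i):
    """y_i = W_o s_i: unnormalized scores over the output vocabulary."""
    return W_o @ s_i

def decode(y_c, max_len=20):
    s = np.tanh(W_c @ y_c)             # s_1 is created from the context vector y_c alone
    outputs = []
    for _ in range(max_len):           # cutoff in case the end-of-sequence token is never predicted
        scores = g(s)                  # scores for the current output word y_i
        token = int(np.argmax(scores)) # greedy choice of the most likely word
        outputs.append(token)
        if token == EOS:               # stop once the end-of-sequence token y_t is predicted
            break
        y_prev = np.eye(o)[:, [token]] # feed the chosen word back in as y_{i-1}
        s = f(y_prev, s)               # next hidden state from the previous output and state
    return outputs

print(decode(rng.normal(size=(c, 1))))   # a list of word indices, ending in EOS or cut off at max_len
```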
The decoder described above is not guaranteed to generate the end-of-sequence token at any step. What potential issues can arise from this?
The likelihood of the current output $y_i$ can be calculated as:

$$P(y_i \mid y_c, y_1, \ldots, y_{i-1}) = q(y_c, y_{i-1}, s_{i-1})$$

where $q$ is a function that takes the context vector $y_c$, the previous output $y_{i-1}$, and its hidden state $s_{i-1}$, and returns the probability of $y_i$. Then, the maximum likelihood of the output sequence can be estimated as follows ($y_0 = s_0 = 0$):

$$P(Y \mid X) = \prod_{i=1}^{m+1} P(y_i \mid y_c, y_1, \ldots, y_{i-1}) = \prod_{i=1}^{m+1} q(y_c, y_{i-1}, s_{i-1}),$$

where $y_{m+1} = y_t$ denotes the end-of-sequence token.
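The sketch below scores a fixed output sequence with this factorization, accumulating $\log q(y_c, y_{i-1}, s_{i-1})$ over every token, including the end-of-sequence token. How $q$ combines its three arguments is left open by the text; here a softmax readout is assumed, and the context vector is fed into every step's hidden state so that the first step, with $y_0 = s_0 = 0$, depends on $y_c$ alone. The weights and token indices are toy values for illustration.

```python
import numpy as np

o, e, c = 6, 5, 3                        # vocabulary, hidden state, and context vector sizes (illustrative)
rng = np.random.default_rng(2)
W_y = rng.normal(size=(e, o)) * 0.1      # previous-output-to-hidden weights
W_s = rng.normal(size=(e, e)) * 0.1      # hidden-to-hidden weights
W_o = rng.normal(size=(o, e)) * 0.1      # hidden-to-output weights
W_c = rng.normal(size=(e, c)) * 0.1      # context-to-hidden weights

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def q(y_c, y_prev, s_prev):
    """Return a probability distribution over the next word and the next hidden state.

    A softmax over W_o s_i is assumed so that q yields proper probabilities.
    """
    s_i = np.tanh(W_y @ y_prev + W_s @ s_prev + W_c @ y_c)
    return softmax(W_o @ s_i), s_i

def sequence_log_likelihood(y_c, Y):
    """log P(Y | y_c) = sum_i log q(y_c, y_{i-1}, s_{i-1}), including the EOS token y_t."""
    y_prev = np.zeros((o, 1))            # y_0 = 0
    s_prev = np.zeros((e, 1))            # s_0 = 0
    log_p = 0.0
    for idx in Y:                        # Y holds word indices, e.g. [2, 4, 1, 5] with 5 as EOS
        probs, s_prev = q(y_c, y_prev, s_prev)
        log_p += float(np.log(probs[idx, 0]))
        y_prev = np.eye(o)[:, [idx]]     # the observed word becomes y_{i-1} for the next step
    return log_p

print(sequence_log_likelihood(rng.normal(size=(c, 1)), [2, 4, 1, 5]))
```

Working in log space avoids numerical underflow when multiplying many small per-token probabilities; maximizing this quantity over the model parameters is the maximum likelihood estimation described above.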
The maximum likelihood estimation of the output sequence above accounts for the end-of-sequence token $y_t$. What are the benefits of incorporating this artificial token when estimating the sequence probability?
References
1. Sequence to Sequence Learning with Neural Networks, Sutskever et al., NeurIPS, 2014.
2. Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al., ICLR, 2015.