Encoder-Decoder Framework

The Encoder-Decoder Framework is commonly used to solve sequence-to-sequence tasks: an encoder processes an input sequence into a context vector, and a decoder generates an output sequence from that vector. This framework consists of three main components: an encoder, a context vector, and a decoder, as illustrated in Figure 1.

Encoder

An encoder processes an input sequence and creates a context vector that captures information from the entire sequence and serves as a summary of the input.

Let $X = [x_1, \ldots, x_n, x_c]$ be an input sequence, where $x_i \in \mathbb{R}^{d \times 1}$ is the $i$'th word in the sequence and $x_c$ is an artificial token appended to indicate the end of the sequence. The encoder utilizes two functions, $f$ and $g$, which are defined in the same way as in the RNN for text classification. Notice that the end-of-sequence token $x_c$ is used to create an additional hidden state $h_c$, which in turn creates the context vector $y_c$.

Figure 2 shows an encoder example that takes the input sequence "I am a boy" appended with the end-of-sequence token "[EOS]".
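To make the recurrence concrete, here is a minimal NumPy sketch of the encoder. The dimensions, the weight names `Wx`, `Wh`, and `Wo`, the choice of $\tanh$ as the activation, and the zero initialization of $h_0$ are all illustrative assumptions, since $f$ and $g$ are defined in the RNN for text classification.

```python
import numpy as np

d, e, o = 4, 8, 6   # assumed toy sizes: word, hidden, and context dimensions

rng = np.random.default_rng(0)
Wx = rng.normal(size=(e, d))    # input-to-hidden weights (assumed)
Wh = rng.normal(size=(e, e))    # hidden-to-hidden weights (assumed)
Wo = rng.normal(size=(o, e))    # hidden-to-output weights (assumed)

def f(x, h_prev):
    """h_i = tanh(W^x x_i + W^h h_{i-1})."""
    return np.tanh(Wx @ x + Wh @ h_prev)

def g(h):
    """Projects a hidden state to an output vector: W^o h."""
    return Wo @ h

# "I am a boy" + [EOS]: five placeholder word vectors x_1, ..., x_n, x_c.
X = [rng.normal(size=d) for _ in range(5)]

h = np.zeros(e)     # h_0 = 0 (assumed initialization)
for x in X:         # produces h_1, ..., h_n and finally h_c
    h = f(x, h)
y_c = g(h)          # the context vector comes from the [EOS] hidden state
```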

Is it possible to derive the context vector from $x_n$ instead of $x_c$? What is the purpose of appending an extra token to indicate the end of the sequence?

Decoder

A decoder is conditioned on the context vector, which allows it to generate an output sequence contextually relevant to the input, often one token at a time.

Let $Y = [y_1, \ldots, y_m, y_t]$ be an output sequence, where $y_i \in \mathbb{R}^{o \times 1}$ is the $i$'th word in the sequence, and $y_t$ is an artificial token to indicate the end of the sequence. To generate the output sequence, the decoder defines two functions, $f$ and $g$, sketched in code after this list:

  • $f$ takes the previous output $y_{i-1}$ and its hidden state $s_{i-1}$, and returns a hidden state $s_i \in \mathbb{R}^{e \times 1}$ such that $f(y_{i-1}, s_{i-1}) = \alpha(W^y y_{i-1} + W^s s_{i-1}) = s_i$, where $W^y \in \mathbb{R}^{e \times o}$, $W^s \in \mathbb{R}^{e \times e}$, and $\alpha$ is an activation function.

  • $g$ takes the hidden state $s_i$ and returns an output $y_i \in \mathbb{R}^{o \times 1}$ such that $g(s_i) = W^o s_i = y_i$, where $W^o \in \mathbb{R}^{o \times e}$.
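A minimal sketch of these two functions, assuming $\alpha = \tanh$ and random toy weights; the sizes and names below are illustrative, not prescribed by the text:

```python
import numpy as np

o, e = 6, 8    # assumed toy sizes: output dimension, decoder hidden dimension

rng = np.random.default_rng(1)
Wy = rng.normal(size=(e, o))   # W^y: previous output -> hidden space
Ws = rng.normal(size=(e, e))   # W^s: previous hidden state -> hidden space
Wo = rng.normal(size=(o, e))   # W^o: hidden state -> output

def f(y_prev, s_prev):
    """s_i = alpha(W^y y_{i-1} + W^s s_{i-1}), with alpha = tanh (assumed)."""
    return np.tanh(Wy @ y_prev + Ws @ s_prev)

def g(s):
    """y_i = W^o s_i."""
    return Wo @ s
```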

Note that the initial hidden state $s_1$ is created by considering only the context vector $y_c$, such that the first output $y_1$ is predicted solely from the context of the input sequence. However, the prediction of every subsequent output $y_i$ is conditioned on both the previous output $y_{i-1}$ and its hidden state $s_{i-1}$. Finally, the decoder stops generating output when it predicts the end-of-sequence token $y_t$.

In some variations of the decoder, the initial hidden state $s_1$ is created by considering both $y_c$ and $h_c$ [1].
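Continuing the sketch above (reusing `f`, `g`, and the dimension `e`), the generation procedure can be written as a greedy loop that starts from the context vector and stops at the end-of-sequence token. The EOS test and the `max_len` safeguard are assumptions; the cap is one pragmatic answer to the question below, since the decoder is not guaranteed to ever emit the token.

```python
EOS_ID = 0   # assumed index of the [EOS] token in the output vocabulary

def is_eos(y):
    """Hypothetical test: the highest-scoring entry is the [EOS] index."""
    return int(np.argmax(y)) == EOS_ID

def decode(y_c, max_len=20):
    """Greedy generation conditioned on the context vector y_c."""
    outputs = []
    s = f(y_c, np.zeros(e))     # s_1 considers only the context vector
    y = g(s)                    # y_1 is predicted from the context alone
    outputs.append(y)
    while not is_eos(y) and len(outputs) < max_len:
        s = f(y, s)             # s_i = f(y_{i-1}, s_{i-1})
        y = g(s)                # y_i = g(s_i)
        outputs.append(y)
    return outputs              # y_1, ..., y_m (plus y_t if it was produced)
```

Here $s_1 = f(y_c, \mathbf{0})$ is one simple way to realize "created by considering only the context vector"; the variation in [1] would additionally feed $h_c$.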

Figure 3 illustrates a decoder example that takes the context vector $y_c$ and generates the output sequence "나(I) +는(SBJ) 소년(boy) +이다(am)", terminated by the end-of-sequence token "[EOS]", which translates the input sequence from English into Korean.

The decoder described above does not guarantee that the end-of-sequence token will be generated at any step. What potential issues can arise from this?

The likelihood of the current output $y_i$ can be calculated as:

$$P(y_i \mid \{y_c\} \cup \{y_1, \ldots, y_{i-1}\}) = q(y_c, y_{i-1}, s_{i-1})$$

where $q$ is a function that takes the context vector $y_c$, the previous output $y_{i-1}$, and its hidden state $s_{i-1}$, and returns the probability of $y_i$. Then, the likelihood of the output sequence can be estimated as follows ($y_0 = s_0 = \mathbf{0}$):

$$P(Y) = \prod_{i=1}^{m+1} q(y_c, y_{i-1}, s_{i-1})$$
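As a sketch of how $q$ might be realized, one common choice (an assumption here, not something the text prescribes) is a softmax over the logits $g(s_i)$. The snippet reuses `f`, `g`, `e`, and `o` from the decoder sketch, represents the target sequence as vocabulary indices ending with the EOS id, and feeds $y_c$ as the first "previous output" so that the first prediction is conditioned on the context alone.

```python
import numpy as np

def softmax(z):
    z = z - z.max()             # shift logits for numerical stability
    p = np.exp(z)
    return p / p.sum()

def log_likelihood(Y_ids, y_c):
    """log P(Y) = sum_{i=1}^{m+1} log q(y_c, y_{i-1}, s_{i-1}),
    where Y_ids = [y_1, ..., y_m, y_t] as vocabulary indices."""
    total = 0.0
    s = np.zeros(e)             # s_0 = 0
    y_prev = y_c                # the context conditions the first prediction
    for t in Y_ids:
        s = f(y_prev, s)        # hidden state from y_{i-1} and s_{i-1}
        p = softmax(g(s))       # q(.): a distribution over the o output tokens
        total += np.log(p[t])   # log-probability of the observed token
        y_prev = np.eye(o)[t]   # feed the gold token (teacher forcing)
    return total
```

Working in log space avoids the numerical underflow that directly multiplying the $m + 1$ probabilities in the product would cause.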

The likelihood estimation of the output sequence above accounts for the end-of-sequence token $y_t$. What are the benefits of incorporating this artificial token when estimating the sequence probability?

References

  1. Sequence to Sequence Learning with Neural Networks, Sutskever et al., NeurIPS, 2014.
