NLP Essentials
Encoder-Decoder Framework

Updated 2023-10-27


The Encoder-Decoder Framework is commonly used for solving sequence-to-sequence tasks, where it takes an input sequence, processes it through an encoder, and produces an output sequence through a decoder. This framework consists of three main components: an encoder, a context vector, and a decoder, as illustrated in Figure 1:

Figure 1 - An overview of an encoder-decoder framework.

Encoder

An encoder processes an input sequence and creates a context vector that captures context from the entire sequence and serves as a summary of the input.

Figure 2 shows an encoder example that takes the input sequence, "I am a boy", appended with the end-of-sequence token "[EOS]":

Figure 2 - An encoder example.

Let $X = [x_1, \ldots, x_n, x_c]$ be an input sequence, where $x_i \in \mathbb{R}^{d \times 1}$ is the $i$'th word in the sequence and $x_c$ is an artificial token appended to indicate the end of the sequence. The encoder utilizes two functions, $f$ and $g$, which are defined in the same way as in the RNN for text classification. Notice that the end-of-sequence token $x_c$ is used to create an additional hidden state $h_c$, which in turn creates the context vector $y_c$.

Is it possible to derive the context vector from $x_n$ instead of $x_c$? What is the purpose of appending an extra token to indicate the end of the sequence?
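As a concrete illustration of the encoder described above, here is a minimal NumPy sketch. It assumes a simple RNN recurrence $h_i = \alpha(W^x x_i + W^h h_{i-1})$ with $\tanh$ as the activation and a linear map from the final hidden state $h_c$ to the context vector; the weight names (`Wx`, `Wh`, `Wc`) and the dimensions in the toy usage are illustrative, not taken from the original.

```python
import numpy as np

def encode(X, Wx, Wh, Wc, alpha=np.tanh):
    """Encode X = [x_1, ..., x_n, x_c] into a context vector y_c.

    X  : list of (d x 1) word vectors, with the [EOS] embedding x_c appended
    Wx : (e x d) input-to-hidden weights
    Wh : (e x e) hidden-to-hidden weights
    Wc : (c x e) weights mapping the final hidden state h_c to the context vector
    """
    h = np.zeros((Wh.shape[0], 1))       # initial hidden state h_0 = 0
    for x in X:                          # h_i = alpha(Wx x_i + Wh h_{i-1})
        h = alpha(Wx @ x + Wh @ h)
    return Wc @ h                        # context vector y_c, created from h_c

# Toy usage: "I am a boy [EOS]" as five random 4-dimensional embeddings (illustrative only)
d, e, c = 4, 8, 8
rng = np.random.default_rng(0)
X = [rng.normal(size=(d, 1)) for _ in range(5)]
y_c = encode(X, rng.normal(size=(e, d)), rng.normal(size=(e, e)), rng.normal(size=(c, e)))
```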

Decoder

A decoder is conditioned on the context vector, which allows it to generate an output sequence contextually relevant to the input, often one token at a time.


Let $Y = [y_1, \ldots, y_m, y_t]$ be an output sequence, where $y_i \in \mathbb{R}^{o \times 1}$ is the $i$'th word in the sequence, and $y_t$ is an artificial token to indicate the end of the sequence. To generate the output sequence, the decoder defines two functions, $f$ and $g$:

$f$ takes the previous output $y_{i-1}$ and its hidden state $s_{i-1}$, and returns a hidden state $s_i \in \mathbb{R}^{e \times 1}$ such that $f(y_{i-1}, s_{i-1}) = \alpha(W^y y_{i-1} + W^s s_{i-1}) = s_i$, where $W^y \in \mathbb{R}^{e \times o}$, $W^s \in \mathbb{R}^{e \times e}$, and $\alpha$ is an activation function.

$g$ takes the hidden state $s_i$ and returns an output $y_i \in \mathbb{R}^{o \times 1}$ such that $g(s_i) = W^o s_i = y_i$, where $W^o \in \mathbb{R}^{o \times e}$.

Note that the initial hidden state $s_1$ is created by considering only the context vector $y_c$, such that the first output $y_1$ is solely predicted by the context in the input sequence. However, the prediction of every subsequent output $y_i$ is conditioned on both the previous output $y_{i-1}$ and its hidden state $s_{i-1}$. Finally, the decoder stops generating output when it predicts the end-of-sequence token $y_t$.

In some variations of the decoder, the initial hidden state $s_1$ is created by considering both $y_c$ and $h_c$ [1].
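To make the recurrence concrete, below is a minimal NumPy sketch of a single decoder step using the two functions $f$ and $g$ defined above, with $\tanh$ assumed for the activation $\alpha$. The exact way $s_1$ is derived from the context vector is not spelled out in the text, so the `init_state` helper and its projection `Wc1` are hypothetical.

```python
import numpy as np

def decoder_step(y_prev, s_prev, Wy, Ws, Wo, alpha=np.tanh):
    """One decoder step following the definitions of f and g above.

    y_prev : (o x 1) previous output y_{i-1}
    s_prev : (e x 1) previous hidden state s_{i-1}
    Wy     : (e x o), Ws : (e x e), Wo : (o x e)
    """
    s = alpha(Wy @ y_prev + Ws @ s_prev)   # f(y_{i-1}, s_{i-1}) = alpha(W^y y_{i-1} + W^s s_{i-1}) = s_i
    y = Wo @ s                             # g(s_i) = W^o s_i = y_i
    return y, s

# Assumed initialization (left unspecified in the text): derive s_1 from the
# context vector y_c alone, using a hypothetical projection Wc1 in R^{e x c}.
def init_state(y_c, Wc1, alpha=np.tanh):
    return alpha(Wc1 @ y_c)                # s_1 depends only on the context vector
```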

Figure 3 illustrates a decoder example that takes the context vector $y_c$ and generates the output sequence, "나(I) +는(SBJ) 소년(boy) +이다(am)", terminated by the end-of-sequence token "[EOS]", which translates the input sequence from English to Korean:

Figure 3 - A decoder example.

The likelihood of the current output $y_i$ can be calculated as:

$P(y_i \mid \{y_c\} \cup \{y_1, \ldots, y_{i-1}\}) = q(y_c, y_{i-1}, s_{i-1})$

where $q$ is a function that takes the context vector $y_c$, the previous output $y_{i-1}$, and its hidden state $s_{i-1}$, and returns the probability of $y_i$. Then, the maximum likelihood of the output sequence can be estimated as follows ($y_0 = s_0 = \mathbf{0}$):

$P(Y) = \prod_{i=1}^{m+1} q(y_c, y_{i-1}, s_{i-1})$
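As a rough illustration of how this sequence probability could be computed in practice, the sketch below assumes that $q$ is realized as a softmax over the decoder scores $W^o s_i$ (the text leaves $q$ abstract), decodes greedily until the end-of-sequence token is predicted (with a maximum-length cutoff, since that prediction is not guaranteed), and multiplies the per-step probabilities. The function and variable names are illustrative.

```python
import numpy as np

def softmax(z):
    """Stable softmax over a (V x 1) score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(s1, Wy, Ws, Wo, vocab, eos_id, max_len=50, alpha=np.tanh):
    """Greedy decoding from the initial state s_1 (derived from the context vector).

    vocab  : (V x o) matrix whose t'th row is the embedding fed back for token t
    eos_id : index of the [EOS] token
    Returns the generated token ids and P(Y) as a product of per-step probabilities.
    """
    s, tokens, log_prob = s1, [], 0.0
    for _ in range(max_len):                 # cap the length: [EOS] is not guaranteed
        p = softmax(Wo @ s)                  # assumed realization of q(y_c, y_{i-1}, s_{i-1})
        t = int(p.argmax())                  # greedy choice of the next token
        tokens.append(t)
        log_prob += float(np.log(p[t, 0]))   # accumulate log P(y_i | y_c, y_1, ..., y_{i-1})
        if t == eos_id:                      # stop once the end-of-sequence token is predicted
            break
        y = vocab[t].reshape(-1, 1)          # previous output y_i fed back to the next step
        s = alpha(Wy @ y + Ws @ s)           # f(y_i, s_i) = s_{i+1}
    return tokens, float(np.exp(log_prob))
```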

The maximum likelihood estimation of the output sequence above accounts for the end-of-sequence token $y_t$. What are the benefits of incorporating this artificial token when estimating the sequence probability?

The decoder mentioned above does not guarantee the generation of the end-of-sequence token at any step. What potential issues can arise from this?

References

1. Sequence to Sequence Learning with Neural Networks, Sutskever et al., NeurIPS, 2014.
2. Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al., ICLR, 2015.