Research Practicum in Artificial Intelligence
Jinho D. Choi
Approach

Model Design

Encoder Challenge

Given the input text $W = \{w_1, \ldots, w_n\}$ where $w_i$ is the $i$'th token in $W$, a contextualized encoder (e.g., BERT) takes $W$ and generates an embedding $e_i \in \mathbb{R}^{1 \times d}$ for every token $w_i \in W$ using $w_i$ as well as its context. The challenge is that this encoder can take only up to $m$ tokens, so it cannot handle any input where $n > m$.
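As a concrete illustration, the minimal sketch below (assuming the Hugging Face `transformers` package and the `bert-base-uncased` checkpoint, whose limit is $m = 512$ tokens) produces one contextualized embedding per token:

```python
# Minimal sketch, assuming the transformers and torch packages are installed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

text = "Contextualized encoders embed every token using its surrounding context."
inputs = tokenizer(text, return_tensors="pt")   # token IDs for w_1, ..., w_n
with torch.no_grad():
    outputs = encoder(**inputs)

E = outputs.last_hidden_state[0]                # one e_i per token, shape (n, d)
print(E.shape, tokenizer.model_max_length)      # the second value is the limit m (512)
```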

What are the ways to handle arbitrarily large input using a contextualized encoder?

Baseline

One popular method is called the "Sliding Window", which splits the input into multiple blocks of text, generates embeddings for each block separately, and merges them at the end.

Let $W = W_1 \cup \cdots \cup W_k$ where $W_h = \{w_{(h-1)m+1}, \ldots, w_{hm}\}$ if $hm < n$; otherwise, $W_h = \{w_{(h-1)m+1}, \ldots, w_n\}$, such that $(k-1)m < n \leq km$. Then, the encoder takes each $W_h$ and generates $E_h = \{e_{(h-1)m+1}, \ldots, e_{hm}\}$ for every token in $W_h$. Finally, the embedding matrix $E \in \mathbb{R}^{n \times d}$ is created by sequentially stacking the embeddings in $E_1, \ldots, E_k$.
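A minimal sketch of this baseline, assuming a hypothetical `encode_block` callable that maps a block of at most $m$ tokens to its embeddings:

```python
import numpy as np

def sliding_window_encode(tokens, encode_block, m):
    """Sliding-window baseline: split the input into consecutive blocks of at
    most m tokens, encode each block independently, and stack the results.

    `encode_block` is a hypothetical callable that maps a list of tokens to an
    array of shape (len(block), d).
    """
    blocks = [tokens[i:i + m] for i in range(0, len(tokens), m)]        # W_1, ..., W_k
    embeddings = [np.asarray(encode_block(block)) for block in blocks]  # E_1, ..., E_k
    return np.vstack(embeddings)                                        # E with shape (n, d)
```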

What are the potential issues with this baseline method?

With the baseline method, tokens near the edges of each block do not have enough context to receive high-quality embeddings, because the encoder cannot see the adjacent blocks when encoding them.

Advanced (Exercise)

Modify the baseline method so that each block shares overlapping tokens with its neighboring blocks (both front and back). Once all blocks are encoded, each overlapping token has two embeddings; average those two embeddings and use the result as the final embedding for that token.
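The sketch below shows one possible way to implement this exercise, assuming the same hypothetical `encode_block` callable as above and an `overlap` no larger than half the block size (so that any token appears in at most two blocks):

```python
import numpy as np

def overlapping_window_encode(tokens, encode_block, m, overlap):
    """Possible sketch for the exercise: consecutive blocks share `overlap`
    tokens, and every shared token's final embedding is the average of its
    block-level embeddings. Assumes 0 <= overlap <= m // 2."""
    n = len(tokens)
    stride = m - overlap                       # how far each new block advances
    sums, counts = None, np.zeros(n)           # running sums and coverage counts per token
    for start in range(0, n, stride):
        block = tokens[start:start + m]
        E_h = np.asarray(encode_block(block))  # shape: (len(block), d)
        if sums is None:
            sums = np.zeros((n, E_h.shape[1]))
        sums[start:start + len(block)] += E_h
        counts[start:start + len(block)] += 1  # overlapping tokens are covered twice
        if start + m >= n:                     # the last block reaches the end of the input
            break
    return sums / counts[:, None]              # averaged embeddings, shape (n, d)
```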

Decoder Challenge

In a sequence-to-sequence model (aka an encoder-decoder model), a decoder takes an embedding matrix $E \in \mathbb{R}^{m \times d}$ and predicts which token should come next. This embedding matrix is often also bounded by a certain size, which becomes an issue when the matrix grows larger than $m$ rows (for the case above, $E \in \mathbb{R}^{n \times d}$ where $n > m$). One common method to handle this issue is to use an attention matrix for dimensionality reduction as follows:

The embedding matrix $E \in \mathbb{R}^{n \times d}$ is first transposed to $E^T \in \mathbb{R}^{d \times n}$ and then multiplied by an attention matrix $A \in \mathbb{R}^{n \times m}$ such that $E^T \cdot A \rightarrow D \in \mathbb{R}^{d \times m}$. Finally, the transpose of $D$, that is $D^T \in \mathbb{R}^{m \times d}$, gets fed into the decoder.
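A minimal numpy sketch of this reduction; the attention matrix $A$ below is a random placeholder standing in for whatever attention weights a real model would learn:

```python
import numpy as np

n, m, d = 1000, 512, 768      # n > m: more token embeddings than the decoder can accept
E = np.random.randn(n, d)     # embedding matrix produced by the encoder
A = np.random.randn(n, m)     # placeholder attention matrix (learned in practice)

D = E.T @ A                   # (d, n) @ (n, m) -> (d, m)
decoder_input = D.T           # (m, d), now within the decoder's size limit
print(decoder_input.shape)    # (512, 768)
```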

Would the following method be equivalent to the above method?

An attention matrix $A \in \mathbb{R}^{m \times n}$ is multiplied by the embedding matrix $E \in \mathbb{R}^{n \times d}$ such that $A \cdot E \rightarrow D \in \mathbb{R}^{m \times d}$. Finally, $D$ gets fed into the decoder.
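One way to reason about this question is to check the underlying identity $(E^T A)^T = A^T E$ numerically; the sketch below assumes the second method's $m \times n$ attention matrix is the transpose of the first method's $A$:

```python
import numpy as np

n, m, d = 1000, 512, 768
E = np.random.randn(n, d)
A = np.random.randn(n, m)          # attention matrix from the first method

first = (E.T @ A).T                # D^T from the first method, shape (m, d)
second = A.T @ E                   # second method, using A^T as its (m, n) attention matrix
print(np.allclose(first, second))  # True: (E^T A)^T = A^T E
```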
