Model Design

Encoder Challenge

Given the input text $W = \{w_1, \ldots, w_n\}$, where $w_i$ is the $i$'th token in $W$, a contextualized encoder (e.g., BERT) takes $W$ and generates an embedding $e_i \in \mathbb{R}^{1 \times d}$ for every token $w_i \in W$ using $w_i$ as well as its context. The challenge is that such an encoder can take only up to $m$ tokens, so it cannot handle any input where $n > m$.
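The following is a minimal sketch of this limit, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint (neither is prescribed by the text); it only counts tokens to show when an input exceeds the encoder's bound.

```python
# Sketch: checking when an input exceeds the encoder's token limit.
# Assumes the Hugging Face `transformers` package and `bert-base-uncased`.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

m = encoder.config.max_position_embeddings   # 512 for BERT-base
text = "a very long document " * 1000

# Tokenizing without truncation yields n token ids; a single forward pass
# over all of them is impossible once n > m.
token_ids = tokenizer(text, truncation=False)["input_ids"]
n = len(token_ids)
print(f"n = {n}, m = {m}, exceeds the encoder's limit: {n > m}")
```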


What are the ways to handle arbitrarily large input using a contextualized encoder?

Baseline

One popular method is called the "Sliding Window", which splits the input into multiple blocks of text, generates embeddings for each block separately, and merges them at the end.

Let $W = W_1 \cup \cdots \cup W_k$ where $W_h = \{w_{(h-1)m+1}, \ldots, w_{hm}\}$ if $hm < n$; otherwise, $W_h = \{w_{(h-1)m+1}, \ldots, w_{n}\}$, such that $km \geq n$. Then, the encoder takes each $W_h$ and generates $E_h = \{e_{(h-1)m+1}, \ldots, e_{hm}\}$ for every token in $W_h$. Finally, the embedding matrix $E \in \mathbb{R}^{n \times d}$ is created by sequentially stacking all the embeddings generated from $W_{\forall h}$.
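A sketch of this construction under the definitions above; `encode_block` is an assumed stand-in for any contextualized encoder that maps a block of at most $m$ tokens to a $(|W_h|, d)$ NumPy array, not a name from the text.

```python
# Sketch of the sliding-window baseline.
# `encode_block` maps a list of at most m tokens to a (len(block), d) array.
import numpy as np

def sliding_window_encode(tokens, encode_block, m):
    # Split W into W_1, ..., W_k, where every block has m tokens
    # except possibly the last one.
    blocks = [tokens[i:i + m] for i in range(0, len(tokens), m)]
    # Encode each block independently: E_h for W_h.
    embeddings = [encode_block(block) for block in blocks]
    # Stack E_1, ..., E_k sequentially into E in R^{n x d}.
    return np.vstack(embeddings)
```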

What are the potential issues with this baseline method?

The baseline method does not provide enough context to generate high-quality embeddings for tokens near the edges of each block.

Advanced (Exercise)

Modify the baseline method such that each block shares overlapping tokens with its surrounding blocks (both front and back). Once all blocks are encoded, each overlapping token has two embeddings; average those two embeddings and use the result as the final embedding for that token.
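One possible sketch of this exercise, assuming an overlap of `overlap` tokens between consecutive blocks with `overlap <= m // 2` (so a token is shared by at most two blocks); `encode_block`, `overlap`, and `d` (the embedding dimensionality) are illustrative names, not part of the text.

```python
# Sketch of the overlapping-window variant: consecutive blocks share
# `overlap` tokens, and shared tokens receive the average of their two
# embeddings. Assumes overlap <= m // 2 and an `encode_block` function
# that maps at most m tokens to a (len(block), d) NumPy array.
import numpy as np

def overlapping_window_encode(tokens, encode_block, m, d, overlap):
    n = len(tokens)
    stride = m - overlap                   # step between block starts
    total = np.zeros((n, d))               # running sum of embeddings per token
    count = np.zeros((n, 1))               # 1 for regular tokens, 2 for overlapped ones
    for start in range(0, n, stride):
        block = tokens[start:start + m]
        emb = encode_block(block)          # (len(block), d)
        total[start:start + len(block)] += emb
        count[start:start + len(block)] += 1
        if start + m >= n:                 # the last block already reaches w_n
            break
    return total / count                   # average where two blocks overlap
```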

Decoder Challenge

In a sequence-to-sequence model (aka, an encoder-decoder model), a decoder takes an embedding matrix $E \in \mathbb{R}^{m \times d}$ and predicts what token should come next. It is often the case that this embedding matrix is also bounded by a certain size, which becomes an issue when the size of the matrix becomes larger than $m$ (for the case above, $E \in \mathbb{R}^{n \times d}$ where $n > m$). One common method to handle this issue is to use an attention matrix for dimensionality reduction as follows:

The embedding matrix $E \in \mathbb{R}^{n \times d}$ is first transposed to $E^T \in \mathbb{R}^{d \times n}$, then multiplied by an attention matrix $A \in \mathbb{R}^{n \times m}$ such that $E^T \cdot A \rightarrow D \in \mathbb{R}^{d \times m}$. Finally, the transpose of $D$, that is $D^T \in \mathbb{R}^{m \times d}$, gets fed into the decoder.
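A toy shape check of this reduction with illustrative sizes ($n = 1000$, $m = 512$, $d = 768$; the numbers are not from the text):

```python
# Toy shape check of the reduction E^T . A -> D, then feeding D^T.
import numpy as np

n, m, d = 1000, 512, 768
E = np.random.randn(n, d)     # embedding matrix, n > m
A = np.random.randn(n, m)     # attention matrix

D = E.T @ A                   # (d, n) @ (n, m) -> (d, m)
decoder_input = D.T           # (m, d), now within the decoder's bound
print(decoder_input.shape)    # (512, 768)
```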


Would the following method be equivalent to the above method?

An attention matrix $A \in \mathbb{R}^{m \times n}$ is multiplied by the embedding matrix $E \in \mathbb{R}^{n \times d}$ such that $A \cdot E \rightarrow D \in \mathbb{R}^{m \times d}$. Finally, $D$ gets fed into the decoder.
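One way to probe the question numerically is to compare the two routes when the second attention matrix is chosen as the transpose of the first (a choice made here purely for illustration, not stated in the text):

```python
# Compare (E^T . A)^T against A' . E with A' = A^T.
import numpy as np

n, m, d = 1000, 512, 768
E = np.random.randn(n, d)
A = np.random.randn(n, m)

route_1 = (E.T @ A).T         # first method: transpose, multiply, transpose back
route_2 = A.T @ E             # second method with A' = A^T in R^{m x n}
print(np.allclose(route_1, route_2))   # True, since (E^T A)^T = A^T E
```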
