
Transformer

Self Attention

Self-attention, also known as scaled dot-product attention, is a fundamental mechanism used in deep learning and natural language processing, particularly in transformer-based models like BERT, GPT, and their variants. Self-attention is a crucial component that enables these models to understand relationships and dependencies between words or tokens in a sequence.

Here's an overview of self-attention:

  1. The Motivation:

    • The primary motivation behind self-attention is to capture dependencies and relationships between different elements within a sequence, such as words in a sentence or tokens in a document.

    • It allows the model to consider the context of each element based on its relationships with other elements in the sequence.

  2. The Mechanism:

    • Self-attention computes a weighted sum of the input elements (usually vectors) for each element in the sequence. This means that each element can attend to and be influenced by all other elements.

    • The key idea is to learn weights (attention scores) that reflect how much focus each element should give to the others. These weights are often referred to as "attention weights."

  3. Attention Weights:

    • Attention weights are calculated using a similarity measure (typically the dot product) between a query vector and a set of key vectors.

    • The resulting attention weights are then used to take a weighted sum of the value vectors. This weighted sum forms the output for each element.

  4. Scaling and Softmax:

    • To stabilize the gradients during training, the dot products are often scaled by the square root of the dimension of the key vectors.

    • After scaling, a softmax function is applied to obtain the attention weights. The softmax ensures that the weights are normalized and sum to 1.

  5. Multi-Head Attention:

    • Many models use multi-head attention, where multiple sets of queries, keys, and values are learned. Each set of attention weights captures different aspects of relationships in the sequence.

    • These multiple sets of attention results are concatenated and linearly transformed to obtain the final output.

  6. Applications:

    • Self-attention is widely used in transformer-based models for various NLP tasks, including machine translation, text classification, text generation, and more.

    • It is also applied in computer vision tasks, such as image captioning, where it can capture relationships between different parts of an image.

Self-attention is a powerful mechanism because it allows the model to focus on different elements of the input sequence depending on the context. This enables the model to capture long-range dependencies, word relationships, and nuances in natural language, making it a crucial innovation in deep learning for NLP and related fields.
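Below is a minimal NumPy sketch of scaled dot-product self-attention over a toy document matrix. The projection matrices W_q, W_k, W_v, the dimensions, and the random inputs are illustrative assumptions for this sketch, not part of any particular library or model.

    import numpy as np

    def softmax(x, axis=-1):
        # Subtract the max for numerical stability before exponentiating.
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(W, W_q, W_k, W_v):
        """Scaled dot-product self-attention.
        W: (n, d) document matrix, one row per word embedding.
        W_q, W_k, W_v: (d, d_k) learned projections.
        Returns the (n, d_k) attended output and the (n, n) attention weights."""
        Q, K, V = W @ W_q, W @ W_k, W @ W_v           # queries, keys, values
        scores = Q @ K.T / np.sqrt(K.shape[-1])       # (n, n) dot products, scaled by sqrt(d_k)
        A = softmax(scores, axis=-1)                  # each row sums to 1
        return A @ V, A                               # weighted sum of the value vectors

    n, d, d_k = 5, 8, 4
    rng = np.random.default_rng(0)
    W = rng.normal(size=(n, d))                       # toy document of 5 word embeddings
    W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))
    output, weights = self_attention(W, W_q, W_k, W_v)
    print(output.shape, weights.shape)                # (5, 4) (5, 5)

Each row of the returned weight matrix is a probability distribution describing how much the corresponding word attends to every other word in the document.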

Q9: How does self-attention operate given an embedding matrix $\mathrm{W} \in \mathbb{R}^{n \times d}$ representing a document, where $n$ is the number of words and $d$ is the embedding dimension?

Multi-head Attention

Multi-head attention is a crucial component in transformer-based models, such as BERT, GPT, and their variants. It extends the basic self-attention mechanism to capture different types of relationships and dependencies in a sequence. Here's an explanation of multi-head attention:

  1. Motivation:

    • The primary motivation behind multi-head attention is to enable a model to focus on different parts of the input sequence when capturing dependencies and relationships.

    • It allows the model to learn multiple sets of attention patterns, each suited to capturing different kinds of associations in the data.

  2. Mechanism:

    • In multi-head attention, the input sequence (e.g., a sentence or document) is processed by multiple "attention heads."

    • Each attention head independently computes attention scores and weighted sums for the input sequence, resulting in multiple sets of output values.

    • These output values from each attention head are then concatenated and linearly transformed to obtain the final multi-head attention output.

  3. Learning Different Dependencies:

    • Each attention head can learn to attend to different aspects of the input sequence. For instance, one head may focus on syntactic relationships, another on semantic relationships, and a third on longer-range dependencies.

    • By having multiple heads, the model can learn to capture a variety of dependencies, making it more versatile and robust.

  4. Multi-Head Processing:

    • In each attention head, there are three main components: queries, keys, and values. These are linearly transformed projections of the input data.

    • For each head, queries are compared to keys to compute attention weights, which are then used to weight the values.

    • Each attention head performs these calculations independently, allowing it to learn a unique set of attention patterns.

  5. Concatenation and Linear Transformation:

    • The output values from each attention head are concatenated into a single tensor.

    • A linear transformation is applied to this concatenated output to obtain the final multi-head attention result. The linear transformation helps the model combine information from all heads appropriately.

  6. Applications:

    • Multi-head attention is widely used in NLP tasks, such as text classification, machine translation, and text generation.

    • It allows models to capture diverse dependencies and relationships within text data, making it highly effective in understanding and generating natural language.

Multi-Head Attention. Figure 2 from [1].

Multi-head attention has proven to be a powerful tool in transformer architectures, enabling models to handle complex and nuanced relationships within sequences effectively. It contributes to the remarkable success of transformer-based models in a wide range of NLP tasks.

Q10: Given an embedding matrix $\mathrm{W} \in \mathbb{R}^{n \times d}$ representing a document, how does multi-head attention function? What advantages does multi-head attention offer over self-attention?
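As a rough sketch of the mechanism described above, the single-head computation shown earlier can be repeated once per head and the per-head outputs concatenated and projected. The head count, dimensions, and weight shapes below are illustrative assumptions.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention(W, heads, W_o):
        """W: (n, d) document matrix; heads: one (W_q, W_k, W_v) triple per head,
        each of shape (d, d_k); W_o: (h*d_k, d) output projection. Returns (n, d)."""
        outputs = []
        for W_q, W_k, W_v in heads:
            Q, K, V = W @ W_q, W @ W_k, W @ W_v
            A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # per-head attention weights
            outputs.append(A @ V)                                 # per-head weighted values
        return np.concatenate(outputs, axis=-1) @ W_o             # concatenate heads, then project

    n, d, h = 5, 8, 2
    d_k = d // h
    rng = np.random.default_rng(0)
    W = rng.normal(size=(n, d))
    heads = [tuple(rng.normal(size=(d, d_k)) for _ in range(3)) for _ in range(h)]
    W_o = rng.normal(size=(h * d_k, d))
    print(multi_head_attention(W, heads, W_o).shape)              # (5, 8)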

Transformer Architecture

The Transformer architecture. Figure 1 from [1].

Q11: What are the outputs of each layer in the Transformer model? How do the embeddings learned in the upper layers of the Transformer differ from those in the lower layers?

BERT

BERT training mechanisms. Figure 1 from [2].

Q12: How is a Masked Language Model used in training a language model with a transformer?

Q13: How can one train a document-level embedding using a transformer?

Q14: What are the advantages of embeddings generated by BERT compared to those generated by Word2Vec?

References

  1. Attention Is All You Need, Vaswani et al., NIPS, 2017.

  2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., NAACL, 2019.


    Recurrent Neural Networks

    Update: 2023-10-26

    A Recurrent Neural Network (RNN) [1] maintains hidden states of previous inputs and uses them to predict outputs, allowing it to model temporal dependencies in sequential data.

The hidden state is a vector representing the network's internal memory of previous time steps. It captures information from those steps, influences the predictions made at the current time step, and is updated at each time step as the RNN processes a sequence of inputs.

RNN for Sequence Tagging

Given an input sequence $X = [x_1, \ldots, x_n]$ where $x_i \in \mathbb{R}^{d \times 1}$, an RNN for sequence tagging defines two functions, $f$ and $g$:

• $f$ takes the current input $x_i \in X$ and the hidden state $h_{i-1}$ of the previous input $x_{i-1}$, and returns a hidden state $h_i \in \mathbb{R}^{e \times 1}$ such that $f(x_i, h_{i-1}) = \alpha(W^x x_i + W^h h_{i-1}) = h_i$, where $W^x \in \mathbb{R}^{e \times d}$, $W^h \in \mathbb{R}^{e \times e}$, and $\alpha$ is an activation function.

• $g$ takes the hidden state $h_i$ and returns an output $y_i \in \mathbb{R}^{o \times 1}$ such that $g(h_i) = W^o h_i = y_i$, where $W^o \in \mathbb{R}^{o \times e}$.

Figure 1 shows an example of an RNN for sequence tagging, such as part-of-speech tagging:

Figure 1 - An example of an RNN and its application in part-of-speech (POS) tagging.

Notice that the output $y_1$ for the first input $x_1$ is predicted by considering only the input itself such that $f(x_1, \mathbf{0}) = \alpha(W^x x_1) = h_1$ (e.g., the POS tag of the first word "I" is predicted solely using that word). However, the output $y_i$ for every other input $x_i$ is predicted by considering both $x_i$ and the hidden state $h_{i-1}$, an intermediate representation created explicitly for the task. This enables RNNs to capture sequential information that Feedforward Neural Networks cannot.
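The following is a minimal NumPy sketch of the recurrence defined above. The activation function $\alpha$ is assumed to be tanh, and the dimensions, random weights, and toy input are illustrative assumptions.

    import numpy as np

    d, e, o, n = 4, 3, 2, 5                      # embedding, hidden, output sizes; sequence length
    rng = np.random.default_rng(0)
    Wx = rng.normal(size=(e, d))                 # W^x in R^{e x d}
    Wh = rng.normal(size=(e, e))                 # W^h in R^{e x e}
    Wo = rng.normal(size=(o, e))                 # W^o in R^{o x e}

    def f(x_i, h_prev):
        # h_i = alpha(W^x x_i + W^h h_{i-1}), with tanh as the activation alpha
        return np.tanh(Wx @ x_i + Wh @ h_prev)

    def g(h_i):
        # y_i = W^o h_i
        return Wo @ h_i

    X = [rng.normal(size=(d, 1)) for _ in range(n)]   # toy input sequence x_1 .. x_n
    h = np.zeros((e, 1))                              # h_0 = 0
    Y = []
    for x_i in X:                                     # one output y_i per input x_i (sequence tagging)
        h = f(x_i, h)
        Y.append(g(h))
    print(len(Y), Y[0].shape)                         # 5 (2, 1)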

Q6: How does each hidden state in an RNN encode information relevant to sequence tagging tasks?

RNN for Text Classification

Unlike sequence tagging, where the RNN predicts a sequence of output $Y = [y_1, \ldots, y_n]$ for the input $X = [x_1, \ldots, x_n]$, an RNN designed for text classification predicts only one output $y$ for the entire input sequence such that:

• Sequence Tagging: $\text{RNN}_{st}(X) \rightarrow Y$

• Text Classification: $\text{RNN}_{tc}(X) \rightarrow y$

To accomplish this, a common practice is to predict the output $y$ from the last hidden state $h_n$ using the function $g$. Figure 2 shows an example of an RNN for text classification, such as sentiment analysis:

Figure 2 - An example of an RNN and its application in sentiment analysis.
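As a brief sketch under the same assumptions as the sequence-tagging example above, only the final hidden state $h_n$ is passed to $g$ to produce a single document-level output:

    import numpy as np

    d, e, o, n = 4, 3, 2, 5
    rng = np.random.default_rng(1)
    Wx, Wh, Wo = rng.normal(size=(e, d)), rng.normal(size=(e, e)), rng.normal(size=(o, e))
    X = [rng.normal(size=(d, 1)) for _ in range(n)]

    h = np.zeros((e, 1))
    for x_i in X:                         # same recurrence as in sequence tagging
        h = np.tanh(Wx @ x_i + Wh @ h)
    y = Wo @ h                            # a single output predicted from the last hidden state h_n
    print(y.shape)                        # (2, 1)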

Q7: In text classification tasks, what specific information is captured by the final hidden state of an RNN?

Bidirectional RNN

The RNN for sequence tagging above does not consider the words that follow the current word when predicting the output. This limitation can significantly impact model performance since contextual information following the current word can be crucial.

For example, let us consider the word "early" in the following two sentences:

• They are early birds -> "early" is an adjective.

• They are early today -> "early" is an adverb.

The POS tags of "early" depend on the following words, "birds" and "today", so making the correct predictions becomes challenging without the following context.

To overcome this challenge, a Bidirectional RNN has been proposed [2] that considers both forward and backward directions, creating twice as many hidden states to capture a more comprehensive context. Figure 3 illustrates a bidirectional RNN for sequence tagging:

Figure 3 - An overview of a bidirectional RNN.

For every $x_i$, the hidden states $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are created by considering $\overrightarrow{h}_{i-1}$ and $\overleftarrow{h}_{i+1}$, respectively. The function $g$ takes both $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ and returns an output $y_i \in \mathbb{R}^{o \times 1}$ such that $g(\overrightarrow{h}_i, \overleftarrow{h}_i) = W^o (\overrightarrow{h}_i \oplus \overleftarrow{h}_i) = y_i$, where $(\overrightarrow{h}_i \oplus \overleftarrow{h}_i) \in \mathbb{R}^{2e \times 1}$ is a concatenation of the two hidden states and $W^o \in \mathbb{R}^{o \times 2e}$.
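A minimal sketch of the bidirectional case, under the same assumptions as the earlier RNN sketch: two independent recurrences run in opposite directions, and their hidden states are concatenated before applying $W^o$.

    import numpy as np

    d, e, o, n = 4, 3, 2, 5
    rng = np.random.default_rng(0)
    Wx_f, Wh_f = rng.normal(size=(e, d)), rng.normal(size=(e, e))   # forward-direction weights
    Wx_b, Wh_b = rng.normal(size=(e, d)), rng.normal(size=(e, e))   # backward-direction weights
    Wo = rng.normal(size=(o, 2 * e))                                # W^o in R^{o x 2e}
    X = [rng.normal(size=(d, 1)) for _ in range(n)]

    def run(inputs, Wx, Wh):
        h, states = np.zeros((e, 1)), []
        for x in inputs:
            h = np.tanh(Wx @ x + Wh @ h)
            states.append(h)
        return states

    fwd = run(X, Wx_f, Wh_f)                # forward hidden states
    bwd = run(X[::-1], Wx_b, Wh_b)[::-1]    # backward hidden states, re-aligned to the input order
    Y = [Wo @ np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]  # y_i from the concatenated states
    print(len(Y), Y[0].shape)               # 5 (2, 1)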


    Q8: What are the advantages and limitations of implementing bidirectional RNNs for text classification and sequence tagging tasks?

    Advanced Topics

    • Long Short-Term Memory (LSTM) Networks [3-5]

    • Gated Recurrent Units (GRUs) [6-7]

References

  1. Finding Structure in Time, Elman, Cognitive Science, 14(2), 1990.

  2. Bidirectional Recurrent Neural Networks, Schuster and Paliwal, IEEE Transactions on Signal Processing, 45(11), 1997.

  3. Long Short-Term Memory, Hochreiter and Schmidhuber, Neural Computation, 9(8), 1997 (PDF available at ResearchGate).

  4. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, Ma and Hovy, ACL, 2016.*

  5. Contextual String Embeddings for Sequence Labeling, Akbik et al., COLING, 2018.*

  6. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Cho et al., EMNLP, 2014.*

  7. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, Chung et al., NeurIPS Workshop on Deep Learning and Representation Learning, 2014.*


    Contextual Encoding

Contextual representations are representations of words, phrases, or sentences within the context of the surrounding text. Unlike word embeddings from Word2Vec, where each word is represented by a fixed vector regardless of its context, contextual representations capture the meaning of a word or sequence of words based on their context in a particular document. As a result, the representation of a word can vary depending on the words surrounding it, allowing for a more nuanced understanding of meaning in natural language processing tasks.

Contents

  • Subword Tokenization

  • Recurrent Neural Networks

  • Transformer

  • Encoder-Decoder Framework

References

  • Attention Is All You Need, Vaswani et al., Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2017.

Q1: How can document-level vector representations be derived from word embeddings?

Q2: How did the embedding representation facilitate the adoption of Neural Networks in Natural Language Processing?

Q3: How are embedding representations for Natural Language Processing fundamentally different from ones for Computer Vision?


    Homework

    HW5: Contextual Encoding

    Quiz

    1. The EM algorithm stands as a classic method in unsupervised learning. What are the advantages of unsupervised learning over supervised learning, and which tasks align well with unsupervised learning?

2. What are the disadvantages of using BPE-based tokenization instead of rule-based tokenization? What are the potential issues with the implementation of BPE above?

3. How does self-attention operate given an embedding matrix $\mathrm{W} \in \mathbb{R}^{n \times d}$ representing a document, where $n$ is the number of words and $d$ is the embedding dimension?

    4. Given the same embedding matrix as in question #3, how does multi-head attention function? What advantages does multi-head attention offer over self-attention?

    5. What are the outputs of each layer in the Transformer model? How do the embeddings learned in the upper layers of the Transformer differ from those in the lower layers?

    6. How is a Masked Language Model used in training a language model with a transformer?

    7. How can one train a document-level embedding using a transformer?

8. What are the advantages of embeddings generated by transformers compared to those generated by Word2Vec?

    9. Neural networks gained widespread popularity for training natural language processing models since 2013. What factors enabled this popularity, and how do they differ from traditional NLP methods?

    10. Recent large language models like ChatGPT or Claude are trained quite differently from traditional NLP models. What are the main differences, and what factors enabled their development?

References

  • Attention Is All You Need, Vaswani et al., NIPS, 2017.

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., NAACL, 2019.


    Encoder-Decoder Framework

    Updated 2023-10-27

    The Encoder-Decoder Framework is commonly used for solving sequence-to-sequence tasks, where it takes an input sequence, processes it through an encoder, and produces an output sequence. This framework consists of three main components: an encoder, a context vector, and a decoder, as illustrated in Figure 1:

    Figure 1 - An overview of an encoder-decoder framework.

    Encoder

    An encoder processes an input sequence and creates a context vector that captures context from the entire sequence and serves as a summary of the input.

Let $X = [x_1, \ldots, x_n, x_c]$ be an input sequence, where $x_i \in \mathbb{R}^{d \times 1}$ is the $i$'th word in the sequence and $x_c$ is an artificial token appended to indicate the end of the sequence. The encoder utilizes two functions, $f$ and $g$, which are defined in the same way as in the RNN for text classification. Notice that the end-of-sequence token $x_c$ is used to create an additional hidden state $h_c$, which in turn creates the context vector $y_c$.

Figure 2 shows an encoder example that takes the input sequence, "I am a boy", appended with the end-of-sequence token "[EOS]":

Figure 2 - An encoder example.
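The following is a minimal NumPy sketch of the encoder described above, reusing the tanh recurrence from the RNN pages. The [EOS] embedding, the weight shapes, and the projection used to produce the context vector are illustrative assumptions.

    import numpy as np

    d, e, o, n = 4, 3, 5, 4                             # embedding, hidden, output sizes; 4 input words
    rng = np.random.default_rng(0)
    Wx, Wh = rng.normal(size=(e, d)), rng.normal(size=(e, e))
    Wo = rng.normal(size=(o, e))                        # W^o maps a hidden state to an output
    X = [rng.normal(size=(d, 1)) for _ in range(n)]     # toy embeddings for "I am a boy"
    x_c = rng.normal(size=(d, 1))                       # embedding of the artificial end-of-sequence token

    h = np.zeros((e, 1))
    for x in X + [x_c]:                                 # the [EOS] token creates the extra hidden state h_c
        h = np.tanh(Wx @ x + Wh @ h)                    # same recurrence f as in the RNN pages
    y_c = Wo @ h                                        # context vector summarizing the entire input
    print(y_c.shape)                                    # (5, 1)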

Is it possible to derive the context vector from $x_n$ instead of $x_c$? What is the purpose of appending an extra token to indicate the end of the sequence?

    Decoder

    A decoder is conditioned on the context vector, which allows it to generate an output sequence contextually relevant to the input, often one token at a time.

Let $Y = [y_1, \ldots, y_m, y_t]$ be an output sequence, where $y_i \in \mathbb{R}^{o \times 1}$ is the $i$'th word in the sequence, and $y_t$ is an artificial token to indicate the end of the sequence. To generate the output sequence, the decoder defines two functions, $f$ and $g$:

• $f$ takes the previous output $y_{i-1}$ and its hidden state $s_{i-1}$, and returns a hidden state $s_i \in \mathbb{R}^{e \times 1}$ such that $f(y_{i-1}, s_{i-1}) = \alpha(W^y y_{i-1} + W^s s_{i-1}) = s_i$, where $W^y \in \mathbb{R}^{e \times o}$, $W^s \in \mathbb{R}^{e \times e}$, and $\alpha$ is an activation function.

• $g$ takes the hidden state $s_i$ and returns an output $y_i \in \mathbb{R}^{o \times 1}$ such that $g(s_i) = W^o s_i = y_i$, where $W^o \in \mathbb{R}^{o \times e}$.

Note that the initial hidden state $s_1$ is created by considering only the context vector $y_c$ such that the first output $y_1$ is solely predicted by the context in the input sequence. However, the prediction of every subsequent output $y_i$ is conditioned on both the previous output $y_{i-1}$ and its hidden state $s_{i-1}$. Finally, the decoder stops generating output when it predicts the end-of-sequence token $y_t$.

In some variations of the decoder, the initial hidden state $s_1$ is created by considering both $y_c$ and $h_c$ [1].

Figure 3 illustrates a decoder example that takes the context vector and generates the output sequence, "나(I) +는(SBJ) 소년(boy) +이다(am)", terminated by the end-of-sequence token "[EOS]", which translates the input sequence from English to Korean:

Figure 3 - A decoder example.
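Below is a minimal sketch of the decoder loop, assuming greedy argmax decoding over a toy output vocabulary of five token types, a random stand-in for the context vector, and a hard cap on the output length (since, as noted in the question below, the end-of-sequence token is not guaranteed to be generated).

    import numpy as np

    e, o, max_len = 3, 5, 10
    rng = np.random.default_rng(0)
    Wy, Ws = rng.normal(size=(e, o)), rng.normal(size=(e, e))   # W^y in R^{e x o}, W^s in R^{e x e}
    Wo = rng.normal(size=(o, e))                                # W^o in R^{o x e}
    y_c = rng.normal(size=(o, 1))                               # context vector from the encoder (random stand-in)
    EOS = o - 1                                                 # assumed index of the end-of-sequence token y_t

    def f(y_prev, s_prev):
        return np.tanh(Wy @ y_prev + Ws @ s_prev)               # s_i = alpha(W^y y_{i-1} + W^s s_{i-1})

    def g(s_i):
        return Wo @ s_i                                         # y_i = W^o s_i

    outputs, y_prev, s = [], y_c, np.zeros((e, 1))              # the first step is conditioned only on y_c
    for _ in range(max_len):                                    # cap the length: EOS is not guaranteed
        s = f(y_prev, s)
        y = g(s)
        outputs.append(int(y.argmax()))                         # greedy choice of the next token
        if outputs[-1] == EOS:
            break
        y_prev = y
    print(outputs)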


    The decoder mentioned above does not guarantee the generation of the end-of-sequence token at any step. What potential issues can arise from this?

The likelihood of the current output $y_i$ can be calculated as:

$P(y_i \mid \{y_c\} \cup \{y_1, \ldots, y_{i-1}\}) = q(y_c, y_{i-1}, s_{i-1})$

where $q$ is a function that takes the context vector $y_c$, the previous output $y_{i-1}$ and its hidden state $s_{i-1}$, and returns the probability of $y_i$. Then, the maximum likelihood of the output sequence can be estimated as follows ($y_0 = s_0 = \mathbf{0}$):

$P(Y) = \prod_{i=1}^{m+1} q(y_c, y_{i-1}, s_{i-1})$

The maximum likelihood estimation of the output sequence above accounts for the end-of-sequence token $y_t$. What are the benefits of incorporating this artificial token when estimating the sequence probability?

References

  1. Sequence to Sequence Learning with Neural Networks, Sutskever et al., NeurIPS, 2014.*

  2. Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al., ICLR, 2015.*


    Subword Tokenization

    Byte Pair Encoding (BPE) is a data compression algorithm that is commonly used in the context of subword tokenization for neural language models. BPE tokenizes text into smaller units, such as subword pieces or characters, to handle out-of-vocabulary words, reduce vocabulary size, and enhance the efficiency of language models.

Algorithm

The following describes the steps of BPE in terms of the EM algorithm:

    1. Initialization: Given a dictionary consisting of all words and their counts in a corpus, the symbol vocabulary is initialized by tokenizing each word into its most basic subword units, such as characters.

    2. Expectation: With the (updated) symbol vocabulary, it calculates the frequency of every symbol pair within the vocabulary.

    3. Maximization: Given all symbol pairs and their frequencies, it merges the top-k most frequent symbol pairs in the vocabulary.

    4. Steps 2 and 3 are repeated until meaningful sets of subwords are found for all words in the corpus.


    Q4: The EM algorithm stands as a classic method in unsupervised learning. What are the advantages of unsupervised learning over supervised learning, and which tasks align well with unsupervised learning?

Implementation

Let us consider a toy vocabulary:

    import collections
    import re

    from src.types import WordCount, PairCount

    word_counts = {
        'high': 12,
        'higher': 14,
        'highest': 10,
        'low': 12,
        'lower': 11,
        'lowest': 13
    }

First, we create the symbol vocabulary by inserting a space between every pair of adjacent characters and adding a special symbol [EoW] at the end to indicate the End of the Word:

    EOW = '[EoW]'

    def initialize(word_counts: WordCount) -> WordCount:
        # Split every word into single-character symbols followed by the [EoW] marker.
        return {' '.join(list(word) + [EOW]): count for word, count in word_counts.items()}

Next, we count the frequencies of all symbol pairs in the vocabulary:

    def expect(vocab: WordCount) -> PairCount:
        # Count every pair of adjacent symbols, weighted by the word's frequency.
        pairs = collections.defaultdict(int)

        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[symbols[i], symbols[i + 1]] += freq

        return pairs

Finally, we update the vocabulary by merging the most frequent symbol pair across all words:

    def maximize(vocab: WordCount, pairs: PairCount) -> WordCount:
        # Merge the most frequent pair wherever it appears as whole, space-separated symbols.
        best = max(pairs, key=pairs.get)
        p = re.compile(r'(?<!\S)' + re.escape(' '.join(best)) + r'(?!\S)')
        return {p.sub(''.join(best), word): freq for word, freq in vocab.items()}

The expect() and maximize() functions can be repeated for multiple iterations until the tokenization becomes reasonable:

    def bpe_vocab(word_counts: WordCount, max_iter: int) -> WordCount:
        vocab = initialize(word_counts)

        for i in range(max_iter):
            pairs = expect(vocab)
            vocab = maximize(vocab, pairs)
            # print(vocab)

        return vocab

    bpe_vocab(word_counts, 10)

When you uncomment the print(vocab) line in bpe_vocab(), you can see how the symbols are merged in each iteration:

    {'hi g h [EoW]': 12, 'hi g h e r [EoW]': 14, 'hi g h e s t [EoW]': 10, 'l o w [EoW]': 12, 'l o w e r [EoW]': 11, 'l o w e s t [EoW]': 13}
    {'hig h [EoW]': 12, 'hig h e r [EoW]': 14, 'hig h e s t [EoW]': 10, 'l o w [EoW]': 12, 'l o w e r [EoW]': 11, 'l o w e s t [EoW]': 13}
    {'high [EoW]': 12, 'high e r [EoW]': 14, 'high e s t [EoW]': 10, 'l o w [EoW]': 12, 'l o w e r [EoW]': 11, 'l o w e s t [EoW]': 13}
    {'high [EoW]': 12, 'high e r [EoW]': 14, 'high e s t [EoW]': 10, 'lo w [EoW]': 12, 'lo w e r [EoW]': 11, 'lo w e s t [EoW]': 13}
    {'high [EoW]': 12, 'high e r [EoW]': 14, 'high e s t [EoW]': 10, 'low [EoW]': 12, 'low e r [EoW]': 11, 'low e s t [EoW]': 13}
    {'high [EoW]': 12, 'high er [EoW]': 14, 'high e s t [EoW]': 10, 'low [EoW]': 12, 'low er [EoW]': 11, 'low e s t [EoW]': 13}
    {'high [EoW]': 12, 'high er[EoW]': 14, 'high e s t [EoW]': 10, 'low [EoW]': 12, 'low er[EoW]': 11, 'low e s t [EoW]': 13}
    {'high [EoW]': 12, 'high er[EoW]': 14, 'high es t [EoW]': 10, 'low [EoW]': 12, 'low er[EoW]': 11, 'low es t [EoW]': 13}
    {'high [EoW]': 12, 'high er[EoW]': 14, 'high est [EoW]': 10, 'low [EoW]': 12, 'low er[EoW]': 11, 'low est [EoW]': 13}
    {'high [EoW]': 12, 'high er[EoW]': 14, 'high est[EoW]': 10, 'low [EoW]': 12, 'low er[EoW]': 11, 'low est[EoW]': 13}

Q5: What are the disadvantages of using BPE-based tokenization instead of rule-based tokenization? What are the potential issues with the implementation of BPE above?

References

Source code: src/byte_pair_encoding.py

  • Neural Machine Translation of Rare Words with Subword Units, Sennrich et al., ACL, 2016.

  • A New Algorithm for Data Compression, Gage, The C Users Journal, 1994.

  • SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, Kudo and Richardson, EMNLP, 2018.

  • Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (WordPiece), Wu et al., arXiv, 2016.