NLP Essentials
Contextual Encoding

Homework

HW5: Contextual Encoding


Quiz

  1. The EM algorithm is a classic method in unsupervised learning. What are the advantages of unsupervised learning over supervised learning, and which tasks align well with unsupervised learning?

  2. What are the disadvantages of using BPE-based tokenization instead of rule-based tokenization? What are the potential issues with the BPE implementation presented earlier in this chapter? (A minimal BPE sketch is included after the quiz for reference.)

  3. How does self-attention operate given an embedding matrix $\mathrm{W} \in \mathbb{R}^{n \times d}$ representing a document, where $n$ is the number of words and $d$ is the embedding dimension? (A minimal sketch of self-attention and multi-head attention is included after the quiz.)

  4. Given the same embedding matrix as in question #3, how does multi-head attention function? What advantages does multi-head attention offer over self-attention?

  5. What are the outputs of each layer in the Transformer model? How do the embeddings learned in the upper layers of the Transformer differ from those in the lower layers?

  6. How is masked language modeling used to train a language model with a transformer?

  7. How can one train a document-level embedding using a transformer?

  8. What are the advantages of embeddings generated by transformers compared to those generated by Word2Vec?

  9. Neural networks have gained widespread popularity for training natural language processing models since 2013. What factors enabled this popularity, and how do neural approaches differ from traditional NLP methods?

  10. Recent large language models like ChatGPT or Claude are trained quite differently from traditional NLP models. What are the main differences, and what factors enabled their development?
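
For question 2, the following is a minimal sketch of how BPE learns its merge rules, written in plain Python for reference. It is not the chapter's implementation; the function names (`learn_bpe`, `merge_pair`, `get_pair_counts`) and the toy corpus are illustrative, and production tokenizers add byte-level handling, vocabulary-size limits, and explicit tie-breaking.

```python
# Minimal BPE merge learning over a toy word-frequency dictionary.
# Hypothetical sketch, not the chapter's implementation.
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} dict."""
    # Start from characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties are broken arbitrarily
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges

if __name__ == "__main__":
    corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    print(learn_bpe(corpus, 5))
```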
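
Questions 3 and 4 concern scaled dot-product self-attention and multi-head attention. Below is a minimal NumPy sketch, assuming randomly initialized projection matrices in place of learned parameters, that traces the shapes for a document matrix $\mathrm{W} \in \mathbb{R}^{n \times d}$; it is an illustration, not the course's reference code.

```python
# Single-head self-attention and multi-head attention in NumPy.
# Shapes follow question 3: W is an n x d matrix (n words, d dimensions).
# Projection matrices are random stand-ins for learned parameters.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(W, Wq, Wk, Wv):
    """Scaled dot-product self-attention.
    W: (n, d) token embeddings; Wq/Wk/Wv: (d, d_k) projections."""
    Q, K, V = W @ Wq, W @ Wk, W @ Wv          # (n, d_k) each
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n) pairwise scores
    A = softmax(scores, axis=-1)              # each row sums to 1
    return A @ V                              # (n, d_k) contextual vectors

def multi_head_attention(W, heads, Wo):
    """heads: list of (Wq, Wk, Wv) tuples; Wo: (h*d_k, d) output projection."""
    outs = [self_attention(W, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=-1) @ Wo # (n, d)

if __name__ == "__main__":
    n, d, h, d_k = 6, 16, 4, 4                # toy sizes with d_k = d / h
    rng = np.random.default_rng(0)
    W = rng.normal(size=(n, d))
    heads = [tuple(rng.normal(size=(d, d_k)) for _ in range(3)) for _ in range(h)]
    Wo = rng.normal(size=(h * d_k, d))
    print(self_attention(W, *heads[0]).shape)        # (6, 4)
    print(multi_head_attention(W, heads, Wo).shape)  # (6, 16)
```

Concatenating the per-head outputs and projecting them back to $d$ dimensions lets each head attend to different positions or relations without changing the output size.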

References

  • Attention Is All You Need, Vaswani et al., NIPS 2017.

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., NAACL 2019.
