Recurrent Neural Networks

Update: 2023-10-26

A Recurrent Neural Network (RNN) [1] maintains hidden states of previous inputs and uses them to predict outputs, allowing it to model temporal dependencies in sequential data.

The hidden state is a vector representing the network's internal memory. It captures information from previous time steps, influences the predictions made at the current time step, and is updated at each time step as the RNN processes the input sequence.

RNN for Sequence Tagging

Given an input sequence $X = [x_1, \ldots, x_n]$ where $x_i \in \mathbb{R}^{d \times 1}$, an RNN for sequence tagging defines two functions, $f$ and $g$:

  • $f$ takes the current input $x_i \in X$ and the hidden state $h_{i-1}$ of the previous input $x_{i-1}$, and returns a hidden state $h_i \in \mathbb{R}^{e \times 1}$ such that $f(x_i, h_{i-1}) = \alpha(W^x x_i + W^h h_{i-1}) = h_i$, where $W^x \in \mathbb{R}^{e \times d}$, $W^h \in \mathbb{R}^{e \times e}$, and $\alpha$ is an activation function.

  • $g$ takes the hidden state $h_i$ and returns an output $y_i \in \mathbb{R}^{o \times 1}$ such that $g(h_i) = W^o h_i = y_i$, where $W^o \in \mathbb{R}^{o \times e}$.
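
As a concrete illustration, here is a minimal NumPy sketch of the two functions defined above, assuming tanh as the activation function and arbitrary, randomly initialized weights (all dimensions are hypothetical):

```python
import numpy as np

# A minimal sketch of f and g (hypothetical dimensions; tanh assumed as alpha).
d, e, o = 4, 3, 2                   # input, hidden, and output dimensions
rng = np.random.default_rng(0)
Wx = rng.normal(size=(e, d))        # W^x
Wh = rng.normal(size=(e, e))        # W^h
Wo = rng.normal(size=(o, e))        # W^o

def f(x_i, h_prev):
    """f(x_i, h_{i-1}) = alpha(W^x x_i + W^h h_{i-1})."""
    return np.tanh(Wx @ x_i + Wh @ h_prev)

def g(h_i):
    """g(h_i) = W^o h_i."""
    return Wo @ h_i

X = [rng.normal(size=(d,)) for _ in range(5)]   # input sequence, n = 5
h, Y = np.zeros(e), []                          # h_0 = 0
for x_i in X:
    h = f(x_i, h)        # the hidden state carries information from previous steps
    Y.append(g(h))       # sequence tagging: one output y_i per input x_i
```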

Figure 1 shows an example of an RNN for sequence tagging, such as part-of-speech (POS) tagging:

Figure 1 - An example of an RNN and its application in part-of-speech (POS) tagging.

Notice that the output $y_1$ for the first input $x_1$ is predicted by considering only the input itself such that $f(x_1, \mathbf{0}) = \alpha(W^x x_1) = h_1$ (e.g., the POS tag of the first word "I" is predicted solely using that word). However, the output $y_i$ for every other input $x_i$ is predicted by considering both $x_i$ and $h_{i-1}$, an intermediate representation created explicitly for the task. This enables RNNs to capture sequential information that feedforward neural networks cannot.

Q6: How does each hidden state $h_i$ in an RNN encode information relevant to sequence tagging tasks?

RNN for Text Classification

Unlike sequence tagging, where the RNN predicts a sequence of outputs $Y = [y_1, \ldots, y_n]$ for the input $X = [x_1, \ldots, x_n]$, an RNN designed for text classification predicts only one output $y$ for the entire input sequence such that:

Sequence Tagging: $\text{RNN}_{st}(X) \rightarrow Y$

Text Classification: $\text{RNN}_{tc}(X) \rightarrow y$

To accomplish this, a common practice is to predict the output $y$ from the last hidden state $h_n$ using the function $g$. Figure 2 shows an example of an RNN for text classification, such as sentiment analysis:

Figure 2 - An example of an RNN and its application in sentiment analysis.

Q7: In text classification tasks, what specific information is captured by the final hidden state $h_n$ of an RNN?
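
As a sketch of this setup (hypothetical dimensions, tanh activation, random weights as before), the same recurrence is run over the whole sequence and $g$ is applied only to the last hidden state $h_n$:

```python
import numpy as np

# A minimal sketch: classify a whole sequence from its last hidden state h_n.
d, e, o = 4, 3, 2
rng = np.random.default_rng(1)
Wx, Wh, Wo = rng.normal(size=(e, d)), rng.normal(size=(e, e)), rng.normal(size=(o, e))

def rnn_tc(X):
    """Text classification: run the recurrence over X, then apply g to h_n only."""
    h = np.zeros(e)                          # h_0 = 0
    for x_i in X:
        h = np.tanh(Wx @ x_i + Wh @ h)       # f(x_i, h_{i-1})
    return Wo @ h                            # y = g(h_n): one output for the sequence

y = rnn_tc([rng.normal(size=(d,)) for _ in range(5)])
```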

Bidirectional RNN

The RNNs above do not consider the words that follow the current word when predicting the output. This limitation can significantly impact model performance since contextual information following the current word can be crucial.

For example, let us consider the word "early" in the following two sentences:

  • They are early birds -> "early" is an adjective.

  • They are early today -> "early" is an adverb.

The POS tag of "early" depends on the word that follows it ("birds" or "today"), so making the correct prediction becomes challenging without the following context.

To overcome this challenge, a Bidirectional RNN [2] is suggested, which considers both the forward and backward directions, creating twice as many hidden states to capture a more comprehensive context. Figure 3 illustrates a bidirectional RNN for sequence tagging:

Figure 3 - An overview of a bidirectional RNN.

For every $x_i$, the hidden states $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are created by considering $\overrightarrow{h}_{i-1}$ and $\overleftarrow{h}_{i+1}$, respectively. The function $g$ takes both $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ and returns an output $y_i \in \mathbb{R}^{o \times 1}$ such that $g(\overrightarrow{h}_i, \overleftarrow{h}_i) = W^o(\overrightarrow{h}_i \oplus \overleftarrow{h}_i) = y_i$, where $\overrightarrow{h}_i \oplus \overleftarrow{h}_i \in \mathbb{R}^{2e \times 1}$ is the concatenation of the two hidden states and $W^o \in \mathbb{R}^{o \times 2e}$.
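
The following is a minimal NumPy sketch of a bidirectional RNN for sequence tagging under the same assumptions as the earlier sketches (hypothetical dimensions, tanh activation, random weights), with separate forward and backward parameters and a concatenated output projection:

```python
import numpy as np

# A minimal sketch of a bidirectional RNN for sequence tagging.
d, e, o = 4, 3, 2
rng = np.random.default_rng(2)
Wx_f, Wh_f = rng.normal(size=(e, d)), rng.normal(size=(e, e))   # forward weights
Wx_b, Wh_b = rng.normal(size=(e, d)), rng.normal(size=(e, e))   # backward weights
Wo = rng.normal(size=(o, 2 * e))                                # W^o in R^{o x 2e}

def birnn_st(X):
    """One output per token from the concatenated forward/backward hidden states."""
    n = len(X)
    h_fwd, h_bwd = [None] * n, [None] * n

    h = np.zeros(e)                                  # forward pass: left to right
    for i in range(n):
        h = np.tanh(Wx_f @ X[i] + Wh_f @ h)
        h_fwd[i] = h

    h = np.zeros(e)                                  # backward pass: right to left
    for i in reversed(range(n)):
        h = np.tanh(Wx_b @ X[i] + Wh_b @ h)
        h_bwd[i] = h

    # y_i = W^o (h_fwd_i ⊕ h_bwd_i), where ⊕ denotes concatenation
    return [Wo @ np.concatenate([h_fwd[i], h_bwd[i]]) for i in range(n)]

Y = birnn_st([rng.normal(size=(d,)) for _ in range(5)])
```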

Q8: What are the advantages and limitations of implementing bidirectional RNNs for text classification and sequence tagging tasks?

Advanced Topics

  • Long Short-Term Memory (LSTM) Networks [3-5]

  • Gated Recurrent Units (GRUs) [6-7]
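
Both gated variants are available as standard recurrent layers in deep-learning libraries; the snippet below is an illustrative PyTorch sketch (the sizes are arbitrary and not tied to the course materials):

```python
import torch
import torch.nn as nn

# Gated variants of the vanilla RNN above, as provided by PyTorch (illustrative sizes).
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True, bidirectional=True)
gru = nn.GRU(input_size=4, hidden_size=3, batch_first=True, bidirectional=True)

x = torch.randn(1, 5, 4)       # (batch, sequence length, input dimension)
out, (h_n, c_n) = lstm(x)      # out: (1, 5, 6) — forward/backward states per token
out, h_n = gru(x)              # GRU has no cell state c_n
```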

References

1. Finding Structure in Time, Elman, Cognitive Science, 14(2), 1990.

2. Bidirectional Recurrent Neural Networks, Schuster and Paliwal, IEEE Transactions on Signal Processing, 45(11), 1997.

3. Long Short-Term Memory, Hochreiter and Schmidhuber, Neural Computation, 9(8), 1997 (PDF available at ResearchGate).

4. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, Ma and Hovy, ACL, 2016.*

5. Contextual String Embeddings for Sequence Labeling, Akbik et al., COLING, 2018.*

6. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Cho et al., EMNLP, 2014.*

7. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, Chung et al., NeurIPS Workshop on Deep Learning and Representation Learning, 2014.*