Recurrent Neural Networks

Update: 2023-10-26

A Recurrent Neural Network (RNN) [1] maintains hidden states of previous inputs and uses them to predict outputs, allowing it to model temporal dependencies in sequential data.

The hidden state is a vector representing the network's internal memory. It captures information from previous time steps and influences the prediction made at the current time step, and it is updated at each time step as the RNN processes a sequence of inputs.

RNN for Sequence Tagging

Given an input sequence $X = [x_1, \ldots, x_n]$ where $x_i \in \mathbb{R}^{d \times 1}$, an RNN for sequence tagging defines two functions, $f$ and $g$ (see the sketch after this list):

  • $f$ takes the current input $x_i \in X$ and the hidden state $h_{i-1}$ of the previous input $x_{i-1}$, and returns a hidden state $h_i \in \mathbb{R}^{e \times 1}$ such that $f(x_i, h_{i-1}) = \alpha(W^x x_i + W^h h_{i-1}) = h_i$, where $W^x \in \mathbb{R}^{e \times d}$, $W^h \in \mathbb{R}^{e \times e}$, and $\alpha$ is an activation function.

  • $g$ takes the hidden state $h_i$ and returns an output $y_i \in \mathbb{R}^{o \times 1}$ such that $g(h_i) = W^o h_i = y_i$, where $W^o \in \mathbb{R}^{o \times e}$.
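The following is a minimal NumPy sketch of these two functions, assuming $\tanh$ as the activation $\alpha$ and randomly initialized weights; the dimensions $d$, $e$, and $o$ (and the sequence length) are illustrative choices, not values prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, e, o = 4, 3, 2                  # input, hidden, and output dimensions (illustrative)
Wx = rng.normal(size=(e, d))       # W^x in R^{e x d}
Wh = rng.normal(size=(e, e))       # W^h in R^{e x e}
Wo = rng.normal(size=(o, e))       # W^o in R^{o x e}

def f(x_i, h_prev):
    """h_i = alpha(W^x x_i + W^h h_{i-1}), with alpha = tanh."""
    return np.tanh(Wx @ x_i + Wh @ h_prev)

def g(h_i):
    """y_i = W^o h_i."""
    return Wo @ h_i

# Sequence tagging: one output per input, with h_0 = 0 for the first time step.
X = [rng.normal(size=(d,)) for _ in range(5)]
h = np.zeros(e)
Y = []
for x_i in X:
    h = f(x_i, h)
    Y.append(g(h))
print(len(Y), Y[0].shape)          # 5 outputs, each of dimension o
```

Looping $f$ over the sequence and applying $g$ at every step yields one output per input, which is exactly the sequence tagging setting described above.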

Figure 1 shows an example of an RNN for sequence tagging, such as part-of-speech tagging:

Figure 1 - An example of an RNN and its application in part-of-speech (POS) tagging.

Notice that the output $y_1$ for the first input $x_1$ is predicted by considering only the input itself such that $f(x_1, \mathbf{0}) = \alpha(W^x x_1) = h_1$ (e.g., the POS tag of the first word "I" is predicted solely using that word). However, the output $y_i$ for every other input $x_i$ is predicted by considering both $x_i$ and $h_{i-1}$, an intermediate representation created explicitly for the task. This enables RNNs to capture sequential information that Feedforward Neural Networks cannot.


Q6: How does each hidden state $h_i$ in an RNN encode information relevant to sequence tagging tasks?

RNN for Text Classification

Unlike sequence tagging, where the RNN predicts a sequence of outputs $Y = [y_1, \ldots, y_n]$ for the input $X = [x_1, \ldots, x_n]$, an RNN designed for text classification predicts only one output $y$ for the entire input sequence such that:

  • Sequence Tagging: $\text{RNN}_{st}(X) \rightarrow Y$

  • Text Classification: $\text{RNN}_{tc}(X) \rightarrow y$

To accomplish this, a common practice is to predict the output $y$ from the last hidden state $h_n$ using the function $g$. Figure 2 shows an example of an RNN for text classification, such as sentiment analysis:

Figure 2 - An example of an RNN and its application in sentiment analysis.
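As a rough illustration of this practice, the sketch below runs the same recurrence over the whole sequence but applies $g$ only to the final hidden state $h_n$; all dimensions, weights, and the sequence length are again illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
d, e, o = 4, 3, 2                       # input, hidden, and number-of-classes dimensions
Wx, Wh = rng.normal(size=(e, d)), rng.normal(size=(e, e))
Wo = rng.normal(size=(o, e))

def classify(X):
    """Return a single output y = g(h_n) for the entire sequence X."""
    h = np.zeros(e)
    for x_i in X:                       # run the RNN over the whole sequence
        h = np.tanh(Wx @ x_i + Wh @ h)  # h_i = alpha(W^x x_i + W^h h_{i-1})
    return Wo @ h                       # apply g only to the last hidden state h_n

X = [rng.normal(size=(d,)) for _ in range(6)]
print(classify(X).shape)                # one o-dimensional score vector per sequence
```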


Q7: In text classification tasks, what specific information is captured by the final hidden state $h_n$ of an RNN?

Bidirectional RNN

The above RNN for sequence tagging does not consider the words that follow the current word when predicting the output. This limitation can significantly impact model performance because contextual information following the current word can be crucial.

For example, let us consider the word "early" in the following two sentences:

  • They are early birds -> "early" is an adjective.

  • They are early today -> "early" is an adverb.

The POS tags of "early" depend on the following words, "birds" and "today"; without that context, making the correct predictions becomes challenging.

To overcome this challenge, a Bidirectional RNN has been suggested [2] that processes the sequence in both the forward and backward directions, creating twice as many hidden states to capture a more comprehensive context. Figure 3 illustrates a bidirectional RNN for sequence tagging:

Figure 3 - An overview of a bidirectional RNN.

For every $x_i$, the hidden states $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are created by considering $\overrightarrow{h}_{i-1}$ and $\overleftarrow{h}_{i+1}$, respectively. The function $g$ takes both $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ and returns an output $y_i \in \mathbb{R}^{o \times 1}$ such that $g(\overrightarrow{h}_i, \overleftarrow{h}_i) = W^o (\overrightarrow{h}_i \oplus \overleftarrow{h}_i) = y_i$, where $(\overrightarrow{h}_i \oplus \overleftarrow{h}_i) \in \mathbb{R}^{2e \times 1}$ is the concatenation of the two hidden states and $W^o \in \mathbb{R}^{o \times 2e}$.
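A hedged sketch of this bidirectional computation, assuming separate forward and backward parameters and the same $\tanh$ activation as before, might look as follows; note that $W^o$ now maps the $2e$-dimensional concatenation to the output. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, e, o, n = 4, 3, 2, 5
Wx_f, Wh_f = rng.normal(size=(e, d)), rng.normal(size=(e, e))  # forward-direction weights
Wx_b, Wh_b = rng.normal(size=(e, d)), rng.normal(size=(e, e))  # backward-direction weights
Wo = rng.normal(size=(o, 2 * e))                               # W^o in R^{o x 2e}

X = [rng.normal(size=(d,)) for _ in range(n)]

# Forward pass: h_fwd[i] depends on x_i and h_fwd[i-1].
h, h_fwd = np.zeros(e), []
for x_i in X:
    h = np.tanh(Wx_f @ x_i + Wh_f @ h)
    h_fwd.append(h)

# Backward pass: h_bwd[i] depends on x_i and h_bwd[i+1].
h, h_bwd = np.zeros(e), [None] * n
for i in range(n - 1, -1, -1):
    h = np.tanh(Wx_b @ X[i] + Wh_b @ h)
    h_bwd[i] = h

# g concatenates both hidden states before the output projection.
Y = [Wo @ np.concatenate([h_fwd[i], h_bwd[i]]) for i in range(n)]
print(len(Y), Y[0].shape)                                      # n outputs, each of dimension o
```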


Q8: What are the advantages and limitations of implementing bidirectional RNNs for text classification and sequence tagging tasks?

Advanced Topics

  • Long Short-Term Memory (LSTM) Networks [3-5]

  • Gated Recurrent Units (GRUs) [6-7]
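As a brief, non-authoritative illustration of how these gated variants are used in practice, the snippet below instantiates the standard nn.LSTM and nn.GRU modules from PyTorch (a library choice assumed here, not prescribed by these notes); the batch size, sequence length, and dimensions are arbitrary examples.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 7, 4)   # (batch, sequence length, input dimension d)

# Bidirectional LSTM: the output concatenates forward and backward hidden states (2 * hidden_size).
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True, bidirectional=True)
out, (h_n, c_n) = lstm(x)  # out: (1, 7, 6); h_n, c_n hold the final hidden and cell states

# Unidirectional GRU: a gated cell with no separate cell state.
gru = nn.GRU(input_size=4, hidden_size=3, batch_first=True)
out, h_n = gru(x)          # out: (1, 7, 3)
print(out.shape)
```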

References

  1. Finding Structure in Time, Elman, Cognitive Science, 14(2), 1990.

  2. Bidirectional Recurrent Neural Networks, Schuster and Paliwal, IEEE Transactions on Signal Processing, 45(11), 1997.

  3. Long Short-Term Memory, Hochreiter and Schmidhuber, Neural Computation, 9(8), 1997 (PDF available at ResearchGate).

  4. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, Ma and Hovy, ACL, 2016.*

  5. Contextual String Embeddings for Sequence Labeling, Akbik et al., COLING, 2018.*

  6. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Cho et al., EMNLP, 2014.*

  7. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, Chung et al., NeurIPS Workshop on Deep Learning and Representation Learning, 2014.*
