Neural Networks

Logistic Regression

Let $\mathrm{x} = [x_1, \ldots, x_n]$ be a vector representing an input instance, where $x_i$ denotes the $i$'th feature of the input, and let $y \in \{0, 1\}$ be its corresponding output label. Logistic regression uses the logistic function, also known as the sigmoid function, to estimate the probability that $\mathrm{x}$ belongs to $y$:

$$\begin{align*}
P(y=1|\mathrm{x}) &= \frac{1}{1 + e^{-(\mathrm{x}\cdot\mathrm{w}^T+b)}}\\
P(y=0|\mathrm{x}) &= 1 - P(y=1|\mathrm{x})
\end{align*}$$

The weight vector $\mathrm{w} = [w_1, \ldots, w_n]$ assigns weights to each dimension of the input vector $\mathrm{x}$ for the label $y=1$, such that a higher magnitude of weight $w_i$ indicates greater importance of the feature $x_i$. Finally, $b$ represents the bias of the label $y=1$ within the training distribution.

Q7: What role does the sigmoid function play in the logistic regression model?

Consider a corpus consisting of two sentences:

D1: I love this movie

D2: I hate this movie

The input vectors $\mathrm{x}_1$ and $\mathrm{x}_2$ can be created for these two sentences using the bag-of-words model:

V = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie"}
x1 = [1, 1, 0, 1, 1]
x2 = [1, 0, 1, 1, 1]
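Such binary bag-of-words vectors can be built with a few lines of Python; the sketch below is only illustrative (the helper name `bag_of_words` and the presence-based encoding are our assumptions, not part of the original text):

# Vocabulary as defined above (index -> term), inverted for lookup (term -> index)
V = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie"}
term2idx = {term: idx for idx, term in V.items()}
def bag_of_words(sentence):
    # Binary bag-of-words: 1 if the vocabulary term occurs in the sentence, else 0
    x = [0] * len(term2idx)
    for token in sentence.split():
        if token in term2idx:
            x[term2idx[token]] = 1
    return x
print(bag_of_words("I love this movie"))  # [1, 1, 0, 1, 1]
print(bag_of_words("I hate this movie"))  # [1, 0, 1, 1, 1]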

Let $y_1 = 1$ and $y_2 = 0$ be the output labels of $\mathrm{x}_1$ and $\mathrm{x}_2$, representing positive and negative sentiments of the input sentences, respectively. Then, a weight vector $\mathrm{w}$ can be trained using logistic regression:

w = [0.0, 1.5, -1.5, 0.0, 0.0]
b = 0

Since the terms "I", "this", and "movie" appear with equal frequency across both labels, their weights $w_1$, $w_4$, and $w_5$ are neutralized. On the other hand, the terms "love" and "hate" appear only with the positive and negative labels, respectively. Therefore, while the weight $w_2$ for "love" ($x_2$) contributes positively to the label $y=1$, the weight $w_3$ for "hate" ($x_3$) has a negative impact on the label $y=1$. Furthermore, as positive and negative sentiment labels are equally represented in this corpus, the bias $b$ is also set to 0.

Given the weight vector and the bias, we have $\mathrm{x}_1 \cdot \mathrm{w}^T + b = 1.5$ and $\mathrm{x}_2 \cdot \mathrm{w}^T + b = -1.5$, resulting in the following probabilities:

$$\begin{align*}
P(y=1|\mathrm{x}_1) &\approx 0.82\\
P(y=1|\mathrm{x}_2) &\approx 0.18
\end{align*}$$

As the probability of $\mathrm{x}_1$ being $y=1$ exceeds $0.5$ (50%), the model predicts the first sentence to convey a positive sentiment. Conversely, the model predicts the second sentence to convey a negative sentiment, as its probability of being $y=1$ is below 50%.
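These probabilities can be verified with a short Python sketch; the function names `sigmoid` and `dot` are ours, and the weights are taken directly from the example above:

import math
def sigmoid(z):
    # Logistic (sigmoid) function
    return 1.0 / (1.0 + math.exp(-z))
def dot(x, w):
    # Dot product of two equal-length vectors
    return sum(xi * wi for xi, wi in zip(x, w))
w = [0.0, 1.5, -1.5, 0.0, 0.0]
b = 0.0
x1 = [1, 1, 0, 1, 1]  # "I love this movie"
x2 = [1, 0, 1, 1, 1]  # "I hate this movie"
for x in (x1, x2):
    p = sigmoid(dot(x, w) + b)  # P(y=1|x)
    print(round(p, 2), 1 if p > 0.5 else 0)
# 0.82 1  -> predicted positive
# 0.18 0  -> predicted negative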

Q8: Under what circumstances would the bias $b$ be negative in the above example? Additionally, when might neutral terms such as "this" or "movie" exhibit non-neutral weights?

Softmax Regression

Softmax regression, also known as multinomial logistic regression, is an extension of logistic regression to handle classification problems with more than two classes. Given an input vector $\mathrm{x} \in \mathbb{R}^{1 \times n}$ and its output label $y \in \{0, \ldots, m-1\}$, the model uses the softmax function to estimate the probability that $\mathrm{x}$ belongs to each class:

$$P(y=k|\mathrm{x}) = \frac{e^{\mathrm{x}\cdot\mathrm{w}_k^T+b_k}}{\sum_{j=1}^m e^{\mathrm{x}\cdot\mathrm{w}_j^T+b_j}}$$

The weight vector $\mathrm{w}_k$ assigns weights to $\mathrm{x}$ for the label $y=k$, while $b_k$ represents the bias associated with the label $y=k$.

Q9: What is the role of the softmax function in the softmax regression model? How does it differ from the sigmoid function?

Consider a corpus consisting of three sentences:

D1: I love this movie

D2: I hate this movie

D3: I watched this movie

Then, the input vectors $\mathrm{x}_1$, $\mathrm{x}_2$, and $\mathrm{x}_3$ for the sentences can be created using the bag-of-words model:
V = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie", 5: "watched"}
x1 = [1, 1, 0, 1, 1, 0]
x2 = [1, 0, 1, 1, 1, 0]
x3 = [1, 0, 0, 1, 1, 1]

Let $y_1 = 1$, $y_2 = 0$, and $y_3 = 2$ be the output labels of $\mathrm{x}_1$, $\mathrm{x}_2$, and $\mathrm{x}_3$, representing positive, negative, and neutral sentiments of the input sentences, respectively. Then, weight vectors $\mathrm{w}_1$, $\mathrm{w}_2$, and $\mathrm{w}_3$ can be trained using softmax regression as follows:

w1 = [0.0,  1.5, -1.0, 0.0, 0.0, 0.0]
w2 = [0.0, -1.0,  1.5, 0.0, 0.0, 0.0]
w3 = [0.0, -1.0, -1.0, 0.0, 0.0, 1.5]
b1 = b2 = b3 = 0

Unlike the case of logistic regression, where all weights are oriented toward $y = 1$ (the weights for "love" and "hate" contribute positively and negatively to $y = 1$, respectively, but not to $y=0$), the values in each weight vector here are oriented toward its corresponding label.

Given the weight vectors and the biases, we can estimate the following probabilities for $\mathrm{x}_1$:

$$\begin{align*}
\mathrm{x}_1 \cdot \mathrm{w}_1^T + b_1 &= 1.5 && \Rightarrow & P(y=1|\mathrm{x}_1) &= 0.86\\
\mathrm{x}_1 \cdot \mathrm{w}_2^T + b_2 &= -1.0 && \Rightarrow & P(y=0|\mathrm{x}_1) &= 0.07\\
\mathrm{x}_1 \cdot \mathrm{w}_3^T + b_3 &= -1.0 && \Rightarrow & P(y=2|\mathrm{x}_1) &= 0.07
\end{align*}$$

Since the probability of $y=1$ is the highest among all labels, the model predicts the first sentence to convey a positive sentiment. For $\mathrm{x}_3$, the following probabilities can be estimated:

$$\begin{align*}
\mathrm{x}_3 \cdot \mathrm{w}_1^T + b_1 &= 0 && \Rightarrow & P(y=1|\mathrm{x}_3) &= 0.15\\
\mathrm{x}_3 \cdot \mathrm{w}_2^T + b_2 &= 0 && \Rightarrow & P(y=0|\mathrm{x}_3) &= 0.15\\
\mathrm{x}_3 \cdot \mathrm{w}_3^T + b_3 &= 1.5 && \Rightarrow & P(y=2|\mathrm{x}_3) &= 0.69
\end{align*}$$

Since the probability of $y=2$ is the highest among all labels, the model predicts the third sentence to convey a neutral sentiment.

Softmax regression always predicts $m$ values, so its output can be represented by a vector $\mathrm{y} \in \mathbb{R}^{1 \times m}$, wherein the $i$'th value in $\mathrm{y}$ contains the probability of the input belonging to the $i$'th class. Similarly, the weight vectors for all labels can be stacked into a weight matrix $\mathrm{W} \in \mathbb{R}^{m \times n}$, where the $i$'th row represents the weight vector for the $i$'th label.

With this new formulation, softmax regression can be defined as $\mathrm{y} = \mathrm{softmax}(\mathrm{x} \cdot \mathrm{W}^T)$, and the optimal prediction can be obtained as $\mathrm{argmax}(\mathrm{y})$, which returns the label with the highest probability.
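As an illustration of this matrix formulation, the NumPy sketch below stacks the example's weight vectors into a matrix (rows ordered by label index, so row 0 holds the weights for the negative label, row 1 for the positive label, and row 2 for the neutral label) and recovers the probabilities computed above; the `softmax` helper is ours:

import numpy as np
def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()
# Rows ordered by label index: 0 = negative (w2), 1 = positive (w1), 2 = neutral (w3)
W = np.array([
    [0.0, -1.0,  1.5, 0.0, 0.0, 0.0],
    [0.0,  1.5, -1.0, 0.0, 0.0, 0.0],
    [0.0, -1.0, -1.0, 0.0, 0.0, 1.5],
])
x1 = np.array([1, 1, 0, 1, 1, 0])  # "I love this movie"
x3 = np.array([1, 0, 0, 1, 1, 1])  # "I watched this movie"
for x in (x1, x3):
    y = softmax(x @ W.T)  # probability of each label
    print(np.round(y, 2), int(np.argmax(y)))
# [0.07 0.86 0.07] 1  -> positive
# [0.15 0.15 0.69] 2  -> neutral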

What are the limitations of the softmax regression model?

Multilayer Perceptron

A multilayer perceptron (MLP) is a type of feedforward neural network consisting of multiple layers of neurons, where all neurons in one layer are fully connected to all neurons in its adjacent layers. Given an input vector $\mathrm{x} \in \mathbb{R}^{1 \times n}$ and an output vector $\mathrm{y} \in \mathbb{R}^{1 \times m}$, the model allows zero to many hidden layers to generate intermediate representations of the input.

Let $\mathrm{h} \in \mathbb{R}^{1 \times d}$ be a hidden layer between $\mathrm{x}$ and $\mathrm{y}$. To connect $\mathrm{x}$ and $\mathrm{h}$, we need a weight matrix $\mathrm{W}_x \in \mathbb{R}^{n \times d}$ such that $\mathrm{h} = \mathrm{activation}(\mathrm{x} \cdot \mathrm{W}_x)$, where $\mathrm{activation}()$ is an activation function applied to the output of each neuron; it introduces non-linearity into the network, allowing it to learn complex patterns and relationships in the data. Activation functions determine whether a neuron should be activated, i.e., whether the neuron's output should be passed on to the next layer.

$$\mathrm{h} = \mathrm{activation}(\mathrm{x} \cdot \mathrm{W}_x)$$

Similarly, to connect $\mathrm{h}$ and $\mathrm{y}$, we need a weight matrix $\mathrm{W}_h \in \mathbb{R}^{m \times d}$ such that $\mathrm{y} = \mathrm{softmax}(\mathrm{h} \cdot \mathrm{W}_h^T)$. Thus, a multilayer perceptron with one hidden layer can be represented as:

$$\mathrm{y} = \mathrm{softmax}[\mathrm{activation}(\mathrm{x} \cdot \mathrm{W}_x) \cdot \mathrm{W}_h^T] = \mathrm{softmax}(\mathrm{h} \cdot \mathrm{W}_h^T)$$

Q10: Notice that the above equation for MLP does not include bias terms. How are biases handled in light of this formulation?

Consider a corpus comprising the following five sentences and their corresponding labels ($\Rightarrow$):

D1: I love this movie $\Rightarrow$ positive

D2: I hate this movie $\Rightarrow$ negative

D3: I watched this movie $\Rightarrow$ neutral

D4: I truly love this movie $\Rightarrow$ very positive

D5: I truly hate this movie $\Rightarrow$ very negative

The input vectors $\mathrm{x}_{1..5}$ can be created using the bag-of-words model:

X = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie", 5: "watched", 6: "truly"}
Y = {0: "positive", 1: "negative", 2: "neutral", 3: "very positive", 4: "very negative"}

x1 = [1, 1, 0, 1, 1, 0, 0]
x2 = [1, 0, 1, 1, 1, 0, 0]
x3 = [1, 0, 0, 1, 1, 1, 0]
x4 = [1, 1, 0, 1, 1, 0, 1]
x5 = [1, 0, 1, 1, 1, 0, 1]

y1, y2, y3, y4, y5 = 0, 1, 2, 3, 4

Q11: What would be the weight assigned to the feature "truly" learned by softmax regression for the above example?

The first weight matrix $\mathrm{W}_x \in \mathbb{R}^{7 \times 5}$ can be trained by an MLP as follows:

Wx = [
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [1.0, 0.0, 0.0, 0.5, 0.0],
  [0.0, 1.0, 0.0, 0.0, 0.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 1.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 0.5, 0.5]
]

Given the values in $\mathrm{W}_x$, we can infer that the first, second, and third columns represent "love", "hate", and "watched", while the fourth and fifth columns learn combined features such as {"truly", "love"} and {"truly", "hate"}, respectively.

Each of $\mathrm{x}_{1..5}$ is multiplied by $\mathrm{W}_x$ to obtain the hidden layers $\mathrm{h}_{1..5}$, respectively, where the activation function is defined as follows:

$$\mathrm{activation}(x) = \left\{\begin{array}{ll} x & \text{if } x > 0.5\\ 0 & \text{otherwise} \end{array}\right.$$
g1 = [1.0, 0.0, 0.0, 0.5, 0.0]
g2 = [0.0, 1.0, 0.0, 0.0, 0.5]
g3 = [0.0, 0.0, 1.0, 0.0, 0.0]
g4 = [1.0, 0.0, 0.0, 1.0, 0.5]
g5 = [0.0, 1.0, 0.0, 0.5, 1.0]
h1 = activation(g1) = [1.0, 0.0, 0.0, 0.0, 0.0]
h2 = activation(g2) = [0.0, 1.0, 0.0, 0.0, 0.0]
h3 = activation(g3) = [0.0, 0.0, 1.0, 0.0, 0.0]
h4 = activation(g4) = [1.0, 0.0, 0.0, 1.0, 0.0]
h5 = activation(g5) = [0.0, 1.0, 0.0, 0.0, 1.0]

The second weight matrix $\mathrm{W}_h \in \mathbb{R}^{5 \times 5}$ can also be trained by an MLP as follows:

Wh = [
  [ 1.0, -1.0, 0.0, -0.5, -1.0],
  [-1.0,  1.0, 0.0, -1.0, -0.5],
  [-1.0, -1.0, 1.0, -1.0, -1.0],
  [ 0.0, -1.0, 0.0,  1.0, -1.0],
  [-1.0,  0.0, 0.0, -1.0,  1.0]
]

By applying the softmax function to each $\mathrm{h}_i \cdot \mathrm{W}_h^T$, we obtain the corresponding output vector $\mathrm{y}_i$:

o1 = [ 1.0, -1.0, -1.0,  0.0, -1.0]
o2 = [-1.0,  1.0, -1.0, -1.0,  0.0]
o3 = [ 0.0,  0.0,  1.0,  0.0,  0.0]
o4 = [ 0.5, -2.0, -2.0,  1.0, -2.0]
o5 = [-2.0,  0.5, -2.0, -2.0,  1.0]
y1 = softmax(o1) = [0.56, 0.08, 0.08, 0.21, 0.08]
y2 = softmax(o2) = [0.08, 0.56, 0.08, 0.08, 0.21]
y3 = softmax(o3) = [0.15, 0.15, 0.40, 0.15, 0.15]
y4 = softmax(o4) = [0.35, 0.03, 0.03, 0.57, 0.03]
y5 = softmax(o5) = [0.03, 0.35, 0.03, 0.03, 0.57]

The prediction can be made by taking the argmax of each $\mathrm{y}_i$.
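The entire forward pass can be reproduced with the NumPy sketch below; the step-like `activation` mirrors the definition given earlier, and all weights and input vectors come from this section (the variable names are ours):

import numpy as np
def activation(g):
    # Custom activation from above: keep values greater than 0.5, zero out the rest
    return np.where(g > 0.5, g, 0.0)
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()
Wx = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5],
])
Wh = np.array([
    [ 1.0, -1.0, 0.0, -0.5, -1.0],
    [-1.0,  1.0, 0.0, -1.0, -0.5],
    [-1.0, -1.0, 1.0, -1.0, -1.0],
    [ 0.0, -1.0, 0.0,  1.0, -1.0],
    [-1.0,  0.0, 0.0, -1.0,  1.0],
])
X = np.array([
    [1, 1, 0, 1, 1, 0, 0],  # D1: positive (0)
    [1, 0, 1, 1, 1, 0, 0],  # D2: negative (1)
    [1, 0, 0, 1, 1, 1, 0],  # D3: neutral (2)
    [1, 1, 0, 1, 1, 0, 1],  # D4: very positive (3)
    [1, 0, 1, 1, 1, 0, 1],  # D5: very negative (4)
])
for x in X:
    h = activation(x @ Wx)  # hidden layer h_i
    y = softmax(h @ Wh.T)   # output probabilities y_i
    print(np.round(y, 2), int(np.argmax(y)))
# Predicted labels: 0, 1, 2, 3, 4 (matching y1..y5)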

Q12: What are the limitations of a multilayer perceptron?

Figure: Types of decision regions that can be formed by different layers of MLP [1]

References

[1] Neural Network Methodologies and their Potential Application to Cloud Pattern Recognition, J. E. Peak, Defense Technical Information Center, ADA239214, 1991.