Neural Networks

Logistic Regression

Let $\mathrm{x} = [x_1, \ldots, x_n]$ be a vector representing an input instance, where $x_i$ denotes the $i$'th feature of the input, and let $y \in \{0, 1\}$ be its corresponding output label. Logistic regression uses the logistic function, a.k.a. the sigmoid function, to estimate the probability that $\mathrm{x}$ belongs to the class $y = 1$:

$$
\begin{align*}
P(y=1|\mathrm{x}) &= \frac{1}{1 + e^{-(\mathrm{x}\cdot\mathrm{w}^T+b)}}\\
P(y=0|\mathrm{x}) &= 1 - P(y=1|\mathrm{x})
\end{align*}
$$

The weight vector $\mathrm{w} = [w_1, \ldots, w_n]$ assigns weights to each dimension of the input vector $\mathrm{x}$ for the label $y=1$ such that a higher magnitude of weight $w_i$ indicates greater importance of the feature $x_i$. Finally, $b$ represents the bias of the label $y=1$ within the training distribution.
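
As a concrete illustration, this estimation can be sketched in a few lines of Python; the helper names sigmoid and predict_prob below are illustrative choices, not part of the original formulation.

import math

def sigmoid(z):
    # logistic (sigmoid) function: maps any real-valued score to the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_prob(x, w, b):
    # P(y=1|x) = sigmoid(x . w + b), with x and w given as plain Python lists
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return sigmoid(z)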

What role does the sigmoid function play in the logistic regression model?

Consider a corpus consisting of two sentences:

D1: I love this movie

D2: I hate this movie

The input vectors $\mathrm{x}_1$ and $\mathrm{x}_2$ can be created for these two sentences using the bag-of-words model:

V = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie"}
x1 = [1, 1, 0, 1, 1]
x2 = [1, 0, 1, 1, 1]
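
A minimal bag-of-words vectorizer for this toy corpus might look as follows; the helper name bag_of_words and the whitespace tokenization are assumptions made for illustration.

def bag_of_words(sentence, vocab):
    # vocab maps each term to its index, e.g., {"I": 0, "love": 1, ...}
    x = [0] * len(vocab)
    for token in sentence.split():
        if token in vocab:
            x[vocab[token]] = 1  # binary feature: the term appears in the sentence
    return x

vocab = {"I": 0, "love": 1, "hate": 2, "this": 3, "movie": 4}
x1 = bag_of_words("I love this movie", vocab)  # [1, 1, 0, 1, 1]
x2 = bag_of_words("I hate this movie", vocab)  # [1, 0, 1, 1, 1]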

Let $y_1 = 1$ and $y_2 = 0$ be the output labels of $\mathrm{x}_1$ and $\mathrm{x}_2$, representing positive and negative sentiments of the input sentences, respectively. Then, a weight vector $\mathrm{w}$ can be trained using logistic regression:

w = [0.0, 1.5, -1.5, 0.0, 0.0]
b = 0

Since the terms "I", "this", and "movie" appear with equal frequency across both labels, their weights $w_1$, $w_4$, and $w_5$ are neutralized. On the other hand, the terms "love" and "hate" appear only with the positive and negative labels, respectively. Therefore, while the weight $w_2$ for "love" ($x_2$) contributes positively to the label $y=1$, the weight $w_3$ for "hate" ($x_3$) has a negative impact on the label $y=1$. Furthermore, as positive and negative sentiment labels are equally represented in this corpus, the bias $b$ is also set to 0.

Given the weight vector and the bias, we have $\mathrm{x}_1 \cdot \mathrm{w}^T + b = 1.5$ and $\mathrm{x}_2 \cdot \mathrm{w}^T + b = -1.5$, resulting in the following probabilities:

$$
\begin{align*}
P(y=1|\mathrm{x}_1) &\approx 0.82\\
P(y=1|\mathrm{x}_2) &\approx 0.18
\end{align*}
$$

As the probability of $\mathrm{x}_1$ being $y=1$ exceeds $0.5$ (50%), the model predicts the first sentence to convey a positive sentiment. Conversely, the model predicts the second sentence to convey a negative sentiment, as its probability of being $y=1$ is below 50%.
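
Assuming the sigmoid and predict_prob sketches above, these probabilities and predictions can be reproduced as follows.

w = [0.0, 1.5, -1.5, 0.0, 0.0]
b = 0.0

p1 = predict_prob(x1, w, b)  # sigmoid(1.5)  ~ 0.82
p2 = predict_prob(x2, w, b)  # sigmoid(-1.5) ~ 0.18

label1 = 1 if p1 > 0.5 else 0  # 1 -> positive sentiment
label2 = 1 if p2 > 0.5 else 0  # 0 -> negative sentiment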

Under what circumstances would the bias $b$ be negative in the above example? Additionally, when might neutral terms such as "this" or "movie" exhibit non-neutral weights?

Softmax Regression

Softmax regression, a.k.a. multinomial logistic regression, is an extension of logistic regression that handles classification problems with more than two classes. Given an input vector $\mathrm{x} \in \mathbb{R}^{1 \times n}$ and its output label $y \in \{0, \ldots, m-1\}$, the model uses the softmax function to estimate the probability that $\mathrm{x}$ belongs to each class:

$$
P(y=k|\mathrm{x}) = \frac{e^{\mathrm{x}\cdot\mathrm{w}_k^T+b_k}}{\sum_{j=0}^{m-1} e^{\mathrm{x}\cdot\mathrm{w}_j^T+b_j}}
$$

The weight vector $\mathrm{w}_k$ assigns weights to $\mathrm{x}$ for the label $y=k$, while $b_k$ represents the bias associated with the label $y=k$.
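
The softmax estimation can be sketched as follows, assuming the per-class scores $\mathrm{x}\cdot\mathrm{w}_k^T+b_k$ are collected into a single list; the helper names class_scores and softmax are illustrative.

import math

def class_scores(x, weights, biases):
    # weights[k] is the weight vector w_k and biases[k] is the bias b_k of class k
    return [sum(xi * wi for xi, wi in zip(x, w)) + b
            for w, b in zip(weights, biases)]

def softmax(scores):
    # exponentiate every class score and normalize so the outputs sum to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]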

What is the role of the softmax function in the softmax regression model? How does it differ from the sigmoid function?

Consider a corpus consisting of three sentences:

D1: I love this movie

D2: I hate this movie

D3: I watched this movie

Then, the input vectors $\mathrm{x}_1$, $\mathrm{x}_2$, and $\mathrm{x}_3$ for the sentences can be created using the bag-of-words model:

V = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie", 5: "watched"}
x1 = [1, 1, 0, 1, 1, 0]
x2 = [1, 0, 1, 1, 1, 0]
x3 = [1, 0, 0, 1, 1, 1]

Let $y_1 = 1$, $y_2 = 0$, and $y_3 = 2$ be the output labels of $\mathrm{x}_1$, $\mathrm{x}_2$, and $\mathrm{x}_3$, representing positive, negative, and neutral sentiments of the input sentences, respectively. Then, weight vectors $\mathrm{w}_1$, $\mathrm{w}_2$, and $\mathrm{w}_3$ can be trained using softmax regression as follows:

w1 = [0.0,  1.5, -1.0, 0.0, 0.0, 0.0]
w2 = [0.0, -1.0,  1.5, 0.0, 0.0, 0.0]
w3 = [0.0, -1.0, -1.0, 0.0, 0.0, 1.5]
b1 = b2 = b3 = 0

Unlike the case of logistic regression, where all weights are oriented to $y = 1$ (both $w_2$ and $w_3$ giving positive and negative weights to $y = 1$ respectively, but not to $y = 0$), the values in each weight vector are oriented to its corresponding label.

Given the weight vectors and the biases, we can estimate the following probabilities for $\mathrm{x}_1$:

$$
\begin{align*}
\mathrm{x}_1 \cdot \mathrm{w}_1^T + b_1 &= 1.5 &\Rightarrow\quad P(y=1|\mathrm{x}_1) &= 0.86\\
\mathrm{x}_1 \cdot \mathrm{w}_2^T + b_2 &= -1.0 &\Rightarrow\quad P(y=0|\mathrm{x}_1) &= 0.07\\
\mathrm{x}_1 \cdot \mathrm{w}_3^T + b_3 &= -1.0 &\Rightarrow\quad P(y=2|\mathrm{x}_1) &= 0.07
\end{align*}
$$

Since the probability of $y=1$ is the highest among all labels, the model predicts the first sentence to convey a positive sentiment. For $\mathrm{x}_3$, the following probabilities can be estimated:

$$
\begin{align*}
\mathrm{x}_3 \cdot \mathrm{w}_1^T + b_1 &= 0 &\Rightarrow\quad P(y=1|\mathrm{x}_3) &= 0.15\\
\mathrm{x}_3 \cdot \mathrm{w}_2^T + b_2 &= 0 &\Rightarrow\quad P(y=0|\mathrm{x}_3) &= 0.15\\
\mathrm{x}_3 \cdot \mathrm{w}_3^T + b_3 &= 1.5 &\Rightarrow\quad P(y=2|\mathrm{x}_3) &= 0.69
\end{align*}
$$

Since the probability of $y=2$ is the highest among all labels, the model predicts the third sentence to convey a neutral sentiment.
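
Assuming the class_scores and softmax sketches above, the estimates for $\mathrm{x}_3$ can be reproduced as follows; note that the weight vectors are ordered by the labels $y=1$, $y=0$, and $y=2$, following the example.

weights = [w1, w2, w3]  # oriented to the labels y=1, y=0, y=2, respectively
biases = [b1, b2, b3]   # all zero in this example

scores = class_scores(x3, weights, biases)              # [0.0, 0.0, 1.5]
probs = softmax(scores)                                 # ~ [0.15, 0.15, 0.69]
best = max(range(len(probs)), key=lambda k: probs[k])   # 2 -> neutral (y=2)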

Softmax regression always predicts $m$ values, so its output can be represented by a vector $\mathrm{y} \in \mathbb{R}^{1 \times m}$, wherein the $i$'th value of $\mathrm{y}$ contains the probability of the input belonging to the $i$'th class. Similarly, the weight vectors for all labels can be stacked into a weight matrix $\mathrm{W} \in \mathbb{R}^{m \times n}$, where the $i$'th row represents the weight vector for the $i$'th label.

With this new formulation, softmax regression can be defined as $\mathrm{y} = \mathrm{softmax}(\mathrm{x} \cdot \mathrm{W}^T)$, and the optimal prediction can be obtained as $\mathrm{argmax}(\mathrm{y})$, which returns the label with the highest probability.
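
Using NumPy, this matrix formulation can be sketched as below; the use of NumPy is an implementation choice, and the bias terms are omitted to match the simplified equation above.

import numpy as np

def softmax_regression(x, W):
    # x: (n,) input vector, W: (m, n) weight matrix whose i'th row is w_i
    scores = x @ W.T          # (m,) class scores
    exps = np.exp(scores)
    return exps / exps.sum()  # softmax over the m classes

W = np.array([w1, w2, w3])    # rows follow the label order used above (y=1, y=0, y=2)
y = softmax_regression(np.array(x1, dtype=float), W)
k = int(np.argmax(y))         # row index of the most probable label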

What are the limitations of the softmax regression model?

Multilayer Perceptron

A multilayer perceptron (MLP) is a type of feedforward neural network consisting of multiple layers of neurons, where all neurons in one layer are fully connected to all neurons in its adjacent layers. Given an input vector $\mathrm{x} \in \mathbb{R}^{1 \times n}$ and an output vector $\mathrm{y} \in \mathbb{R}^{1 \times m}$, the model allows zero to many hidden layers that generate intermediate representations of the input.

Let $\mathrm{h} \in \mathbb{R}^{1 \times d}$ be a hidden layer between $\mathrm{x}$ and $\mathrm{y}$. To connect $\mathrm{x}$ and $\mathrm{h}$, we need a weight matrix $\mathrm{W}_x \in \mathbb{R}^{n \times d}$ such that $\mathrm{h} = \mathrm{activation}(\mathrm{x} \cdot \mathrm{W}_x)$, where $\mathrm{activation}()$ is an activation function applied to the output of each neuron. The activation function introduces non-linearity into the network, allowing it to learn complex patterns and relationships in the data; it determines whether a neuron should be activated, that is, whether the neuron's output should be passed on to the next layer.

$$
\mathrm{h} = \mathrm{activation}(\mathrm{x} \cdot \mathrm{W}_x)
$$

Similarly, to connect $\mathrm{h}$ and $\mathrm{y}$, we need a weight matrix $\mathrm{W}_h \in \mathbb{R}^{m \times d}$ such that $\mathrm{y} = \mathrm{softmax}(\mathrm{h} \cdot \mathrm{W}_h^T)$. Thus, a multilayer perceptron with one hidden layer can be represented as:

$$
\mathrm{y} = \mathrm{softmax}[\mathrm{activation}(\mathrm{x} \cdot \mathrm{W}_x) \cdot \mathrm{W}_h^T] = \mathrm{softmax}(\mathrm{h} \cdot \mathrm{W}_h^T)
$$
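
A forward pass through this one-hidden-layer MLP can be sketched with NumPy as follows; the activation is passed in as an argument since the equation leaves it unspecified, and biases are omitted as in the equation above.

import numpy as np

def mlp_forward(x, Wx, Wh, activation):
    # x: (n,), Wx: (n, d), Wh: (m, d); returns the (m,) output distribution
    h = activation(x @ Wx)    # hidden layer h = activation(x . Wx)
    scores = h @ Wh.T         # class scores h . Wh^T
    exps = np.exp(scores)
    return exps / exps.sum()  # softmax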

Notice that the above equation for MLP does not include bias terms. How are biases handled in light of this formulation?

Consider a corpus comprising the following five sentences and their corresponding labels ($\Rightarrow$):

D1: I love this movie $\Rightarrow$ positive

D2: I hate this movie $\Rightarrow$ negative

D3: I watched this movie $\Rightarrow$ neutral

D4: I truly love this movie $\Rightarrow$ very positive

D5: I truly hate this movie $\Rightarrow$ very negative

The input vectors $\mathrm{x}_{1..5}$ can be created using the bag-of-words model:

X = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie", 5: "watched", 6: "truly"}
Y = {0: "positive", 1: "negative", 2: "neutral", 3: "very positive", 4: "very negative"}

x1 = [1, 1, 0, 1, 1, 0, 0]
x2 = [1, 0, 1, 1, 1, 0, 0]
x3 = [1, 0, 0, 1, 1, 1, 0]
x4 = [1, 1, 0, 1, 1, 0, 1]
x5 = [1, 0, 1, 1, 1, 0, 1]

y1, y2, y3, y4, y5 = 0, 1, 2, 3, 4

What would be the weight assigned to the feature "truly" learned by softmax regression for the above example?

The first weight matrix $\mathrm{W}_x \in \mathbb{R}^{7 \times 5}$ can be trained by an MLP as follows:

Wx = [
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [1.0, 0.0, 0.0, 0.5, 0.0],
  [0.0, 1.0, 0.0, 0.0, 0.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 1.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 0.5, 0.5]
]

Given the values in $\mathrm{W}_x$, we can infer that the first, second, and third columns represent "love", "hate", and "watched", while the fourth and fifth columns learn combined features such as {"truly", "love"} and {"truly", "hate"}, respectively.

Each of $\mathrm{x}_{1..5}$ is multiplied by $\mathrm{W}_x$ to obtain the vectors $\mathrm{g}_{1..5}$ below, to which the following activation function is applied in order to produce the hidden layers $\mathrm{h}_{1..5}$:

$$
\mathrm{activation}(x) = \left\{ \begin{array}{ll} x & \text{if } x > 0.5\\ 0 & \text{otherwise} \end{array} \right.
$$

g1 = [1.0, 0.0, 0.0, 0.5, 0.0]
g2 = [0.0, 1.0, 0.0, 0.0, 0.5]
g3 = [0.0, 0.0, 1.0, 0.0, 0.0]
g4 = [1.0, 0.0, 0.0, 1.0, 0.5]
g5 = [0.0, 1.0, 0.0, 0.5, 1.0]

Since the activation keeps only values strictly greater than 0.5, every 0.5 entry in $\mathrm{g}_{1..5}$ is zeroed out in the corresponding hidden layer $\mathrm{h}_{1..5}$.

The second weight matrix $\mathrm{W}_h \in \mathbb{R}^{5 \times 5}$ can also be trained by an MLP as follows:

Wh = [
  [ 1.0, -1.0, 0.0, -0.5, -1.0],
  [-1.0,  1.0, 0.0, -1.0, -0.5],
  [-1.0, -1.0, 1.0, -1.0, -1.0],
  [ 0.0, -1.0, 0.0,  1.0, -1.0],
  [-1.0,  0.0, 0.0, -1.0,  1.0]
]

Multiplying each $\mathrm{h}_i$ by $\mathrm{W}_h^T$ yields the following vectors $\mathrm{o}_{1..5}$, and applying the softmax function to each $\mathrm{o}_i$ gives the corresponding output vector $\mathrm{y}_i$:

o1 = [ 1.0, -1.0, -1.0,  0.0, -1.0]
o2 = [-1.0,  1.0, -1.0, -1.0,  0.0]
o3 = [ 0.0,  0.0,  1.0,  0.0,  0.0]
o4 = [ 0.5, -2.0, -2.0,  1.0, -2.0]
o5 = [-2.0,  0.5, -2.0, -2.0,  1.0]

The prediction can be made by taking the argmax of each $\mathrm{y}_i$.
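
Assuming the mlp_forward sketch above and the matrices Wx and Wh from this example, the five predictions can be reproduced as follows; the threshold function implements the activation defined earlier.

import numpy as np

def threshold(z):
    # the activation above: keep values strictly greater than 0.5, zero the rest
    return np.where(z > 0.5, z, 0.0)

Wx_m = np.array(Wx)  # (7, 5)
Wh_m = np.array(Wh)  # (5, 5)

for i, x in enumerate([x1, x2, x3, x4, x5], start=1):
    y = mlp_forward(np.array(x, dtype=float), Wx_m, Wh_m, threshold)
    print(f"D{i} ->", Y[int(np.argmax(y))])  # D1 -> positive, ..., D5 -> very negative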

What are the limitations of a multilayer perceptron?

