Neural Networks

Logistic Regression

Let $\mathrm{x} = [x_1, \ldots, x_n]$ be a vector representing an input instance, where $x_i$ denotes the $i$'th feature of the input, and let $y \in \{0, 1\}$ be its corresponding output label. Logistic regression uses the logistic function, also known as the sigmoid function, to estimate the probability that $\mathrm{x}$ belongs to the class $y$:

\begin{align*}
P(y=1|\mathrm{x}) &= \frac{1}{1 + e^{-(\mathrm{x}\cdot\mathrm{w}^T+b)}}\\
P(y=0|\mathrm{x}) &= 1 - P(y=1|\mathrm{x})
\end{align*}

The weight vector $\mathrm{w} = [w_1, \ldots, w_n]$ assigns weights to each dimension of the input vector $\mathrm{x}$ for the label $y=1$, such that a higher magnitude of weight $w_i$ indicates greater importance of the feature $x_i$. Finally, $b$ represents the bias of the label $y=1$ within the training distribution.
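
The probability above is straightforward to compute; a minimal sketch in Python, where the helper names sigmoid and predict_positive are illustrative rather than from the text:

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real-valued score to the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_positive(x, w, b):
    # P(y=1|x) under logistic regression; P(y=0|x) = 1 - P(y=1|x).
    return sigmoid(x @ w + b)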


Consider a corpus consisting of two sentences:

D1: I love this movie

D2: I hate this movie

The input vectors $\mathrm{x}_1$ and $\mathrm{x}_2$ can be created for these two sentences using the bag-of-words model:

V = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie"}
x1 = [1, 1, 0, 1, 1]
x2 = [1, 0, 1, 1, 1]
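
These vectors can also be derived programmatically; a minimal sketch, assuming a whitespace tokenizer that lowercases the text (the helper bag_of_words is illustrative):

def bag_of_words(sentence, vocab):
    # Binary bag-of-words: 1 if the vocabulary term occurs in the sentence, 0 otherwise.
    tokens = sentence.lower().split()
    return [1 if term in tokens else 0 for term in vocab]  # vocab keys are in index order

vocab = {"i": 0, "love": 1, "hate": 2, "this": 3, "movie": 4}
x1 = bag_of_words("I love this movie", vocab)  # [1, 1, 0, 1, 1]
x2 = bag_of_words("I hate this movie", vocab)  # [1, 0, 1, 1, 1]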

Let $y_1 = 1$ and $y_2 = 0$ be the output labels of $\mathrm{x}_1$ and $\mathrm{x}_2$, representing positive and negative sentiments of the input sentences, respectively. Then, a weight vector $\mathrm{w}$ can be trained using logistic regression:

w = [0.0, 1.5, -1.5, 0.0, 0.0]
b = 0

Since the terms "I", "this", and "movie" appear with equal frequency across both labels, their weights w1w_1​, w4w_4​, and w5w_5​ are neutralized. On the other hand, the terms "love" and "hate" appear only with the positive and negative labels, respectively. Therefore, while the weight w1w_1​ for "love" (x1x_1​) contributes positively to the label y=1y=1, the weight w2w_2 for "hate" (x2x_2​) has a negative impact on the label y=1y=1. Furthermore, as positive and negative sentiment labels are equally presented in this corpus, the bias bb is also set to 0.

Given the weight vector and the bias, we have $\mathrm{x}_1 \cdot \mathrm{w}^T + b = 1.5$ and $\mathrm{x}_2 \cdot \mathrm{w}^T + b = -1.5$, resulting in the following probabilities:

\begin{align*}
P(y=1|\mathrm{x}_1) &\approx 0.82\\
P(y=1|\mathrm{x}_2) &\approx 0.18
\end{align*}

As the probability of $\mathrm{x}_1$ being $y=1$ exceeds $0.5$ (50%), the model predicts the first sentence to convey a positive sentiment. Conversely, the model predicts the second sentence to convey a negative sentiment, as its probability of being $y=1$ is below 50%.
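
These numbers can be verified with a quick check in Python, using the weight vector and bias from this example:

import numpy as np

x1 = np.array([1, 1, 0, 1, 1])
x2 = np.array([1, 0, 1, 1, 1])
w = np.array([0.0, 1.5, -1.5, 0.0, 0.0])
b = 0.0

for x in (x1, x2):
    z = x @ w + b                 # 1.5 for x1, -1.5 for x2
    p = 1.0 / (1.0 + np.exp(-z))  # ~0.82 for x1, ~0.18 for x2
    print(round(float(z), 2), round(float(p), 2))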


Softmax Regression

Softmax regression, also known as multinomial logistic regression, is an extension of logistic regression that handles classification problems with more than two classes. Given an input vector $\mathrm{x} \in \mathbb{R}^{1 \times n}$ and its output label $y \in \{0, \ldots, m-1\}$, the model uses the softmax function to estimate the probability that $\mathrm{x}$ belongs to each class:

P(y=k|\mathrm{x}) = \frac{e^{\mathrm{x}\cdot\mathrm{w}_k^T+b_k}}{\sum_{j=0}^{m-1} e^{\mathrm{x}\cdot\mathrm{w}_j^T+b_j}}

The weight vector $\mathrm{w}_k$ assigns weights to $\mathrm{x}$ for the label $y=k$, while $b_k$ represents the bias associated with the label $y=k$.
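
A minimal sketch of this computation, assuming the weight vectors are stacked row-wise into a matrix W and the biases into a vector b (the helper names softmax and predict_proba are illustrative):

import numpy as np

def softmax(z):
    # Normalized exponentials; subtracting the max improves numerical stability.
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_proba(x, W, b):
    # Returns P(y=k|x) for every label k, where row k of W is the weight vector w_k.
    return softmax(x @ W.T + b)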


Consider a corpus consisting of three sentences:

D1: I love this movie

D2: I hate this movie

D3: I watched this movie

Then, the input vectors $\mathrm{x}_1$, $\mathrm{x}_2$, and $\mathrm{x}_3$ for the sentences can be created using the bag-of-words model:

Let $y_1 = 1$, $y_2 = 0$, and $y_3 = 2$ be the output labels of $\mathrm{x}_1$, $\mathrm{x}_2$, and $\mathrm{x}_3$, representing positive, negative, and neutral sentiments of the input sentences, respectively. Then, weight vectors $\mathrm{w}_1$, $\mathrm{w}_2$, and $\mathrm{w}_3$ can be trained using softmax regression as follows:

Unlike the case of logistic regression, where all weights are oriented to $y=1$ (both $w_1$ and $w_2$ give positive and negative weights to $y=1$, respectively, but not to $y=0$), the values in each weight vector are oriented to its corresponding label.

Given the weight vectors and the biases, we can estimate the following probabilities for $\mathrm{x}_1$:

\begin{align*}
\mathrm{x}_1 \cdot \mathrm{w}_1^T + b_1 &= 1.5 &\Rightarrow\quad P(y=1|\mathrm{x}_1) &\approx 0.86\\
\mathrm{x}_1 \cdot \mathrm{w}_2^T + b_2 &= -1.0 &\Rightarrow\quad P(y=0|\mathrm{x}_1) &\approx 0.07\\
\mathrm{x}_1 \cdot \mathrm{w}_3^T + b_3 &= -1.0 &\Rightarrow\quad P(y=2|\mathrm{x}_1) &\approx 0.07
\end{align*}

Since the probability of $y=1$ is the highest among all labels, the model predicts the first sentence to convey a positive sentiment. For $\mathrm{x}_3$, the following probabilities can be estimated:

\begin{align*}
\mathrm{x}_3 \cdot \mathrm{w}_1^T + b_1 &= 0 &\Rightarrow\quad P(y=1|\mathrm{x}_3) &\approx 0.15\\
\mathrm{x}_3 \cdot \mathrm{w}_2^T + b_2 &= 0 &\Rightarrow\quad P(y=0|\mathrm{x}_3) &\approx 0.15\\
\mathrm{x}_3 \cdot \mathrm{w}_3^T + b_3 &= 1.5 &\Rightarrow\quad P(y=2|\mathrm{x}_3) &\approx 0.69
\end{align*}

Since the probability of $y=2$ is the highest among all labels, the model predicts the third sentence to convey a neutral sentiment.
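
Both sets of probabilities follow directly from the scores above; a quick check, where the entries of each array correspond to the rows of the equations (labels $y=1$, $y=0$, $y=2$):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

print(softmax(np.array([1.5, -1.0, -1.0])).round(2))  # x1: [0.86, 0.07, 0.07]
print(softmax(np.array([0.0, 0.0, 1.5])).round(2))    # x3: [0.15, 0.15, 0.69]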

Softmax regression always predicts $m$ values, so its output can be represented as a vector $\mathrm{y} \in \mathbb{R}^{1 \times m}$, wherein the $i$'th value of $\mathrm{y}$ contains the probability of the input belonging to the $i$'th class. Similarly, the weight vectors for all labels can be stacked into a weight matrix $\mathrm{W} \in \mathbb{R}^{m \times n}$, where the $i$'th row represents the weight vector for the $i$'th label.

With this new formulation, softmax regression can be defined as $\mathrm{y} = \mathrm{softmax}(\mathrm{x} \cdot \mathrm{W}^T)$, and the optimal prediction can be obtained as $\mathrm{argmax}(\mathrm{y})$, which returns the label with the highest probability.
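
A sketch of this vectorized formulation, using hypothetical dimensions (n = 5 features, m = 3 labels) and random weights purely for illustration:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = np.array([1.0, 1.0, 0.0, 1.0, 1.0])  # 1 x n input vector
W = rng.standard_normal((3, 5))          # m x n weight matrix; row i is the weight vector for label i
y = softmax(x @ W.T)                     # 1 x m vector of label probabilities
prediction = int(np.argmax(y))           # label with the highest probability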


What are the limitations of the softmax regression model?

Multilayer Perceptron

A multilayer perceptron (MLP) is a type of Feedforward Neural Network consisting of multiple layers of neurons, where all neurons in one layer are fully connected to all neurons in its adjacent layers. Given an input vector $\mathrm{x} \in \mathbb{R}^{1 \times n}$ and an output vector $\mathrm{y} \in \mathbb{R}^{1 \times m}$, the model allows zero to many hidden layers to generate intermediate representations of the input.

Let $\mathrm{h} \in \mathbb{R}^{1 \times d}$ be a hidden layer between $\mathrm{x}$ and $\mathrm{y}$. To connect $\mathrm{x}$ and $\mathrm{h}$, we need a weight matrix $\mathrm{W}_x \in \mathbb{R}^{n \times d}$ such that $\mathrm{h} = \mathrm{activation}(\mathrm{x} \cdot \mathrm{W}_x)$, where $\mathrm{activation}()$ is an activation function applied to the output of each neuron; it introduces non-linearity into the network, allowing it to learn complex patterns and relationships in the data. Activation functions determine whether a neuron should be activated, that is, whether its output should be passed on to the next layer.

\mathrm{h} = \mathrm{activation}(\mathrm{x} \cdot \mathrm{W}_x)

Similarly, to connect $\mathrm{h}$ and $\mathrm{y}$, we need a weight matrix $\mathrm{W}_h \in \mathbb{R}^{m \times d}$ such that $\mathrm{y} = \mathrm{softmax}(\mathrm{h} \cdot \mathrm{W}_h^T)$. Thus, a multilayer perceptron with one hidden layer can be represented as:

\mathrm{y} = \mathrm{softmax}[\mathrm{activation}(\mathrm{x} \cdot \mathrm{W}_x) \cdot \mathrm{W}_h^T] = \mathrm{softmax}(\mathrm{h} \cdot \mathrm{W}_h^T)
Figure: Types of decision regions that can be formed by different layers of an MLP [1]
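
A sketch of this one-hidden-layer forward pass, using ReLU as the activation function and random weights purely for illustration (the actual activation and trained weights depend on the model):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mlp_forward(x, W_x, W_h):
    # h = activation(x . W_x), then y = softmax(h . W_h^T); ReLU is one common activation choice.
    h = np.maximum(0.0, x @ W_x)   # 1 x d hidden representation
    return softmax(h @ W_h.T)      # 1 x m output probabilities

rng = np.random.default_rng(0)
n, d, m = 7, 5, 5                             # hypothetical dimensions
x = rng.integers(0, 2, size=n).astype(float)  # bag-of-words style input
W_x = rng.standard_normal((n, d))             # input-to-hidden weights (n x d)
W_h = rng.standard_normal((m, d))             # hidden-to-output weights (m x d)
y = mlp_forward(x, W_x, W_h)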

Consider a corpus comprising the following five sentences and their corresponding labels (indicated by $\Rightarrow$):

D1: I love this movie $\Rightarrow$ positive

D2: I hate this movie $\Rightarrow$ negative

D3: I watched this movie $\Rightarrow$ neutral

D4: I truly love this movie $\Rightarrow$ very positive

D5: I truly hate this movie $\Rightarrow$ very negative

The input vectors $\mathrm{x}_{1..5}$ can be created using the bag-of-words model:
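
The five sentences share seven distinct terms, so each input vector has seven dimensions; a minimal sketch with an assumed vocabulary ordering:

vocab = {"i": 0, "love": 1, "hate": 2, "watched": 3, "this": 4, "movie": 5, "truly": 6}

def bag_of_words(sentence):
    # Binary bag-of-words over the seven-term vocabulary above.
    tokens = sentence.lower().split()
    return [1 if term in tokens else 0 for term in vocab]

X = [bag_of_words(s) for s in (
    "I love this movie",
    "I hate this movie",
    "I watched this movie",
    "I truly love this movie",
    "I truly hate this movie",
)]
# e.g., X[0] = [1, 1, 0, 0, 1, 1, 0] and X[3] = [1, 1, 0, 0, 1, 1, 1]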


The first weight matrix $\mathrm{W}_x \in \mathbb{R}^{7 \times 5}$ can be trained by an MLP as follows:

Given the values in $\mathrm{W}_x$, we can infer that the first, second, and third columns represent "love", "hate", and "watched", while the fourth and fifth columns learn combined features such as {"truly", "love"} and {"truly", "hate"}, respectively.

Each of $\mathrm{x}_{1..5}$ is multiplied by $\mathrm{W}_x$ to obtain the corresponding hidden layer $\mathrm{h}_{1..5}$, where the activation function is defined as follows:

\mathrm{activation}(x) = \left\{
\begin{array}{ll}
x & \text{if}\ x > 0.5\\
0 & \text{otherwise}
\end{array}
\right.
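
This thresholded activation is easy to express in code; a minimal sketch:

import numpy as np

def activation(z):
    # Passes a hidden value through only if it exceeds 0.5; otherwise outputs 0.
    return np.where(z > 0.5, z, 0.0)

# e.g., activation(np.array([0.2, 0.7, 1.0, 0.4])) -> array([0. , 0.7, 1. , 0. ])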

The second weight matrix $\mathrm{W}_h \in \mathbb{R}^{5 \times 5}$ can also be trained by an MLP as follows:

By applying the softmax function to each $\mathrm{h}_i \cdot \mathrm{W}_h^T$, we obtain the corresponding output vector $\mathrm{y}_i$:

The prediction can be made by taking the argmax of each $\mathrm{y}_i$.


References
