Neural Networks
Logistic Regression
Let x=[x1,…,xn] be a vector representing an input instance, where xi denotes the i'th feature of the input, and let y∈{0,1} be its corresponding output label. Logistic regression uses the logistic function, aka. the sigmoid function, to estimate the probability that x belongs to the positive class (y=1):
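P(y=1|x) = σ(x⋅wT + b) = 1 / (1 + exp(−(x⋅wT + b))), where σ denotes the sigmoid function.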
The weight vector w=[w1,…,wn] assigns weights to each dimension of the input vector x for the label y=1 such that a higher magnitude of weight wi indicates greater importance of the feature xi. Finally, b represents the bias of the label y=1 within the training distribution.
Q7: What role does the sigmoid function play in the logistic regression model?
Consider a corpus consisting of two sentences:
D1: I love this movie
D2: I hate this movie
The input vectors x1 and x2 can be created for these two sentences using the bag-of-words model:
V = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie"}
x1 = [1, 1, 0, 1, 1]
x2 = [1, 0, 1, 1, 1]
Let y1=1 and y2=0 be the output labels of x1 and x2, representing positive and negative sentiments of the input sentences, respectively. Then, a weight vector w can be trained using logistic regression:
w = [0.0, 1.5, -1.5, 0.0, 0.0]
b = 0
Since the terms "I", "this", and "movie" appear with equal frequency across both labels, their weights w1, w4, and w5 are neutralized. On the other hand, the terms "love" and "hate" appear only with the positive and negative labels, respectively. Therefore, while the weight w2 for "love" (x2) contributes positively to the label y=1, the weight w3 for "hate" (x3) has a negative impact on the label y=1. Furthermore, as positive and negative sentiment labels are equally represented in this corpus, the bias b is also set to 0.
Given the weight vector and the bias, we have x1⋅wT+b=1.5 and x2⋅wT+b=−1.5, resulting in the following probabilities:
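P(y=1|x1) = σ(1.5) ≈ 0.82
P(y=1|x2) = σ(−1.5) ≈ 0.18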
As the probability of x1 being y=1 exceeds 0.5 (50%), the model predicts the first sentence to convey a positive sentiment. Conversely, the model predicts the second sentence to convey a negative sentiment as its probability of being y=1 is below 50%.
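To make the computation concrete, here is a minimal sketch in Python (using NumPy) that reproduces the prediction above from the bag-of-words vectors, the weight vector w, and the bias b given in this example; it only performs inference, not training.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-z))

# Bag-of-words vectors for D1 ("I love this movie") and D2 ("I hate this movie")
x1 = np.array([1, 1, 0, 1, 1])
x2 = np.array([1, 0, 1, 1, 1])

# Weight vector and bias from the example above
w = np.array([0.0, 1.5, -1.5, 0.0, 0.0])
b = 0.0

for name, x in [("D1", x1), ("D2", x2)]:
    p = sigmoid(x @ w + b)                      # P(y=1 | x)
    label = "positive" if p > 0.5 else "negative"
    print(f"{name}: P(y=1|x) = {p:.2f} -> {label}")
# D1: P(y=1|x) = 0.82 -> positive
# D2: P(y=1|x) = 0.18 -> negative
```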
Q8: Under what circumstances would the bias b be negative in the above example? Additionally, when might neutral terms such as "this" or "movie" exhibit non-neutral weights?
Softmax Regression
Softmax regression, aka. multinomial logistic regression, is an extension of logistic regression to handle classification problems with more than two classes. Given an input vector x∈R1×n and its output label y∈{0,…,m−1}, the model uses the softmax function to estimate the probability that x belongs to each class separately:
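P(y=k|x) = exp(x⋅wkT + bk) / Σj exp(x⋅wjT + bj), where the sum in the denominator runs over all labels j∈{0,…,m−1}.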
The weight vector wk assigns weights to x for the label y=k, while bk represents the bias associated with the label y=k.
Q9: What is the role of the softmax function in the softmax regression model? How does it differ from the sigmoid function?
Consider a corpus consisting of three sentences:
D1: I love this movie
D2: I hate this movie
D3: I watched this movie
Then, the input vectors x1, x2, and x3 for the sentences can be created using the bag-of-words model:
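The original vectors are not reproduced here, but assuming the vocabulary from the previous example is simply extended with the new term "watched" (the index order below is an assumption), they would look like:
V = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie", 5: "watched"}
x1 = [1, 1, 0, 1, 1, 0]
x2 = [1, 0, 1, 1, 1, 0]
x3 = [1, 0, 0, 1, 1, 1]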
Let y1=1, y2=0, and y3=2 be the output labels of x1, x2, and x3, representing positive, negative, and neutral sentiments of the input sentences, respectively. Then, weight vectors w1, w2, and w3 can be trained using softmax regression as follows:
Unlike logistic regression, where all weights are oriented toward y=1 (the weights for "love" and "hate" contribute positive and negative values to y=1, respectively, but not to y=0), the values in each weight vector here are oriented toward its corresponding label.
Given the weight vectors and the biases, we can estimate the following probabilities for x1:
Since the probability of y=1 is the highest among all labels, the model predicts the first sentence to convey a positive sentiment. For x3, the following probabilities can be estimated:
Since the probability of y=2 is the highest among all labels, the model predicts the third sentence to convey a neutral sentiment.
Softmax regression always predicts m values so that it is represented by an output vector y∈R1×m, wherein the i'th value in y contains the probability of the input belonging to the i'th class. Similarly, the weight vectors for all labels can be stacked into a weight matrix W∈Rm×n, where the i'th row represents the weight vector for the i'th label.
With this new formulation, softmax regression can be defined as y=softmax(x⋅WT), and the optimal prediction can be obtained as argmax(y), which returns the label with the highest probability.
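As a minimal sketch of this matrix formulation, the snippet below computes y = softmax(x⋅WT) for the third sentence. The vocabulary order follows the assumption above, and the weight matrix W is a hypothetical illustration rather than the actual trained values; only the shapes and the argmax-based prediction mirror the formulation.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Assumed vocabulary order: {I, love, hate, this, movie, watched}
x3 = np.array([1, 0, 0, 1, 1, 1])        # "I watched this movie"

# Hypothetical W in R^{3x6}: row k is the weight vector for label k
# (0 = negative, 1 = positive, 2 = neutral); values are illustrative only.
W = np.array([
    [0.0, -1.0,  1.0, 0.0, 0.0, 0.0],    # label 0: negative
    [0.0,  1.0, -1.0, 0.0, 0.0, 0.0],    # label 1: positive
    [0.0,  0.0,  0.0, 0.0, 0.0, 1.0],    # label 2: neutral
])

y = softmax(x3 @ W.T)                    # probability for each of the m labels
print(y, y.argmax())                     # highest probability at label 2 (neutral)
```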
What are the limitations of the softmax regression model?
Multilayer Perceptron
A multilayer perceptron (MLP) is a type of feedforward neural network consisting of multiple layers of neurons, where all neurons in one layer are fully connected to all neurons in its adjacent layers. Given an input vector x∈R1×n and an output vector y∈R1×m, the model allows zero to many hidden layers to generate intermediate representations of the input.
Let h∈R1×d be a hidden layer between x and y. To connect x and h, we need a weight matrix Wx∈Rn×d such that h=activation(x⋅Wx), where activation() is an activation function applied to the output of each neuron; it introduces non-linearity into the network, allowing it to learn complex patterns and relationships in the data. Activation functions determine whether a neuron should be activated or not, implying whether or not the neuron's output should be passed on to the next layer.
Similarly, to connect h and y, we need a weight matrix Wh∈Rm×d such that y=softmax(h⋅WhT). Thus, a multilayer perceptron with one hidden layer can be represented as:
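y = softmax(activation(x⋅Wx)⋅WhT)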

Q10: Notice that the above equation for MLP does not include bias terms. How are biases handled in light of this formulation?
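Below is a minimal sketch of this one-hidden-layer forward pass in Python. The dimensions (n=7, d=5, m=5), the random weight values, and the choice of ReLU as the activation function are assumptions made only for illustration; the bias terms are omitted, mirroring the equation above.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def relu(z):
    """ReLU, used here as the activation function (an assumption for illustration)."""
    return np.maximum(0.0, z)

def mlp_forward(x, Wx, Wh):
    """One-hidden-layer MLP: y = softmax(activation(x . Wx) . Wh^T)."""
    h = relu(x @ Wx)            # hidden representation, shape (1, d)
    return softmax(h @ Wh.T)    # output distribution, shape (1, m)

# Assumed dimensions for illustration: n=7 input features, d=5 hidden units, m=5 labels
rng = np.random.default_rng(0)
Wx = rng.normal(size=(7, 5))    # connects x (1x7) to h (1x5)
Wh = rng.normal(size=(5, 5))    # connects h (1x5) to y (1x5)

x = np.array([[1, 1, 0, 1, 1, 0, 0]])   # a bag-of-words input vector
y = mlp_forward(x, Wx, Wh)
print(y.shape, y.argmax())               # (1, 5) and the predicted label index
```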
Consider a corpus comprising the following five sentences and their corresponding labels (⇒):
D1: I love this movie ⇒ positive
D2: I hate this movie ⇒ negative
D3: I watched this movie ⇒ neutral
D4: I truly love this movie ⇒ very positive
D5: I truly hate this movie ⇒ very negative
The input vectors x1..5 can be created using the bag-of-words model:
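Assuming the vocabulary V = {0: "I", 1: "love", 2: "hate", 3: "watched", 4: "this", 5: "movie", 6: "truly"} (the index order is an assumption; the original vectors are not reproduced here), the vectors would be:
x1 = [1, 1, 0, 0, 1, 1, 0]
x2 = [1, 0, 1, 0, 1, 1, 0]
x3 = [1, 0, 0, 1, 1, 1, 0]
x4 = [1, 1, 0, 0, 1, 1, 1]
x5 = [1, 0, 1, 0, 1, 1, 1]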
Q11: What would be the weight assigned to the feature "truly" learned by softmax regression for the above example?
The first weight matrix Wx∈R7×5 can be trained by an MLP as follows:
Given the values in Wx, we can infer that the first, second, and third columns represent "love", "hate", and "watched", while the fourth and fifth columns learn combined features such as {"truly", "love"} and {"truly", "hate"}, respectively.
Each of x1..5 is multiplied by Wx to obtain the hidden layers h1..5, respectively, where the activation function is designed as follows:
The second weight matrix Wh∈R5×5 can also be trained by an MLP as follows:
By applying the softmax function to each hi⋅WhT, we obtain the corresponding output vector yi:
The prediction can be made by taking the argmax of each yi.
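To tie the pieces together, here is a minimal end-to-end sketch of this example. The values in Wx and Wh, the threshold activation, the vocabulary order, and the label order (0: negative, 1: positive, 2: neutral, 3: very positive, 4: very negative) are all hypothetical stand-ins chosen so that the column structure matches the description above; they are not the actual trained parameters.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def step(z, threshold=1.0):
    """Hypothetical threshold activation: a hidden neuron fires only if its input reaches 1."""
    return (z >= threshold).astype(float)

# Assumed vocabulary order: {I, love, hate, watched, this, movie, truly}
X = np.array([
    [1, 1, 0, 0, 1, 1, 0],   # D1: I love this movie
    [1, 0, 1, 0, 1, 1, 0],   # D2: I hate this movie
    [1, 0, 0, 1, 1, 1, 0],   # D3: I watched this movie
    [1, 1, 0, 0, 1, 1, 1],   # D4: I truly love this movie
    [1, 0, 1, 0, 1, 1, 1],   # D5: I truly hate this movie
])

# Hypothetical Wx in R^{7x5}: columns detect "love", "hate", "watched",
# {"truly","love"}, and {"truly","hate"}, matching the description above.
Wx = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0],   # I
    [1.0, 0.0, 0.0, 0.5, 0.0],   # love
    [0.0, 1.0, 0.0, 0.0, 0.5],   # hate
    [0.0, 0.0, 1.0, 0.0, 0.0],   # watched
    [0.0, 0.0, 0.0, 0.0, 0.0],   # this
    [0.0, 0.0, 0.0, 0.0, 0.0],   # movie
    [0.0, 0.0, 0.0, 0.5, 0.5],   # truly
])

# Hypothetical Wh in R^{5x5}: row k scores label k from the five hidden features.
# Assumed label order: 0 = negative, 1 = positive, 2 = neutral, 3 = very positive, 4 = very negative.
Wh = np.array([
    [-1.0,  2.0, 0.0,  0.0, -1.0],   # negative
    [ 2.0, -1.0, 0.0, -1.0,  0.0],   # positive
    [ 0.0,  0.0, 2.0,  0.0,  0.0],   # neutral
    [ 1.0,  0.0, 0.0,  2.0,  0.0],   # very positive
    [ 0.0,  1.0, 0.0,  0.0,  2.0],   # very negative
])

H = step(X @ Wx)          # hidden layers h1..5
Y = softmax(H @ Wh.T)     # output vectors y1..5
print(Y.argmax(axis=1))   # [1 0 2 3 4] -> positive, negative, neutral, very positive, very negative
```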
Q12: What are the limitations of a multilayer perceptron?