Neural Networks
Logistic Regression
Let $x = (x_1, \ldots, x_n)$ be a vector representing an input instance, where $x_i$ denotes the $i$'th feature of the input, and let $y \in \{0, 1\}$ be its corresponding output label. Logistic regression uses the logistic function, aka. the sigmoid function, to estimate the probability that $x$ belongs to $y$:

$$P(y = 1 \mid x) = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}$$
The weight vector $w = (w_1, \ldots, w_n)$ assigns a weight to each dimension of the input vector $x$ for the label $y$, such that a higher magnitude of the weight $w_i$ indicates greater importance of the feature $x_i$. Finally, $b$ represents the bias of the label $y$ within the training distribution.
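To make the estimation concrete, here is a minimal sketch in Python with NumPy; the function names are illustrative, not part of the original formulation:

```python
import numpy as np

def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function: squashes any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """Estimates P(y = 1 | x) = sigmoid(w . x + b)."""
    return sigmoid(np.dot(w, x) + b)
```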
What role does the sigmoid function play in the logistic regression model?
Consider a corpus consisting of two sentences:
D1: I love this movie
D2: I hate this movie
The input vectors $x_1$ and $x_2$ can be created for these two sentences using the bag-of-words model, with the vocabulary ordered as (I, love, hate, this, movie):

$$x_1 = (1, 1, 0, 1, 1) \qquad x_2 = (1, 0, 1, 1, 1)$$
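A minimal sketch of this vectorization, assuming the vocabulary order above (the helper name `bag_of_words` is hypothetical):

```python
# Vocabulary order assumed in this example: (I, love, hate, this, movie).
vocab = ["I", "love", "hate", "this", "movie"]

def bag_of_words(sentence: str) -> list[int]:
    """Counts how many times each vocabulary term occurs in the sentence."""
    tokens = sentence.split()
    return [tokens.count(term) for term in vocab]

x1 = bag_of_words("I love this movie")  # [1, 1, 0, 1, 1]
x2 = bag_of_words("I hate this movie")  # [1, 0, 1, 1, 1]
```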
Let $y_1 = 1$ and $y_2 = 0$ be the output labels of $x_1$ and $x_2$, representing positive and negative sentiments of the input sentences, respectively. Then, a weight vector $w$ can be trained using logistic regression:

$$w = (0, 1, -1, 0, 0)$$
Since the terms "I", "this", and "movie" appear with equal frequency across both labels, their weights $w_1$, $w_4$, and $w_5$ are neutralized to $0$. On the other hand, the terms "love" and "hate" appear only with the positive and negative labels, respectively. Therefore, while the weight for "love" ($w_2 = 1$) contributes positively to the label $y = 1$, the weight for "hate" ($w_3 = -1$) has a negative impact on the label $y = 1$. Furthermore, as the positive and negative sentiment labels are equally represented in this corpus, the bias is also set to $b = 0$.
Given the weight vector and the bias, we have $w \cdot x_1 + b = 1$ and $w \cdot x_2 + b = -1$, resulting in the following probabilities:

$$P(y = 1 \mid x_1) = \sigma(1) \approx 0.73 \qquad P(y = 1 \mid x_2) = \sigma(-1) \approx 0.27$$
As the probability of $x_1$ being $y = 1$ exceeds $0.5$ (50%), the model predicts the first sentence to convey a positive sentiment. Conversely, the model predicts the second sentence to convey a negative sentiment, as its probability of being $y = 1$ falls below 50%.
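These numbers can be verified with a short sketch, reusing the weight vector and bias from above:

```python
import numpy as np

w = np.array([0.0, 1.0, -1.0, 0.0, 0.0])  # weights for (I, love, hate, this, movie)
b = 0.0

x1 = np.array([1, 1, 0, 1, 1])  # "I love this movie"
x2 = np.array([1, 0, 1, 1, 1])  # "I hate this movie"

for x in (x1, x2):
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))  # sigmoid(w . x + b)
    print(round(p, 2))  # 0.73 for x1 -> positive; 0.27 for x2 -> negative
```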
Under what circumstances would the bias be negative in the above example? Additionally, when might neutral terms such as "this" or "movie" exhibit non-neutral weights?
Softmax Regression
Softmax regression, aka. multinomial logistic regression, is an extension of logistic regression to handle classification problems with more than two classes. Given an input vector $x \in \mathbb{R}^n$ and its output label $y \in \{1, \ldots, k\}$, the model uses the softmax function to estimate the probability that $x$ belongs to each class separately:

$$P(y = i \mid x) = \frac{e^{w_i \cdot x + b_i}}{\sum_{j=1}^{k} e^{w_j \cdot x + b_j}}$$
The weight vector $w_i$ assigns weights to $x$ for the label $i$, while $b_i$ represents the bias associated with the label $i$.
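A minimal sketch of the softmax function and the resulting class probabilities; the names are illustrative, and subtracting the maximum score is a standard numerical-stability trick rather than part of the definition above:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Exponentiates each score and normalizes so that the outputs sum to 1."""
    e = np.exp(z - z.max())  # max-subtraction for numerical stability
    return e / e.sum()

def predict_proba(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Estimates P(y = i | x) for every label i, where row i of W is w_i."""
    return softmax(W @ x + b)
```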
What is the role of the softmax function in the softmax regression model? How does it differ from the sigmoid function?
Consider a corpus consisting of three sentences:
D1: I love this movie
D2: I hate this movie
D3: I watched this movie
Then, the input vectors $x_1$, $x_2$, and $x_3$ for the sentences can be created using the bag-of-words model, with the vocabulary ordered as (I, love, hate, watched, this, movie):

$$x_1 = (1, 1, 0, 0, 1, 1) \qquad x_2 = (1, 0, 1, 0, 1, 1) \qquad x_3 = (1, 0, 0, 1, 1, 1)$$
Let $y_1$, $y_2$, and $y_3$ be the output labels of $x_1$, $x_2$, and $x_3$, representing positive, negative, and neutral sentiments of the input sentences, respectively. Then, weight vectors $w_{pos}$, $w_{neg}$, and $w_{neu}$ can be trained using softmax regression as follows, with all biases set to $0$:

$$w_{pos} = (0, 1, 0, 0, 0, 0) \qquad w_{neg} = (0, 0, 1, 0, 0, 0) \qquad w_{neu} = (0, 0, 0, 1, 0, 0)$$
Unlike the case of logistic regression, where all weights are oriented to $y = 1$ (both $w_2$ and $w_3$ give positive and negative weights to $y = 1$, respectively, but not to $y = 0$), the values in each weight vector are oriented to its corresponding label.
Given the weight vectors and the biases, we can estimate the following probabilities for $x_1$:

$$P(y = pos \mid x_1) \approx 0.58 \qquad P(y = neg \mid x_1) \approx 0.21 \qquad P(y = neu \mid x_1) \approx 0.21$$
Since the probability of $y = pos$ is the highest among all labels, the model predicts the first sentence to convey a positive sentiment. For $x_3$, the following probabilities can be estimated:

$$P(y = pos \mid x_3) \approx 0.21 \qquad P(y = neg \mid x_3) \approx 0.21 \qquad P(y = neu \mid x_3) \approx 0.58$$
Since the probability of $y = neu$ is the highest among all labels, the model predicts the third sentence to convey a neutral sentiment.
Softmax regression estimates a probability for every class, so its prediction is represented by an output vector $o \in \mathbb{R}^k$, wherein the $i$'th value in $o$ contains the probability of the input belonging to the $i$'th class. Similarly, the weight vectors for all labels can be stacked into a weight matrix $W \in \mathbb{R}^{k \times n}$, where the $i$'th row represents the weight vector for the $i$'th label.
With this new formulation, softmax regression can be defined as $o = \text{softmax}(Wx + b)$, where $b \in \mathbb{R}^k$ stacks the label biases, and the optimal prediction can be achieved as $\hat{y} = \arg\max_i o_i$, which returns the label with the highest probability.
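Here is a sketch of this matrix formulation applied to the corpus above; the label order (positive, negative, neutral) and the vocabulary order are the assumptions used throughout this example:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

# Rows of W stack the weight vectors for (positive, negative, neutral);
# columns follow the vocabulary order (I, love, hate, watched, this, movie).
W = np.array([[0, 1, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 0, 0]], dtype=float)
b = np.zeros(3)

x1 = np.array([1, 1, 0, 0, 1, 1])  # "I love this movie"
x3 = np.array([1, 0, 0, 1, 1, 1])  # "I watched this movie"

o1 = softmax(W @ x1 + b)  # ~[0.58, 0.21, 0.21]
o3 = softmax(W @ x3 + b)  # ~[0.21, 0.21, 0.58]
print(np.argmax(o1), np.argmax(o3))  # 0 (positive), 2 (neutral)
```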
What are the limitations of the softmax regression model?
Multilayer Perceptron
A multilayer perceptron (MLP) is a type of feedforward neural network consisting of multiple layers of neurons, where all neurons in one layer are fully connected to all neurons in its adjacent layers. Given an input vector $x \in \mathbb{R}^n$ and an output vector $o \in \mathbb{R}^k$, the model allows zero to many hidden layers to generate intermediate representations of the input.
Let $h \in \mathbb{R}^m$ be a hidden layer between $x$ and $o$. To connect $x$ and $h$, we need a weight matrix $W^{(1)} \in \mathbb{R}^{m \times n}$ such that $h = f(W^{(1)} x)$, where $f$ is an activation function applied to the output of each neuron; it introduces non-linearity into the network, allowing it to learn complex patterns and relationships in the data. Activation functions determine whether a neuron should be activated or not, that is, whether the neuron's output should be passed on to the next layer.
Similarly, to connect $h$ and $o$, we need a weight matrix $W^{(2)} \in \mathbb{R}^{k \times m}$ such that $o = \text{softmax}(W^{(2)} h)$. Thus, a multilayer perceptron with one hidden layer can be represented as:

$$o = \text{softmax}(W^{(2)} f(W^{(1)} x))$$
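A minimal sketch of this one-hidden-layer forward pass; ReLU is shown only as a common example of the activation function $f$, not as the one used in the example below:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def relu(z: np.ndarray) -> np.ndarray:
    """A common activation function: passes positive scores, zeroes out the rest."""
    return np.maximum(0.0, z)

def mlp_forward(x: np.ndarray, W1: np.ndarray, W2: np.ndarray) -> np.ndarray:
    """One-hidden-layer MLP: o = softmax(W2 f(W1 x))."""
    h = relu(W1 @ x)        # hidden representation of the input
    return softmax(W2 @ h)  # output probabilities over the labels
```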
Notice that the above equation for MLP does not include bias terms. How are biases handled in light of this formulation?
Consider a corpus comprising the following five sentences and their corresponding labels:

D1: I love this movie (positive)
D2: I hate this movie (negative)
D3: I watched this movie (neutral)
D4: I truly love this movie (very positive)
D5: I truly hate this movie (very negative)
The input vectors $x_1, \ldots, x_5$ can be created using the bag-of-words model, with the vocabulary ordered as (I, love, hate, watched, truly, this, movie):

$$\begin{aligned} x_1 &= (1, 1, 0, 0, 0, 1, 1) \\ x_2 &= (1, 0, 1, 0, 0, 1, 1) \\ x_3 &= (1, 0, 0, 1, 0, 1, 1) \\ x_4 &= (1, 1, 0, 0, 1, 1, 1) \\ x_5 &= (1, 0, 1, 0, 1, 1, 1) \end{aligned}$$
What would be the weight assigned to the feature "truly" learned by softmax regression for the above example?
The first weight matrix $W^{(1)} \in \mathbb{R}^{5 \times 7}$ can be trained by an MLP as follows, shown transposed so that each row corresponds to a vocabulary term (I, love, hate, watched, truly, this, movie) and each column to a hidden dimension:

$$W^{(1)\top} = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0.5 & 0 \\ 0 & 1 & 0 & 0 & 0.5 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0.5 & 0.5 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}$$
Given the values in $W^{(1)}$, we can infer that the first, second, and third columns represent "love", "hate", and "watched", while the fourth and fifth columns learn combined features such as {"truly", "love"} and {"truly", "hate"}, respectively.
Each of $x_1, \ldots, x_5$ is multiplied by $W^{(1)}$ to achieve the hidden layer $h_i = f(W^{(1)} x_i)$, respectively, where the activation function $f$ is designed as follows:

$$f(z) = \begin{cases} 1 & \text{if } z \ge 1 \\ 0 & \text{otherwise} \end{cases}$$
The second weight matrix $W^{(2)} \in \mathbb{R}^{5 \times 5}$ can also be trained by an MLP as follows, where the rows correspond to the labels (positive, negative, neutral, very positive, very negative) and the columns to the five hidden dimensions:

$$W^{(2)} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 0 & 2 \end{pmatrix}$$
By applying the softmax function to each $W^{(2)} h_i$, we achieve the corresponding output vector $o_i$:

$$\begin{aligned} o_1 &\approx (0.40, 0.15, 0.15, 0.15, 0.15) \\ o_2 &\approx (0.15, 0.40, 0.15, 0.15, 0.15) \\ o_3 &\approx (0.15, 0.15, 0.40, 0.15, 0.15) \\ o_4 &\approx (0.21, 0.08, 0.08, 0.56, 0.08) \\ o_5 &\approx (0.08, 0.21, 0.08, 0.08, 0.56) \end{aligned}$$
The prediction can be made by taking the $\arg\max$ of each $o_i$.
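The full forward pass for this example can be sketched as follows, using the weight matrices and activation function defined above:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def f(z: np.ndarray) -> np.ndarray:
    """The thresholded activation defined above: 1 if z >= 1, else 0."""
    return (z >= 1).astype(float)

# Bag-of-words inputs; vocabulary order: (I, love, hate, watched, truly, this, movie).
X = np.array([
    [1, 1, 0, 0, 0, 1, 1],  # D1: I love this movie
    [1, 0, 1, 0, 0, 1, 1],  # D2: I hate this movie
    [1, 0, 0, 1, 0, 1, 1],  # D3: I watched this movie
    [1, 1, 0, 0, 1, 1, 1],  # D4: I truly love this movie
    [1, 0, 1, 0, 1, 1, 1],  # D5: I truly hate this movie
], dtype=float)

# W1 (5 x 7): each row detects one hidden feature:
# love, hate, watched, truly AND love, truly AND hate.
W1 = np.array([
    [0, 1.0, 0,   0,   0,   0, 0],
    [0, 0,   1.0, 0,   0,   0, 0],
    [0, 0,   0,   1.0, 0,   0, 0],
    [0, 0.5, 0,   0,   0.5, 0, 0],
    [0, 0,   0.5, 0,   0.5, 0, 0],
])

# W2 (5 x 5): rows score the labels
# (positive, negative, neutral, very positive, very negative).
W2 = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 2, 0],
    [0, 0, 0, 0, 2],
], dtype=float)

labels = ["positive", "negative", "neutral", "very positive", "very negative"]
for i, x in enumerate(X, start=1):
    o = softmax(W2 @ f(W1 @ x))  # o = softmax(W2 f(W1 x))
    print(f"D{i}: {labels[np.argmax(o)]}")  # D1: positive ... D5: very negative
```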
What are the limitations of a multilayer perceptron?