Neural Networks

Logistic Regression

Let $\mathrm{x} = [x_1, \ldots, x_n]$ be a vector representing an input instance, where $x_i$ denotes the $i$'th feature of the input, and let $y \in \{0, 1\}$ be its corresponding output label. Logistic regression uses the logistic function, aka. the sigmoid function, to estimate the probability that $\mathrm{x}$ belongs to $y$:

$$
\begin{align*}
P(y=1|\mathrm{x}) &= \frac{1}{1 + e^{-(\mathrm{x}\cdot\mathrm{w}^T+b)}}\\
P(y=0|\mathrm{x}) &= 1 - P(y=1|\mathrm{x})
\end{align*}
$$

The weight vector $\mathrm{w} = [w_1, \ldots, w_n]$ assigns a weight to each dimension of the input vector $\mathrm{x}$ for the label $y=1$, such that a higher magnitude of weight $w_i$ indicates greater importance of the feature $x_i$. Finally, $b$ represents the bias of the label $y=1$ within the training distribution.
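
The scoring step follows directly from the formula above. Below is a minimal sketch; the helper names `sigmoid` and `predict_proba` and the toy values are illustrative assumptions, not taken from the text:

import math

def sigmoid(z):
    # logistic (sigmoid) function: maps any real-valued score to (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    # P(y=1|x) = sigmoid(x . w^T + b)
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return sigmoid(z)

x = [1.0, 0.0, 1.0]      # toy input with n=3 features
w = [0.8, -0.5, 0.3]     # toy weight vector
b = -0.2                 # toy bias
p1 = predict_proba(x, w, b)
print(p1, 1 - p1)        # P(y=1|x) and P(y=0|x)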


Q7: What role does the sigmoid function play in the logistic regression model?

Consider a corpus consisting of two sentences:

D1: I love this movie

D2: I hate this movie

The input vectors $\mathrm{x}_1$ and $\mathrm{x}_2$ can be created for these two sentences using the bag-of-words model:

V = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie"}
x1 = [1, 1, 0, 1, 1]
x2 = [1, 0, 1, 1, 1]
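
This construction can be sketched in code as follows; the helper `bag_of_words` is an assumption of this sketch, not a function given in the text:

def bag_of_words(sentence, vocab):
    # binary bag-of-words: 1 if the vocabulary term occurs in the sentence, else 0
    tokens = set(sentence.split())
    return [1 if vocab[i] in tokens else 0 for i in range(len(vocab))]

V = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie"}
x1 = bag_of_words("I love this movie", V)   # -> [1, 1, 0, 1, 1]
x2 = bag_of_words("I hate this movie", V)   # -> [1, 0, 1, 1, 1]
print(x1, x2)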

Let $y_1 = 1$ and $y_2 = 0$ be the output labels of $\mathrm{x}_1$ and $\mathrm{x}_2$, representing positive and negative sentiments of the input sentences, respectively. Then, a weight vector $\mathrm{w}$ and a bias $b$ can be trained using logistic regression:

w = [0.0, 1.5, -1.5, 0.0, 0.0]
b = 0

Since the terms "I", "this", and "movie" appear with equal frequency across both labels, their weights $w_1$, $w_4$, and $w_5$ are neutralized. On the other hand, the terms "love" and "hate" appear only with the positive and negative labels, respectively. Therefore, while the weight for "love" ($w_2$) contributes positively to the label $y=1$, the weight for "hate" ($w_3$) has a negative impact on the label $y=1$. Furthermore, as positive and negative sentiment labels are equally represented in this corpus, the bias $b$ is also set to 0.

Given the weight vector and the bias, we have $\mathrm{x}_1 \cdot \mathrm{w}^T + b = 1.5$ and $\mathrm{x}_2 \cdot \mathrm{w}^T + b = -1.5$, resulting in the following probabilities:

$$
\begin{align*}
P(y=1|\mathrm{x}_1) &\approx 0.82\\
P(y=1|\mathrm{x}_2) &\approx 0.18
\end{align*}
$$

As the probability of $\mathrm{x}_1$ being $y=1$ exceeds $0.5$ (50%), the model predicts the first sentence to convey a positive sentiment. Conversely, the model predicts the second sentence to convey a negative sentiment, as its probability of being $y=1$ is below 50%.
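
These numbers can be reproduced with a short sketch using the vectors above; the `sigmoid` helper is an assumption of this sketch:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1 = [1, 1, 0, 1, 1]               # "I love this movie"
x2 = [1, 0, 1, 1, 1]               # "I hate this movie"
w  = [0.0, 1.5, -1.5, 0.0, 0.0]
b  = 0

for x in (x1, x2):
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    print(z, round(sigmoid(z), 2))  # 1.5 -> 0.82 and -1.5 -> 0.18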


Q8: Under what circumstances would the bias $b$ be negative in the above example? Additionally, when might neutral terms such as "this" or "movie" exhibit non-neutral weights?

Softmax Regression

Softmax regression, aka. multinomial logistic regression, is an extension of logistic regression that handles classification problems with more than two classes. Given an input vector $\mathrm{x} \in \mathbb{R}^{1 \times n}$ and its output label $y \in \{0, \ldots, m-1\}$, the model uses the softmax function to estimate the probability that $\mathrm{x}$ belongs to each class $k$ separately:

$$
P(y=k|\mathrm{x}) = \frac{e^{\mathrm{x}\cdot\mathrm{w}_k^T+b_k}}{\sum_{j=1}^m e^{\mathrm{x}\cdot\mathrm{w}_j^T+b_j}}
$$

The weight vector $\mathrm{w}_k$ assigns weights to $\mathrm{x}$ for the label $y=k$, while $b_k$ represents the bias associated with the label $y=k$.
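
A direct translation of this definition into code might look as follows; the helper name `softmax_proba` and the toy values are illustrative assumptions, not taken from the text:

import math

def softmax_proba(x, W, b):
    # one score x . w_k^T + b_k per class, then normalize with the softmax
    scores = [sum(xi * wi for xi, wi in zip(x, w_k)) + b_k for w_k, b_k in zip(W, b)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]   # probabilities over all m classes, summing to 1

x = [1.0, 2.0]                                   # toy input with n=2 features
W = [[0.5, -0.1], [-0.3, 0.2], [0.0, 0.0]]       # toy weight vectors for m=3 classes
b = [0.0, 0.1, -0.2]                             # toy biases
print(softmax_proba(x, W, b))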


Q9: What is the role of the softmax function in the softmax regression model? How does it differ from the sigmoid function?

Consider a corpus consisting of three sentences:

D1: I love this movie

D2: I hate this movie

D3: I watched this movie

Then, the input vectors $\mathrm{x}_1$, $\mathrm{x}_2$, and $\mathrm{x}_3$ for the sentences can be created using the bag-of-words model:

V = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie", 5: "watched"}
x1 = [1, 1, 0, 1, 1, 0]
x2 = [1, 0, 1, 1, 1, 0]
x3 = [1, 0, 0, 1, 1, 1]

Let $y_1 = 1$, $y_2 = 0$, and $y_3 = 2$ be the output labels of $\mathrm{x}_1$, $\mathrm{x}_2$, and $\mathrm{x}_3$, representing positive, negative, and neutral sentiments of the input sentences, respectively. Then, weight vectors $\mathrm{w}_1$, $\mathrm{w}_2$, and $\mathrm{w}_3$ can be trained using softmax regression as follows:

w1 = [0.0,  1.5, -1.0, 0.0, 0.0, 0.0]
w2 = [0.0, -1.0,  1.5, 0.0, 0.0, 0.0]
w3 = [0.0, -1.0, -1.0, 0.0, 0.0, 1.5]
b1 = b2 = b3 = 0

Unlike the case of logistic regression, where all weights are oriented to the label $y=1$ (the weights for "love" and "hate" give positive and negative weights to $y=1$, respectively, but none to $y=0$), the values in each weight vector are oriented to its corresponding label.

Given the weight vectors and the biases, we can estimate the following probabilities for $\mathrm{x}_1$:

$$
\begin{align*}
\mathrm{x}_1 \cdot \mathrm{w}_1^T + b_1 &= 1.5 &\Rightarrow\quad P(y=1|\mathrm{x}_1) &= 0.86\\
\mathrm{x}_1 \cdot \mathrm{w}_2^T + b_2 &= -1.0 &\Rightarrow\quad P(y=0|\mathrm{x}_1) &= 0.07\\
\mathrm{x}_1 \cdot \mathrm{w}_3^T + b_3 &= -1.0 &\Rightarrow\quad P(y=2|\mathrm{x}_1) &= 0.07
\end{align*}
$$

Since the probability of $y=1$ is the highest among all labels, the model predicts the first sentence to convey a positive sentiment. For $\mathrm{x}_3$, the following probabilities can be estimated:

$$
\begin{align*}
\mathrm{x}_3 \cdot \mathrm{w}_1^T + b_1 &= 0 &\Rightarrow\quad P(y=1|\mathrm{x}_3) &= 0.15\\
\mathrm{x}_3 \cdot \mathrm{w}_2^T + b_2 &= 0 &\Rightarrow\quad P(y=0|\mathrm{x}_3) &= 0.15\\
\mathrm{x}_3 \cdot \mathrm{w}_3^T + b_3 &= 1.5 &\Rightarrow\quad P(y=2|\mathrm{x}_3) &= 0.69
\end{align*}
$$

Since the probability of $y=2$ is the highest among all labels, the model predicts the third sentence to convey a neutral sentiment.
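
The scores and probabilities above can be reproduced with a short sketch reusing the vectors and weights from this example; the `softmax` helper is an assumption of this sketch:

import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x1 = [1, 1, 0, 1, 1, 0]
x3 = [1, 0, 0, 1, 1, 1]
W  = [[0.0,  1.5, -1.0, 0.0, 0.0, 0.0],   # w1 (positive)
      [0.0, -1.0,  1.5, 0.0, 0.0, 0.0],   # w2 (negative)
      [0.0, -1.0, -1.0, 0.0, 0.0, 1.5]]   # w3 (neutral)

for x in (x1, x3):
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in W]   # all biases are 0
    print(scores, [round(p, 2) for p in softmax(scores)])
# x1 -> scores [1.5, -1.0, -1.0]: the positive label has the highest probability (~0.86)
# x3 -> scores [0.0, 0.0, 1.5]:   the neutral  label has the highest probability (~0.69)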

Softmax regression always predicts $m$ values, so that its output can be represented by a vector $\mathrm{y} \in \mathbb{R}^{1 \times m}$, wherein the $i$'th value of $\mathrm{y}$ contains the probability of the input belonging to the $i$'th class. Similarly, the weight vectors for all labels can be stacked into a weight matrix $\mathrm{W} \in \mathbb{R}^{m \times n}$, where the $i$'th row represents the weight vector for the $i$'th label.

With this new formulation, softmax regression can be defined as $\mathrm{y} = \mathrm{softmax}(\mathrm{x} \cdot \mathrm{W}^T)$, and the optimal prediction can be obtained as $\mathrm{argmax}(\mathrm{y})$, which returns the label with the highest probability.
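
In this matrix form, the computation reduces to a single vector-matrix product followed by the softmax and an argmax. The sketch below reuses the weight vectors of the running example; numpy is used here purely for brevity (an assumption of this sketch, the text itself uses plain lists):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()

x = np.array([1, 1, 0, 1, 1, 0])                  # input vector (1 x n)
W = np.array([[0.0,  1.5, -1.0, 0.0, 0.0, 0.0],   # weight matrix (m x n),
              [0.0, -1.0,  1.5, 0.0, 0.0, 0.0],   # one row per label
              [0.0, -1.0, -1.0, 0.0, 0.0, 1.5]])

y = softmax(x @ W.T)                   # y = softmax(x . W^T)
print(y.round(2), np.argmax(y))        # argmax(y) is the predicted label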


What are the limitations of the softmax regression model?

Multilayer Perceptron

A multilayer perceptron (MLP) is a type of feedforward neural network consisting of multiple layers of neurons, where all neurons in one layer are fully connected to all neurons in its adjacent layers. Given an input vector $\mathrm{x} \in \mathbb{R}^{1 \times n}$ and an output vector $\mathrm{y} \in \mathbb{R}^{1 \times m}$, the model allows zero to many hidden layers to generate intermediate representations of the input.

Let $\mathrm{h} \in \mathbb{R}^{1 \times d}$ be a hidden layer between $\mathrm{x}$ and $\mathrm{y}$. To connect $\mathrm{x}$ and $\mathrm{h}$, we need a weight matrix $\mathrm{W}_x \in \mathbb{R}^{n \times d}$ such that $\mathrm{h} = \mathrm{activation}(\mathrm{x} \cdot \mathrm{W}_x)$, where $\mathrm{activation}()$ is an activation function applied to the output of each neuron; it introduces non-linearity into the network, allowing it to learn complex patterns and relationships in the data. Activation functions determine whether a neuron should be activated or not, i.e., whether the neuron's output should be passed on to the next layer.

Similarly, to connect $\mathrm{h}$ and $\mathrm{y}$, we need a weight matrix $\mathrm{W}_h \in \mathbb{R}^{m \times d}$ such that $\mathrm{y} = \mathrm{softmax}(\mathrm{h} \cdot \mathrm{W}_h^T)$. Thus, a multilayer perceptron with one hidden layer can be represented as:

$$
\mathrm{y} = \mathrm{softmax}[\mathrm{activation}(\mathrm{x} \cdot \mathrm{W}_x) \cdot \mathrm{W}_h^T] = \mathrm{softmax}(\mathrm{h} \cdot \mathrm{W}_h^T)
$$
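
A forward pass with one hidden layer can be sketched as follows; the shapes mirror the definitions above, while the ReLU activation and the random weights are assumptions made only for illustration:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def relu(z):
    # one common choice of activation function: max(0, z) elementwise
    return np.maximum(0.0, z)

n, d, m = 7, 5, 5                     # input, hidden, and output dimensions
rng = np.random.default_rng(0)
Wx = rng.normal(size=(n, d))          # connects x to h
Wh = rng.normal(size=(m, d))          # connects h to y

x = rng.integers(0, 2, size=n).astype(float)   # a bag-of-words-style toy input
h = relu(x @ Wx)                      # h = activation(x . Wx)
y = softmax(h @ Wh.T)                 # y = softmax(h . Wh^T)
print(y.round(2), np.argmax(y))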


Q10: Notice that the above equation for MLP does not include bias terms. How are biases handled in light of this formulation?

Consider a corpus comprising the following five sentences and their corresponding labels (shown after ⇒):

D1: I love this movie ⇒ positive

D2: I hate this movie ⇒ negative

D3: I watched this movie ⇒ neutral

D4: I truly love this movie ⇒ very positive

D5: I truly hate this movie ⇒ very negative

The input vectors $\mathrm{x}_{1..5}$ and their output labels can be created using the bag-of-words model:

X = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie", 5: "watched", 6: "truly"}
Y = {0: "positive", 1: "negative", 2: "neutral", 3: "very positive", 4: "very negative"}

x1 = [1, 1, 0, 1, 1, 0, 0]
x2 = [1, 0, 1, 1, 1, 0, 0]
x3 = [1, 0, 0, 1, 1, 1, 0]
x4 = [1, 1, 0, 1, 1, 0, 1]
x5 = [1, 0, 1, 1, 1, 0, 1]

y1, y2, y3, y4, y5 = 0, 1, 2, 3, 4


Q11: What would be the weight assigned to the feature "truly" learned by softmax regression for the above example?

The first weight matrix $\mathrm{W}_x \in \mathbb{R}^{7 \times 5}$ can be trained by an MLP as follows:

Wx = [
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [1.0, 0.0, 0.0, 0.5, 0.0],
  [0.0, 1.0, 0.0, 0.0, 0.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 1.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 0.5, 0.5]
]

Given the values in $\mathrm{W}_x$, we can infer that the first, second, and third columns represent "love", "hate", and "watched", while the fourth and fifth columns learn combined features such as {"truly", "love"} and {"truly", "hate"}, respectively.

Each of $\mathrm{x}_{1..5}$ is multiplied by $\mathrm{W}_x$ to obtain the corresponding hidden layer $\mathrm{h}_{1..5}$ (here $\mathrm{g}_i = \mathrm{x}_i \cdot \mathrm{W}_x$ denotes the pre-activation output), where the activation function is defined as follows:

$$
\mathrm{activation}(x) = \begin{cases} x & \text{if } x > 0.5\\ 0 & \text{otherwise} \end{cases}
$$

g1 = [1.0, 0.0, 0.0, 0.5, 0.0]
g2 = [0.0, 1.0, 0.0, 0.0, 0.5]
g3 = [0.0, 0.0, 1.0, 0.0, 0.0]
g4 = [1.0, 0.0, 0.0, 1.0, 0.5]
g5 = [0.0, 1.0, 0.0, 0.5, 1.0]
h1 = activation(g1) = [1.0, 0.0, 0.0, 0.0, 0.0]
h2 = activation(g2) = [0.0, 1.0, 0.0, 0.0, 0.0]
h3 = activation(g3) = [0.0, 0.0, 1.0, 0.0, 0.0]
h4 = activation(g4) = [1.0, 0.0, 0.0, 1.0, 0.0]
h5 = activation(g5) = [0.0, 1.0, 0.0, 0.0, 1.0]

The second weight matrix $\mathrm{W}_h \in \mathbb{R}^{5 \times 5}$ can also be trained by an MLP as follows:

Wh = [
  [ 1.0, -1.0, 0.0, -0.5, -1.0],
  [-1.0,  1.0, 0.0, -1.0, -0.5],
  [-1.0, -1.0, 1.0, -1.0, -1.0],
  [ 0.0, -1.0, 0.0,  1.0, -1.0],
  [-1.0,  0.0, 0.0, -1.0,  1.0]
]

By applying the softmax function to each $\mathrm{o}_i = \mathrm{h}_i \cdot \mathrm{W}_h^T$, we obtain the corresponding output vector $\mathrm{y}_i$:

o1 = [ 1.0, -1.0, -1.0,  0.0, -1.0]
o2 = [-1.0,  1.0, -1.0, -1.0,  0.0]
o3 = [ 0.0,  0.0,  1.0,  0.0,  0.0]
o4 = [ 0.5, -2.0, -2.0,  1.0, -2.0]
o5 = [-2.0,  0.5, -2.0, -2.0,  1.0]
y1 = softmax(o1) = [0.56, 0.08, 0.08, 0.21, 0.08]
y2 = softmax(o2) = [0.08, 0.56, 0.08, 0.08, 0.21]
y3 = softmax(o3) = [0.15, 0.15, 0.40, 0.15, 0.15]
y4 = softmax(o4) = [0.35, 0.03, 0.03, 0.57, 0.03]
y5 = softmax(o5) = [0.03, 0.35, 0.03, 0.03, 0.57]

The prediction can be made by taking the argmax of each $\mathrm{y}_i$.
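
The entire forward pass for this example can be reproduced with the following sketch, using the thresholded activation defined above and the matrices $\mathrm{W}_x$ and $\mathrm{W}_h$ from this section (numpy is used here for brevity):

import numpy as np

def activation(z):
    # thresholded identity from above: keep z where z > 0.5, otherwise 0
    return np.where(z > 0.5, z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

X = np.array([[1, 1, 0, 1, 1, 0, 0],   # x1 .. x5 stacked row-wise
              [1, 0, 1, 1, 1, 0, 0],
              [1, 0, 0, 1, 1, 1, 0],
              [1, 1, 0, 1, 1, 0, 1],
              [1, 0, 1, 1, 1, 0, 1]], dtype=float)

Wx = np.array([[0.0, 0.0, 0.0, 0.0, 0.0],
               [1.0, 0.0, 0.0, 0.5, 0.0],
               [0.0, 1.0, 0.0, 0.0, 0.5],
               [0.0, 0.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 0.0, 0.5, 0.5]])

Wh = np.array([[ 1.0, -1.0, 0.0, -0.5, -1.0],
               [-1.0,  1.0, 0.0, -1.0, -0.5],
               [-1.0, -1.0, 1.0, -1.0, -1.0],
               [ 0.0, -1.0, 0.0,  1.0, -1.0],
               [-1.0,  0.0, 0.0, -1.0,  1.0]])

for i, x in enumerate(X, start=1):
    h = activation(x @ Wx)             # hidden representation h_i
    y = softmax(h @ Wh.T)              # output distribution y_i
    print(f"x{i}: prediction = {np.argmax(y)}, y = {y.round(2)}")
# expected predictions: 0, 1, 2, 3, 4 (positive .. very negative)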


Q12: What are the limitations of a multilayer perceptron?

[Figure: Types of decision regions that can be formed by different layers of MLP [1]]

References

1. Neural Network Methodologies and their Potential Application to Cloud Pattern Recognition, J. E. Peak, Defense Technical Information Center, ADA239214, 1991.

V = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie"}
x1 = [1, 1, 0, 1, 1]
x2 = [1, 0, 1, 1, 1]
w = [0.0, 1.5, -1.5, 0.0, 0.0]
b = 0
V = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie", 5: "watched"}
x1 = [1, 1, 0, 1, 1, 0]
x2 = [1, 0, 1, 1, 1, 0]
x3 = [1, 0, 0, 1, 1, 1]
w1 = [0.0,  1.5, -1.0, 0.0, 0.0, 0.0]
w2 = [0.0, -1.0,  1.5, 0.0, 0.0, 0.0]
w3 = [0.0, -1.0, -1.0, 0.0, 0.0, 1.5]
b1 = b2 = b3 = 0
X = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie", 5: "watched", 6: "truly"}
Y = {0: "positive", 1: "negative", 2: "neutral", 3: "very positive", 4: "very negative"}

x1 = [1, 1, 0, 1, 1, 0, 0]
x2 = [1, 0, 1, 1, 1, 0, 0]
x3 = [1, 0, 0, 1, 1, 1, 0]
x4 = [1, 1, 0, 1, 1, 0, 1]
x5 = [1, 0, 1, 1, 1, 0, 1]

y1, y2, y3, y4, y5 = 0, 1, 2, 3, 4
Wx = [
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [1.0, 0.0, 0.0, 0.5, 0.0],
  [0.0, 1.0, 0.0, 0.0, 0.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 1.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 0.5, 0.5]
]
g1 = [1.0, 0.0, 0.0, 0.5, 0.0]
g2 = [0.0, 1.0, 0.0, 0.0, 0.5]
g3 = [0.0, 0.0, 1.0, 0.0, 0.0]
g4 = [1.0, 0.0, 0.0, 1.0, 0.5]
g5 = [0.0, 1.0, 0.0, 0.5, 1.0]
h1 = activation(g1) = [1.0, 0.0, 0.0, 0.0, 0.0]
h2 = activation(g2) = [0.0, 1.0, 0.0, 0.0, 0.0]
h3 = activation(g3) = [0.0, 0.0, 1.0, 0.0, 0.0]
h4 = activation(g4) = [1.0, 0.0, 0.0, 1.0, 0.0]
h5 = activation(g5) = [0.0, 1.0, 0.0, 0.0, 1.0]
Wh = [
  [ 1.0, -1.0, 0.0, -0.5, -1.0],
  [-1.0,  1.0, 0.0, -1.0, -0.5],
  [-1.0, -1.0, 1.0, -1.0, -1.0],
  [ 0.0, -1.0, 0.0,  1.0, -1.0],
  [-1.0,  0.0, 0.0, -1.0,  1.0]
]
o1 = [ 1.0, -1.0, -1.0,  0.0, -1.0]
o2 = [-1.0,  1.0, -1.0, -1.0,  0.0]
o3 = [ 0.0,  0.0,  1.0,  0.0,  0.0]
o4 = [ 0.5, -2.0, -2.0,  1.0, -2.0]
o5 = [-2.0,  0.5, -2.0, -2.0,  1.0]
y1 = softmax(o1) = [0.56, 0.08, 0.08, 0.21, 0.08]
y2 = softmax(o2) = [0.08, 0.56, 0.08, 0.08, 0.21]
y3 = softmax(o3) = [0.15, 0.15, 0.40, 0.15, 0.15]
y4 = softmax(o4) = [0.35, 0.03, 0.03, 0.57, 0.03]
y5 = softmax(o5) = [0.03, 0.35, 0.03, 0.03, 0.57]