Softmax Cross-Entropy and Logits
Originally written in November 2018.
Cross entropy
Let us start with cross entropy and try to understand what it is and how it works. Cross entropy is commonly used to quantify the difference between two probability distributions. When we develop a model for probabilistic classification, we aim to map the model's inputs to probabilistic predictions, and we often train our model by incrementally adjusting the model's parameters so that our predictions get closer and closer to the ground-truth probabilities.
For example, if we're interested in determining whether an image is best described as a landscape, as a house, or as something else, then our model might accept an image as input and produce three numbers as output, each representing the probability of a single class. During training, we might put in an image of a landscape, and we hope that our model produces predictions that are close to the ground-truth class probabilities y = (1.0, 0.0, 0.0)^T.
If our model predicts a different distribution, say ŷ = (0.4, 0.1, 0.5), then we'd like to nudge the parameters so that ŷ gets closer to y. But what exactly do we mean by "gets closer to"? In particular, how should we measure the difference between ŷ and y?
This is where cross entropy comes into the picture, and here is why it's a reasonable choice for the task of classification. If we think of a distribution as the tool we use to encode symbols, then entropy measures the number of bits we'll need if we use the correct tool y. This is optimal, in that we can't encode the symbols using fewer bits on average. The equation is given by

H(y) = - Σ_i y_i log(y_i)
Cross entropy, in turn, measures the number of bits we'll need on average if we encode symbols from y according to the wrong tool ŷ:

H(y, ŷ) = - Σ_i y_i log(ŷ_i)

Cross entropy is always larger than entropy; encoding symbols according to the wrong distribution ŷ will always make us use more bits. The only exception is the trivial case where y and ŷ are equal, in which case entropy and cross entropy are equal. The difference between the two, i.e. the number of extra bits we need on average, is known as the KL divergence; it's never negative, and it's 0 only when y and ŷ are the same.
And of course, we train by gradient descent: compute the gradient of the cross-entropy loss with respect to the model's parameters, update the parameters, and repeat until convergence.
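To make the entropy and cross-entropy quantities above concrete, here is a minimal NumPy sketch (the helper names entropy and cross_entropy are mine, and I use log base 2 so the numbers read as bits), reusing the example y = (1.0, 0.0, 0.0) and ŷ = (0.4, 0.1, 0.5) from above:

import numpy as np

y     = np.array([1.0, 0.0, 0.0])   # ground-truth distribution
y_hat = np.array([0.4, 0.1, 0.5])   # model's predicted distribution

def entropy(p):
    # H(p) = -sum_i p_i * log2(p_i), with the convention 0 * log(0) = 0
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log2(q_i): bits needed to encode p using the tool q
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

print(entropy(y))                             # -0.0  (zero: a one-hot distribution has no entropy)
print(cross_entropy(y, y_hat))                # ~1.32 (bits needed when encoding with the wrong tool ŷ)
print(cross_entropy(y, y_hat) - entropy(y))   # ~1.32 extra bits (the KL divergence)

Minimizing the cross entropy therefore pushes ŷ toward y, which is exactly the "nudge" described above.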
Softmax as an activation function for neural networks
In fact, convolutional neural networks have popularized softmax as an activation function, even though it is not a traditional one: other activation functions produce a single output for a single input, whereas softmax produces multiple outputs for an input array. For this reason, we can build neural network models that classify more than two classes, instead of being limited to a binary solution.
Applying the softmax function normalizes the outputs to the range [0, 1], and the sum of the outputs is always equal to 1 when softmax is applied. After that, applying one-hot encoding transforms the outputs into binary form. That's why softmax and one-hot encoding are applied, respectively, at the neural network's output layer. Finally, the true labelled output is compared with the predicted classification output. In this way, the cross-entropy function relates the probabilities to the one-hot encoded labels. (Fig and source: sefiks)
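As a small sketch of the one-hot encoding step in that figure (the one_hot helper below is mine, not from the sefiks post), assuming the three classes from the earlier image example with index 0 standing for "landscape":

import numpy as np

def one_hot(label, num_classes):
    # turn an integer class index into a binary one-hot vector
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

print(one_hot(0, 3))   # [1. 0. 0.]  -- the true label for a "landscape" image
print(one_hot(2, 3))   # [0. 0. 1.]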
Softmax function
The softmax function takes an N-dimensional vector of real numbers and transforms it into a vector of real numbers in the range (0, 1) which add up to 1. As the name suggests, softmax is a "soft" version of the max function. Instead of selecting one maximum value, it splits the whole (1) among all elements, with the maximal element getting the largest portion of the distribution, but the other, smaller elements getting some of it as well.
First of all, softmax normalizes the input array to the range [0, 1], and the sum of the softmax outputs is always equal to 1. So, the neural network model classifies an instance as the class that has the index of the maximum output. (Fig and source: sefiks)
Now, let us talk about how this works. As you know, the softmax function is as follows:

σ(x_i) = e^(x_i) / Σ_j e^(x_j)
For example, the following results are obtained when softmax is applied to the inputs x = (2, 1, 0.1):

1- σ(x_1) = e^(x_1) / (e^(x_1) + e^(x_2) + e^(x_3)) = e^2 / (e^2 + e^1 + e^0.1) ≈ 0.7

2- σ(x_2) = e^(x_2) / (e^(x_1) + e^(x_2) + e^(x_3)) = e^1 / (e^2 + e^1 + e^0.1) ≈ 0.2

3- σ(x_3) = e^(x_3) / (e^(x_1) + e^(x_2) + e^(x_3)) = e^0.1 / (e^2 + e^1 + e^0.1) ≈ 0.1
Notice that the inputs have been normalized to the range [0, 1], and the sum of the results is equal to 0.7 + 0.2 + 0.1 = 1. (Image courtesy: Udacity.org)
In this image, you can see how the softmax values (0.7, 0.2, 0.1) are compared, as a distribution, to the one-hot encoded label.
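As a quick check of the arithmetic above, here is a minimal NumPy sketch of the softmax function (subtracting the maximum before exponentiating is a standard numerical-stability trick, not something shown in the figures):

import numpy as np

def softmax(x):
    # exponentiate (after shifting by the max for numerical stability) and normalize
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([2.0, 1.0, 0.1])
print(softmax(x))              # ≈ [0.659 0.242 0.099] -- roughly the 0.7 / 0.2 / 0.1 above
print(softmax(x).sum())        # 1.0 (up to floating point)
print(np.argmax(softmax(x)))   # 0 -- the class with the maximum output is predicted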
So, softmax can be described as a type of activation function for neural networks which allows us to interpret the outputs as probabilities, while cross-entropy loss is what we use to measure the error at a softmax layer, and it is given by

L = - Σ_i y_i log(ŷ_i)

where ŷ is the vector of softmax outputs and y is the one-hot encoded label.
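Putting the two pieces together, a rough sketch of that loss might look like the following (the softmax helper is restated so the snippet stands alone, and the values are just the toy example x = (2, 1, 0.1) with class 0 as the true label):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy_loss(y, x):
    # L = -sum_i y_i * log(softmax(x)_i), with y a one-hot label and x the raw scores
    y_hat = softmax(x)
    return -np.sum(y * np.log(y_hat))

x = np.array([2.0, 1.0, 0.1])     # raw scores (logits) from the last layer
y = np.array([1.0, 0.0, 0.0])     # one-hot label for the true class
print(cross_entropy_loss(y, x))   # ≈ 0.417 (= -log(0.659))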
Logits
Logits simply means that the function operates on the unscaled output of earlier layers, and that the relative scale used to understand the units is linear. In particular, it means that the sum of the inputs may not equal 1 and that the values are not probabilities (you might have an input of 5).
tf.nn.softmax produces just the result of applying the softmax function to an input tensor. The softmax "squishes" the inputs so that the outputs sum to 1; it's a way of normalizing. The shape of the output of softmax is the same as that of the input: it just normalizes the values. The outputs of softmax can be interpreted as probabilities.
import numpy as np
import tensorflow as tf  # TF 1.x API, as in the original post
a = tf.constant(np.array([[.1, .3, .5, .9]]))
s = tf.Session()
print(s.run(tf.nn.softmax(a)))
# [[ 0.16838508  0.205666    0.25120102  0.37474789]]
If you add up the output values, they are equal to 1.
In contrast, tf.nn.softmax_cross_entropy_with_logits computes the cross entropy of the result after applying the softmax function (it performs both steps together, in a more numerically stable way). Logits are the pre-transform values in a layer, and they are not compared directly to the labels when calculating the cost function; softmax is applied to them first.
So, to sum it up: an input goes through a linear model -> a logit (score) is formed -> softmax probabilities are computed from the logits -> and lastly the probabilities are compared to the one-hot encoded labels using the cross-entropy function. This whole procedure is known as multinomial logistic classification. (Image courtesy: Udacity)
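To make the contrast concrete, here is a small sketch comparing the fused op with a manual softmax-then-cross-entropy computation (assuming the TF 1.x API used in the snippet above; the tensor values are just the toy example from earlier):

import tensorflow as tf  # TF 1.x, as in the earlier snippet

logits = tf.constant([[2.0, 1.0, 0.1]])   # raw scores from the linear model
labels = tf.constant([[1.0, 0.0, 0.0]])   # one-hot encoded ground truth

# Fused op: applies softmax and computes the cross entropy in one step.
fused = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)

# Manual version: softmax first, then -sum(y * log(y_hat)).
y_hat  = tf.nn.softmax(logits)
manual = -tf.reduce_sum(labels * tf.log(y_hat), axis=1)

with tf.Session() as s:
    print(s.run([fused, manual]))   # both ≈ [0.417]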
More learning material and some of my sources/citations:
https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/
https://stackoverflow.com/questions/34240703/what-is-logits-softmax-and-softmax-cross-entropy-with-logits
https://datascience.stackexchange.com/questions/20087/should-softmax-cross-entropy-with-logits-always-be-zero-if-logits-and-labels-are
https://deepnotes.io/softmax-crossentropy
https://sefiks.com/2017/12/17/a-gentle-introduction-to-cross-entropy-loss-function/
https://www.pyimagesearch.com/2016/09/12/softmax-classifiers-explained/
http://neuralnetworksanddeeplearning.com/chap3.html
https://medium.com/aidevnepal/for-sigmoid-funcion-f7a5da78fec2
http://cs231n.github.io/linear-classify/