Softmax Cross-Entropy and Logits

Originally written in November 2018.



Cross entropy


Let us start with cross entropy and try to understand what it is and how it works.


 Cross-entropy is commonly used to quantify the difference between two probability distributions.
Now, when we develop a model for probabilistic classification, we aim to map the model's inputs to probabilistic predictions, and we often train our model by incrementally adjusting the model's parameters so that our predictions get closer and closer to ground-truth probabilities.
For example, if we're interested in determining whether an image is best described as a landscape or as a house or as something else, then our model might accept an image as input and produce three numbers as output, each representing the probability of a single class.

During training, we might put in an image of a landscape, and we hope that our model produces predictions that are close to the ground-truth class probabilities y = (1.0, 0.0, 0.0)^T.

If our model predicts a different distribution, say ŷ = (0.4, 0.1, 0.5)^T, then we'd like to nudge the parameters so that ŷ gets closer to y.


But what exactly do we mean by "gets closer to"? In particular, how should we measure the difference between ŷ and y?

This is where cross entropy comes in, and it turns out to be a reasonable measure for the task of classification.


If we think of a distribution as the tool we use to encode symbols, then entropy measures the number of bits we'll need if we use the correct tool y. This is optimal, in that we can't encode the symbols using fewer bits on average. The equation is given by

H(y) = - Σᵢ yᵢ log(yᵢ)
Cross entropy, H(y, ŷ) = - Σᵢ yᵢ log(ŷᵢ), measures the number of bits we'll need if we instead encode symbols from y using the wrong tool ŷ. Cross entropy is always larger than entropy; encoding symbols according to the wrong distribution ŷ will always make us use more bits. The only exception is the trivial case where y and ŷ are equal, in which case entropy and cross entropy are equal.


The KL divergence from ŷ to y is simply the difference between cross entropy and entropy:

KL(y ‖ ŷ) = H(y, ŷ) - H(y)

It measures the number of extra bits we'll need on average if we encode symbols from y according to ŷ. It's never negative, and it's 0 only when y and ŷ are the same.

To train the model, we take the gradient of this loss with respect to the parameters, update the parameters (gradient descent), and of course repeat until convergence.
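As a quick sanity check, here is a minimal NumPy sketch (the function names entropy and cross_entropy are just illustrative, not from any library) that computes these quantities for the example distributions y = (1.0, 0.0, 0.0) and ŷ = (0.4, 0.1, 0.5) used above:

import numpy as np

def entropy(y, eps=1e-12):
    return -np.sum(y * np.log(y + eps))           # H(y)

def cross_entropy(y, y_hat, eps=1e-12):
    return -np.sum(y * np.log(y_hat + eps))       # H(y, y_hat)

y = np.array([1.0, 0.0, 0.0])                     # ground-truth (one-hot)
y_hat = np.array([0.4, 0.1, 0.5])                 # model prediction

print(entropy(y))                                 # ~0.0: y is certain
print(cross_entropy(y, y_hat))                    # ~0.916 (= -log 0.4)
print(cross_entropy(y, y_hat) - entropy(y))       # KL divergence, also ~0.916

Note that np.log is the natural logarithm, so these values are in nats rather than bits; the comparison between distributions is unaffected by the choice of base.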



Softmax as an activation function for neural networks



In fact, convolutional neural networks have popularized softmax as an activation function, but softmax is not quite a traditional activation function. Other activation functions produce a single output for a single input, whereas softmax produces multiple outputs for an input array. For this reason, we can build neural network models that classify more than two classes, rather than being limited to a binary solution.

Applying the softmax function normalizes the outputs to the scale [0, 1], and the sum of the outputs is always equal to 1 once softmax is applied. One-hot encoding, in turn, transforms the true labels into binary form. That is why softmax and one-hot encoding are applied at the neural network's output layer: the true labelled output is compared against the predicted classification output, and the cross-entropy function relates the predicted probabilities to the one-hot encoded labels. (Fig and source: sefiks)
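As a tiny illustration (the variable names here are made up for this sketch), one-hot encoding a label and reading off the predicted class from softmax probabilities look like this:

import numpy as np

num_classes = 3
true_class = 0
y_one_hot = np.eye(num_classes)[true_class]     # [1. 0. 0.]
probs = np.array([0.7, 0.2, 0.1])               # softmax output of the network
predicted_class = int(np.argmax(probs))         # 0, the most probable class
print(y_one_hot, predicted_class)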





Softmax function

The softmax function takes an N-dimensional vector of real numbers and transforms it into a vector of real numbers in the range (0, 1) that add up to 1.

As the name suggests, the softmax function is a "soft" version of the max function. Instead of selecting the one maximum value, it splits the whole (1) among all the elements, with the maximal element getting the largest portion of the distribution and the smaller elements getting some of it as well.




First of all, softmax normalizes the input array to the scale [0, 1], and the sum of the softmax outputs is always equal to 1. So, the neural network model classifies the instance as the class corresponding to the index of the maximum output. (Fig and source: sefiks)




Now, let us talk about how this works. The softmax function is given by

σ(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)
For example, for the input scores x = (2, 1, 0.1), the following results are retrieved when softmax is applied:
1- σ(x₁) = e^(x₁) / (e^(x₁) + e^(x₂) + e^(x₃)) = e^2 / (e^2 + e^1 + e^0.1) ≈ 0.7
2- σ(x₂) = e^(x₂) / (e^(x₁) + e^(x₂) + e^(x₃)) = e^1 / (e^2 + e^1 + e^0.1) ≈ 0.2
3- σ(x₃) = e^(x₃) / (e^(x₁) + e^(x₂) + e^(x₃)) = e^0.1 / (e^2 + e^1 + e^0.1) ≈ 0.1
Note that the outputs are normalized to [0, 1], and the sum of the results is 0.7 + 0.2 + 0.1 = 1. (Image courtesy: Udacity)
In the image, you can see how the softmax distribution (0.7, 0.2, 0.1) is compared against the one-hot encoded label.
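As a minimal sketch (the softmax function below is written by hand, not taken from any particular library), the same numbers can be reproduced in NumPy:

import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x))          # subtract the max for numerical stability
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
print(np.round(softmax(scores), 1))       # [0.7 0.2 0.1]
print(softmax(scores).sum())              # 1.0 (up to floating-point rounding)

Subtracting the maximum before exponentiating does not change the result, but it avoids overflow when the scores are large.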





So, softmax can be seen as a type of activation function for neural networks which allows us to interpret the outputs as probabilities, while cross-entropy loss is what we use to measure the error at a softmax layer. It is given by

L = - Σᵢ yᵢ log(ŷᵢ)

where y is the one-hot encoded label and ŷ is the vector of softmax outputs.

Logits

Logits simply means that the function operates on the unscaled output of earlier layers and that the relative scale used to understand the units is linear. In particular, it means the sum of the inputs may not equal 1 and the values are not probabilities (you might have an input of 5).

tf.nn.softmax produces just the result of applying the softmax function to an input tensor. The softmax "squishes" the inputs so that the outputs sum to 1; it's a way of normalizing. The shape of the output of softmax is the same as the input: it just normalizes the values. The outputs of softmax can be interpreted as probabilities.

import numpy as np
import tensorflow as tf   # TF 1.x-style API
a = tf.constant(np.array([[.1, .3, .5, .9]]))
with tf.Session() as s:
    print(s.run(tf.nn.softmax(a)))
# [[ 0.16838508  0.205666    0.25120102  0.37474789]]

If you add up the output values, the sum is equal to 1.
In contrast, tf.nn.softmax_cross_entropy_with_logits computes the cross entropy of the result after applying the softmax function.
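As a small sketch under the same TF 1.x-style assumptions as the snippet above (the scores are just the earlier example values), the fused op should agree with computing softmax and cross entropy by hand:

import tensorflow as tf   # TF 1.x-style API, as above

logits = tf.constant([[2.0, 1.0, 0.1]])
labels = tf.constant([[1.0, 0.0, 0.0]])     # one-hot ground truth
manual = -tf.reduce_sum(labels * tf.log(tf.nn.softmax(logits)), axis=1)
fused = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
with tf.Session() as s:
    print(s.run(manual), s.run(fused))      # both ~ [0.417]

The fused version is also more numerically stable than applying softmax and log separately, which is why it is generally preferred in practice.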

Logits are the pre-transform values in a layer, and are not compared directly to the labels when calculating the cost function.
So, to sum it up: an input goes through a linear model -> a logit (score) is formed -> softmax probabilities are computed from the logits -> and lastly the probabilities are compared to the one-hot encoded labels using the cross-entropy function. This whole procedure is known as multinomial logistic classification. (Image courtesy: Udacity)
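As a rough end-to-end sketch of that pipeline (the weights W, bias b, and input x below are made-up numbers, not from any trained model):

import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

x = np.array([1.0, 2.0])                                # input features (hypothetical)
W = np.array([[0.5, 0.75], [0.2, 0.3], [0.0, 0.05]])    # 3 classes x 2 features
b = np.array([0.0, 0.3, 0.0])

logits = W.dot(x) + b            # linear model -> unscaled scores (logits)
probs = softmax(logits)          # logits -> probabilities
y = np.array([1.0, 0.0, 0.0])    # one-hot encoded label
loss = -np.sum(y * np.log(probs))    # cross-entropy loss
print(logits, probs, loss)

In a real framework, the gradient of this loss with respect to W and b would then drive the parameter updates, repeating until convergence.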






More Learning material and some of my sources/citations:

https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/
https://stackoverflow.com/questions/34240703/what-is-logits-softmax-and-softmax-cross-entropy-with-logits
https://datascience.stackexchange.com/questions/20087/should-softmax-cross-entropy-with-logits-always-be-zero-if-logits-and-labels-are
https://deepnotes.io/softmax-crossentropy
https://sefiks.com/2017/12/17/a-gentle-introduction-to-cross-entropy-loss-function/
https://www.pyimagesearch.com/2016/09/12/softmax-classifiers-explained/
http://neuralnetworksanddeeplearning.com/chap3.html
https://medium.com/aidevnepal/for-sigmoid-funcion-f7a5da78fec2
http://cs231n.github.io/linear-classify/
