What is the softmax activation function?

The softmax activation function is used in the output layer of deep learning models to convert the layer's raw outputs into a probability distribution over multiple classes. It is used in models for multiclass classification, such as determining whether an image fed into the model shows a cat or a dog.

Mathematical expression

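The mathematical expression for the softmax activation function is:

$$\text{softmax}(y_i) = \frac{e^{y_i}}{\sum_{j=1}^{k} e^{y_j}}$$

Here, y_i is the logit for class i, and k is the total number of classes.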

To understand how the softmax function works, we will break down the equation into two parts:

The denominator
The numerator

The denominator

Here, we find the normalization factor by summing over all the classes pre-defined in our classification model. The value of k in the summation ( Σ ) equals the total number of classes.

We take the exponential ( e ) of each individual logit ( y_j ) and sum the results to get the normalization factor, which we then use to find the probabilities. A logit is a raw prediction value of the neural network model before it is normalized into a probability with the help of an activation function.
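
The denominator step can be sketched in C++ as shown below. This is a minimal illustration, not part of the original example: the helper name normalizationFactor is hypothetical, and it assumes the logits arrive as a vector of type double.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: returns the softmax normalization factor,
// i.e. the sum of e^(y_j) over all k logits.
double normalizationFactor(const std::vector<double>& logits) {
    assert(!logits.empty());  // at least one class is required
    double sum = 0.0;
    for (double y : logits) {
        sum += std::exp(y);   // exponential of one logit
    }
    return sum;
}
```

For the logits 0.25, 1.23, and -0.8, calling normalizationFactor({0.25, 1.23, -0.8}) should return roughly 5.15.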

The numerator

In the numerator, we take the individual logit ( y_i ) whose probability we want to calculate. We raise e to the power of that logit and divide the result by the normalization factor from the denominator to get the probability for that specific logit.

To check that the probabilities we calculated form a valid probability distribution over all the pre-defined classes in our model, we can sum them all together; the sum should equal 1.
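
As a worked illustration, the arithmetic below applies both parts of the equation to the logits 0.25, 1.23, and -0.8, the same values used as input in the code example later in this article. All values are rounded to four decimal places, so the last digit may differ slightly from a program's output.

```latex
\begin{aligned}
e^{0.25} &\approx 1.2840, \quad e^{1.23} \approx 3.4212, \quad e^{-0.8} \approx 0.4493 \\
\sum_{j=1}^{3} e^{y_j} &\approx 5.1546 \\
\text{softmax}(y_1) &\approx \frac{1.2840}{5.1546} \approx 0.2491 \\
\text{softmax}(y_2) &\approx \frac{3.4212}{5.1546} \approx 0.6637 \\
\text{softmax}(y_3) &\approx \frac{0.4493}{5.1546} \approx 0.0872 \\
0.2491 + 0.6637 + 0.0872 &= 1.0000
\end{aligned}
```

The largest logit, 1.23, receives the largest probability, and the three probabilities sum to 1 as expected.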

Softmax internal working

Now that we have looked at the mathematical equation for the softmax function, let's understand how it works in a neural network model with the help of a diagram.

In the diagram, we use an image classification deep learning model which classifies an image into three classes. We pass the input image to the model, which consists of numerous layers; the output layer of the model produces the logit values.

The logits are passed on to the softmax activation function, which maps each logit to the probability that the image belongs to a certain class. It does this by dividing the exponential of each logit by the sum of the exponentials of all the logits.

The class with the highest probability is the final prediction of the model. In the diagram, the image therefore belongs to the second class, with a probability score of 0.66.

Note: The number of logits and probabilities would be equal to the total number of the different classes that we want to predict.
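
The final step, choosing the class with the highest probability, can be sketched as follows. The helper name predictedClass is hypothetical, and the sketch assumes the probabilities have already been computed by softmax and stored in a vector of type double.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: returns the zero-based index of the class
// with the largest probability.
std::size_t predictedClass(const std::vector<double>& probabilities) {
    assert(!probabilities.empty());  // at least one class is required
    auto largest = std::max_element(probabilities.begin(), probabilities.end());
    return static_cast<std::size_t>(std::distance(probabilities.begin(), largest));
}
```

With illustrative probabilities of 0.25, 0.66, and 0.09, predictedClass returns index 1, which corresponds to the second class because the index is zero-based.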

Code example

To understand how the softmax activation function works internally, we can look at the C++ code below, where the input vector holds the logits from the output layer that are passed to the softmax() function.

The softmax() function returns a vector of the probabilities calculated from the logits, which we store in the variable output.

#include <iostream>
#include <cmath>
#include <vector>
using namespace std;
void printVector(const vector<double> &x){
    for (size_t i = 0; i < x.size(); ++i) {
        cout << x[i] << " ";
    }
    cout << endl << endl;
}
vector<double> softmax(const vector<double> &input) {
    vector<double> softmaxOutput;
    vector<double> exponents;
    double denominator = 0; // normalization factor: sum of all exponentials
    for (size_t i = 0; i < input.size(); ++i) {
        exponents.push_back(exp(input[i]));
        denominator += exponents[i]; // reuse the value computed above
    }
    for (size_t i = 0; i < input.size(); ++i) {
        softmaxOutput.push_back(exponents[i] / denominator);
    }
    return softmaxOutput;
}
int main(){
    vector<double> input = {0.25, 1.23, -0.8};
    cout << "\nInput Logits: ";
    printVector(input);
    vector<double> output = softmax(input);
    
    cout << "\nSoftmax Output: ";
    printVector(output);
    double probabilitySum = 0;
    for (size_t i = 0; i < output.size(); ++i) {
        probabilitySum += output[i];
    }
    cout << "\nSum of the outputs: " << probabilitySum << endl << endl;
    return 0;
}

Code explanation

Lines 1–3: Include the required headers; cmath provides the exp() function used for the exponential operation.
Lines 5–10: We define a function printVector that takes in a vector of type double and prints its contents using a for loop.
Lines 11–23: Here, we define the softmax function that takes a vector of type double as an argument and returns another vector of type double that holds the probabilities.
Lines 15–18: We iterate over the logits contained in the input vector using a for loop. For each logit, we calculate its exponential using the exp() function, add it to the vector exponents using the push_back() function, and add it to the variable named denominator, which holds the sum of all the exponentials.
Lines 19–21: Finally, we loop over the exponents vector and find the probability for each logit by dividing its exponential by the denominator variable.
Lines 32–35: Once we have run the function and retrieved the softmax output, we sum all the probability values and store the sum in the variable probabilitySum to check that it equals 1.
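
One practical caveat with the code above is that exp() can overflow for very large logits. A common workaround, sketched below, is to subtract the largest logit from every logit before exponentiating, which leaves the resulting probabilities mathematically unchanged. This is not part of the original example; the function name stableSoftmax is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of a numerically stable softmax variant (hypothetical helper).
// Subtracting the largest logit keeps every exponent at or below zero,
// so exp() cannot overflow, and the probabilities stay the same.
std::vector<double> stableSoftmax(const std::vector<double>& logits) {
    assert(!logits.empty());  // at least one class is required
    const double maxLogit = *std::max_element(logits.begin(), logits.end());

    std::vector<double> probabilities;
    probabilities.reserve(logits.size());
    double denominator = 0.0;
    for (double y : logits) {
        const double shifted = std::exp(y - maxLogit);  // value in (0, 1]
        probabilities.push_back(shifted);
        denominator += shifted;
    }
    for (double& p : probabilities) {
        p /= denominator;  // normalize so the values sum to 1
    }
    return probabilities;
}
```

For moderate logits such as 0.25, 1.23, and -0.8, this variant should produce the same probabilities as the softmax() function above. The difference only shows for inputs such as 1000 and 1001, where exponentiating the raw logits would overflow a double.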

Conclusion

The softmax activation function is used in multiclass classification problems to convert the logits produced by the output layer into probabilities. Because it exponentiates the logits, it amplifies the differences between them, and the logit with the greatest value receives the highest probability, which becomes the model's class prediction.