Table of Contents
- Linear
- Sigmoid
- Hyperbolic Tangent
- Rectified Linear Unit (ReLU)
- Leaky ReLU
- Softmax
- Softplus
Sigmoid
It squashes values into the range (0, 1). It is applied independently to each element of the input vector.
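For reference, the logistic sigmoid itself is
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$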
Gradient
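A well-known property (presumably what this refers to) is that the sigmoid's derivative can be written in terms of the sigmoid itself:
$$
\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
$$
It approaches 0 when $\sigma(x)$ saturates near 0 or 1, which can slow gradient-based learning.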
Usage
- Binary Classification
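In binary classification the sigmoid output is typically read as the parameter of a Bernoulli distribution (the case that the softmax approach below generalizes); using notation that parallels the softmax linear layer further down, with a weight vector $\vec{w}$, bias $b$, and hidden activations $\vec{h}$:
$$
\hat{y} = P(y = 1 | \vec{x}) = \sigma(\vec{w}^T \vec{h} + b)
$$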
Hyperbolic Tangent
- Tanh is just a rescaled and shifted sigmoid (see the relation below)
- Tanh often performs well for deep nets
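Concretely, the rescaling is
$$
\tanh(x) = 2\,\sigma(2x) - 1
$$
so tanh is zero-centered with outputs in (-1, 1), which is often cited as a reason it trains better than the sigmoid in deep nets.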
A "softened" version of the arg max. A generalization of the sigmoid function. An exponential follow by normalization.
- Soft: continuous and differentiable
- Max: arg max (its result, represented as a one-hot vector, is not continuous or differentiable)
Purpose: To represent a probability distribution over a discrete variable with n possible values (over n different classes)
Requirements:
- Each element $\hat{y}_i$ must be between 0 and 1
- The entire vector must sum to 1 (so that it represents a valid probability distribution)
Approach: (the same approach that worked for the Bernoulli distribution generalizes to the multinoulli distribution)
- A linear layer predicts unnormalized log probabilities (predicting log probabilities keeps the model well-behaved for gradient-based optimization):
$$
\vec{z} = W^T \vec{h} + \vec{b}
$$
where $z_i = \log \tilde{P}(y = i | \vec{x})$
- Exponentiate and normalize $\vec{z}$ to obtain the desired $\hat{y}$:
$$
\operatorname{softmax}(\vec{z})_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}
$$
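A minimal NumPy sketch of these two steps (the weight, bias, and hidden-activation values are illustrative; subtracting the max is a standard numerical-stability trick, valid because softmax is shift-invariant):

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize the unnormalized log probabilities z."""
    z = z - np.max(z)            # shift-invariant; avoids overflow in exp
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))      # hypothetical weights: 4 hidden units, 3 classes
b = np.zeros(3)
h = rng.normal(size=4)           # hypothetical hidden activations
z = W.T @ h + b                  # linear layer: unnormalized log probabilities
y_hat = softmax(z)
print(y_hat, y_hat.sum())        # each entry in (0, 1), sum equals 1
```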
Derivatives:
$$
\frac{\partial \operatorname{softmax}(\vec{z})_i}{\partial z_j} = \operatorname{softmax}(\vec{z})_i \left( \delta_{ij} - \operatorname{softmax}(\vec{z})_j \right)
$$
where $\delta_{ij}$ is the Kronecker delta
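A small sketch that checks this Jacobian formula against central finite differences (the test point and tolerances are arbitrary choices):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(z):
    """Analytic Jacobian: J[i, j] = softmax(z)_i * (delta_ij - softmax(z)_j)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([0.5, -1.0, 2.0])
eps = 1e-6
# Column j holds the numerical derivative of softmax(z) with respect to z_j
numeric = np.column_stack([
    (softmax(z + eps * np.eye(3)[j]) - softmax(z - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
print(np.allclose(softmax_jacobian(z), numeric, atol=1e-6))  # expect True
```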
Usage
- Multi-class Classification
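In multi-class classification the softmax output is typically paired with the negative log-likelihood loss; the log undoes the exponentiation, which is part of why predicting unnormalized log probabilities is convenient (cf. Ch. 6.2.2.3):
$$
-\log \operatorname{softmax}(\vec{z})_y = -z_y + \log \sum_j \exp(z_j)
$$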
Deep Learning
- Ch 6.2.2.3 Softmax Units for Multinoulli Output Distributions