The aim of this repository is to collect all the common loss functions in one place for reference. Each loss function comes with a short explanation followed by its mathematical formula.
- The mean squared error (MSE) tells you how close a regression line is to a set of points.
- It does this by taking the distances from the points to the regression line (these distances are the “errors”) and squaring them.
- The squaring is necessary to remove any negative signs.
- It also gives more weight to larger differences. It’s called the mean squared error as you’re finding the average of a set of errors.
- The lower the MSE, the better the forecast.
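Below is a minimal NumPy sketch of MSE; the function name and the toy values are illustrative, not taken from any particular library.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# Example: a forecast that is off by 1, 0 and 2 units.
print(mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0]))  # (1 + 0 + 4) / 3 ≈ 1.667
```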
- MSLE will treat small differences between small true and predicted values approximately the same as big differences between large true and predicted values
- Use MSLE when doing regression and you believe your target, conditioned on the input, is log-normally distributed (i.e. its logarithm is normally distributed), and you don’t want large errors to be significantly more penalized than small ones, as in cases where the range of the target value is large.
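A hedged NumPy sketch of MSLE follows; using log1p so that zero targets are handled is an assumed (though common) convention, not something stated above.

```python
import numpy as np

def mean_squared_log_error(y_true, y_pred):
    """Mean of squared differences between log(1 + y_true) and log(1 + y_pred)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

# Small vs. large targets: the relative error matters, not the absolute one.
print(mean_squared_log_error([10.0], [15.0]))      # ~0.14
print(mean_squared_log_error([1000.0], [1500.0]))  # ~0.16 (similar scale of penalty)
```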
- Absolute error is the magnitude of the difference between a measured (or predicted) value and the true value.
- The Mean Absolute Error (MAE) is the average of all absolute errors.
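A minimal NumPy sketch of MAE (names and values are illustrative):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred))

print(mean_absolute_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0]))  # (1 + 0 + 2) / 3 = 1.0
```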
- The mean absolute percentage error (MAPE) is a measure of how accurate a forecast system is.
- It measures this accuracy as a percentage, computed as the average of the absolute percentage errors:
M = (1/n) * Σ |A_t − F_t| / |A_t|, summed over t = 1 … n (often multiplied by 100 to express it as a percentage), where:
- M = mean absolute percentage error
- n = number of summation terms (forecast points)
- A_t = actual value
- F_t = forecast value
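A small NumPy sketch of MAPE; the zero-actual caveat is an assumption of this sketch, not part of the definition above.

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    """Average of |actual - forecast| / |actual|, expressed as a percentage.
    Assumes no actual value is zero (division by zero otherwise)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

print(mean_absolute_percentage_error([100.0, 200.0], [110.0, 180.0]))  # (10% + 10%) / 2 = 10.0
```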
We will have 2 classes, positive and negative. In order to measure how good our predicted probabilities are, the loss should return high values for bad predictions and low values for good predictions. The formula for Binary Cross Entropy / Log Loss is given below.
Taking the negative log heavily penalizes the prediction when the predicted probability of the positive class is too low.
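A minimal NumPy sketch of binary cross-entropy; the clipping epsilon is an implementation detail assumed here to avoid log(0).

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """-mean( y*log(p) + (1-y)*log(1-p) ), with probabilities clipped away from 0 and 1."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# A confident correct prediction gives a small loss, a confident wrong one a large loss.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # ~0.105
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # ~2.303
```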
The Kullback-Leibler Divergence, or “KL Divergence” for short, is a measure of the dissimilarity between two distributions.
The closer p(y) gets to q(y), the lower the divergence, so we need to find a good p(y) to use. Since the entropy of q(y) does not depend on p(y), the best possible p(y) is also the one that minimizes the cross-entropy.
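A small NumPy sketch of the discrete KL divergence between a reference distribution q(y) and an approximation p(y); the function name and toy distributions are illustrative.

```python
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    """KL(q || p) = sum_i q_i * log(q_i / p_i) for two discrete distributions."""
    q = np.asarray(q, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    return np.sum(np.where(q > 0, q * np.log(q / p), 0.0))

true_dist = [0.7, 0.2, 0.1]
print(kl_divergence(true_dist, [0.7, 0.2, 0.1]))  # 0.0 (identical distributions)
print(kl_divergence(true_dist, [0.1, 0.2, 0.7]))  # ~1.17 (poor approximation)
```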
The cross entropy loss can be given as below.
Where ti and si are the ground truth and the CNN score for each class i in C. Since an activation function (Sigmoid / Softmax) is usually applied to the scores before the CE loss computation, we write f(si) to refer to the activations.
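A minimal sketch of this formula, assuming a softmax activation f over the raw scores s; the variable names mirror the text above, not any specific framework.

```python
import numpy as np

def softmax(s):
    """f(s_i) = exp(s_i) / sum_j exp(s_j), computed in a numerically stable way."""
    e = np.exp(s - np.max(s))
    return e / np.sum(e)

def cross_entropy(t, s):
    """CE = -sum_i t_i * log(f(s_i)) over the C classes."""
    f = softmax(np.asarray(s, dtype=float))
    return -np.sum(np.asarray(t, dtype=float) * np.log(f))

# Ground truth: class 1 out of C = 3 classes; s are the raw scores.
print(cross_entropy([0, 1, 0], [1.0, 3.0, 0.2]))  # ~0.18
```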
The only difference between sparse categorical cross entropy and categorical cross entropy is the format of the true labels. In a single-label, multi-class classification problem, the labels are mutually exclusive, meaning each data entry can belong to only one class. We can then represent y_true either as integer class indices (sparse) or as one-hot embeddings (categorical).
For example, y_true with 3 samples belonging to classes 2, 0, and 2 is [2, 0, 2] in the sparse (integer) format. In the one-hot format it becomes:
[[0, 0, 1], [1, 0, 0], [0, 0, 1]]
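A small NumPy sketch showing that the two label formats lead to the same loss value; the helper names are illustrative.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Convert integer class labels to one-hot vectors."""
    return np.eye(num_classes)[labels]

y_true_sparse = np.array([2, 0, 2])          # sparse / integer labels
y_true_onehot = one_hot(y_true_sparse, 3)    # [[0,0,1],[1,0,0],[0,0,1]]

y_pred = np.array([[0.1, 0.2, 0.7],
                   [0.8, 0.1, 0.1],
                   [0.2, 0.2, 0.6]])         # predicted probabilities per sample

# Categorical CE uses the one-hot labels; sparse categorical CE indexes directly.
cce = -np.mean(np.sum(y_true_onehot * np.log(y_pred), axis=1))
scce = -np.mean(np.log(y_pred[np.arange(3), y_true_sparse]))
print(cce, scce)  # identical values (~0.36)
```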
Also called Softmax Loss. It is a Softmax activation plus a Cross-Entropy loss. If we use this loss, we will train a CNN to output a probability over the C classes for each image. It is used for multi-class classification.
In the specific (and usual) case of multi-class classification the labels are one-hot, so only the positive class Cp keeps its term in the loss: there is only one element of the target vector t which is not zero, ti = tp. So, discarding the elements of the summation which are zero due to the target labels, we can write:
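With one-hot targets the sum collapses to a single term, so the loss is simply the negative log of the softmax probability of the positive class Cp. A minimal sketch of that simplification (names are illustrative):

```python
import numpy as np

def softmax_loss(scores, positive_class):
    """CE with one-hot targets: -log( exp(s_p) / sum_j exp(s_j) )."""
    s = np.asarray(scores, dtype=float)
    s = s - np.max(s)                        # numerical stability
    log_probs = s - np.log(np.sum(np.exp(s)))
    return -log_probs[positive_class]

print(softmax_loss([2.0, 0.5, -1.0], positive_class=0))  # ~0.24
```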
Huber loss is a superb combination of linear as well as quadratic scoring methods. It has an additional hyperparameter delta. Loss is linear for values above delta and quadratic below delta. This parameter is tunable according to your data, which makes the Huber loss special.
The following figure shows the change in Huber loss for different values of the delta against error.
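A hedged NumPy sketch of the Huber loss with a tunable delta; function and argument names are illustrative.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear (with matching slope) above delta."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    small = np.abs(error) <= delta
    quadratic = 0.5 * error ** 2
    linear = delta * (np.abs(error) - 0.5 * delta)
    return np.mean(np.where(small, quadratic, linear))

print(huber_loss([0.0, 0.0], [0.5, 4.0], delta=1.0))  # mean(0.125, 3.5) = 1.8125
```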
The x-axis represents the distance from the boundary of any single instance, and the y-axis represents the loss size, or penalty, that the function will incur depending on its distance.
Hinge loss is actually quite simple to compute. The formula for hinge loss is given by the following:
With l referring to the loss of any given instance, y[i] and x[i] referring to the i-th instance in the training set, and b referring to the bias term.
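A minimal sketch of the hinge loss in the SVM form suggested by the legend above; the linear score w·x[i] + b and labels in {-1, +1} are assumptions of this sketch, not something stated in the formula.

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """l_i = max(0, 1 - y_i * (w . x_i + b)), averaged over the training set."""
    X, y, w = np.asarray(X, dtype=float), np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    margins = y * (X @ w + b)
    return np.mean(np.maximum(0.0, 1.0 - margins))

X = [[2.0, 1.0], [-1.0, -2.0]]
y = [1, -1]                               # labels in {-1, +1}
print(hinge_loss([1.0, 0.0], 0.0, X, y))  # margins 2 and 1 -> both losses 0, mean 0.0
```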
Suppose that you need to draw a very fine decision boundary. In that case, you wish to punish larger errors more significantly than smaller errors. Squared hinge loss may then be what you are looking for, especially when you already considered the hinge loss function for your machine learning problem.
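Squared hinge simply squares each hinge term, so larger margin violations are penalized disproportionately more; a minimal variant of the sketch above, under the same assumptions.

```python
import numpy as np

def squared_hinge_loss(w, b, X, y):
    """l_i = max(0, 1 - y_i * (w . x_i + b))^2, averaged over the training set."""
    margins = np.asarray(y, dtype=float) * (np.asarray(X, dtype=float) @ np.asarray(w, dtype=float) + b)
    return np.mean(np.maximum(0.0, 1.0 - margins) ** 2)

X, y = [[2.0, 1.0], [-1.0, -2.0]], [1, -1]
print(squared_hinge_loss([0.2, 0.0], 0.0, X, y))  # hinge terms 0.6 and 0.8 -> mean of squares = 0.5
```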
In this setup, positive and negative pairs of training data points are used. Positive pairs are composed of an anchor sample xa and a positive sample xp, which is similar to xa in the metric we aim to learn, and negative pairs are composed of an anchor sample xa and a negative sample xn, which is dissimilar to xa in that metric.
Pairwise Ranking Loss forces representations to have 0 distance for positive pairs, and a distance greater than a margin for negative pairs. With ra, rp and rn being the sample representations and d a distance function, we can write:
For negative pairs, the loss will be 0 when the distance between the representations of the two pair elements is greater than the margin m. But when that distance is not bigger than m, the loss will be positive, and the net parameters will be updated to produce more distant representations for those two elements. The loss value will be at most m, when the distance between ra and rn is 0. The purpose of the margin is that, when the representations produced for a negative pair are distant enough, no effort is wasted on enlarging that distance, so further training can focus on more difficult pairs.
If r0 and r1 are the representations of the pair elements, y is a binary flag equal to 0 for a negative pair and to 1 for a positive pair, and the distance d is the euclidean distance, we can equivalently write:
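A hedged NumPy sketch of this pairwise formulation, using the euclidean distance as d and the binary flag y described above; function and variable names are illustrative.

```python
import numpy as np

def pairwise_ranking_loss(r0, r1, y, margin=1.0):
    """y = 1: pull the representations together (loss = d);
       y = 0: push them at least `margin` apart (loss = max(0, margin - d))."""
    d = np.linalg.norm(np.asarray(r0, dtype=float) - np.asarray(r1, dtype=float))
    return y * d + (1 - y) * max(0.0, margin - d)

anchor, positive, negative = [0.1, 0.2], [0.15, 0.1], [0.9, 0.8]
print(pairwise_ranking_loss(anchor, positive, y=1))  # ~0.11: positive pair, distance should shrink toward 0
print(pairwise_ranking_loss(anchor, negative, y=0))  # 0.0: negative pair already at least `margin` apart
```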
This setup outperforms the former by using triplets of training data samples instead of pairs. The triplets are formed by an anchor sample xa, a positive sample xp, and a negative sample xn. The objective is that the distance between the anchor and negative representations, d(ra,rn), is greater than the distance between the anchor and positive representations, d(ra,rp), by at least a margin m.
The 3 situations to analyze:
- Easy Triplets: d(ra,rn) > d(ra,rp) + m. The negative sample is already sufficiently distant from the anchor sample with respect to the positive sample in the embedding space. The loss is 0 and the net parameters are not updated.
- Hard Triplets: d(ra,rn) < d(ra,rp). The negative sample is closer to the anchor than the positive. The loss is positive (and greater than m).
- Semi-Hard Triplets: d(ra,rp) < d(ra,rn) < d(ra,rp) + m. The negative sample is more distant from the anchor than the positive, but the distance is not greater than the margin, so the loss is still positive (and smaller than m).
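A minimal NumPy sketch of the triplet loss, max(0, m + d(ra,rp) - d(ra,rn)), which yields 0 for easy triplets and positive values for hard and semi-hard ones; the toy representations are illustrative.

```python
import numpy as np

def triplet_loss(r_anchor, r_positive, r_negative, margin=1.0):
    """max(0, margin + d(ra, rp) - d(ra, rn)) with euclidean distance d."""
    ra = np.asarray(r_anchor, dtype=float)
    d_pos = np.linalg.norm(ra - np.asarray(r_positive, dtype=float))
    d_neg = np.linalg.norm(ra - np.asarray(r_negative, dtype=float))
    return max(0.0, margin + d_pos - d_neg)

anchor = [0.0, 0.0]
print(triplet_loss(anchor, [0.1, 0.0], [3.0, 0.0]))  # 0.0 -> easy triplet
print(triplet_loss(anchor, [1.0, 0.0], [0.2, 0.0]))  # 1.8 -> hard triplet (loss > m)
print(triplet_loss(anchor, [1.0, 0.0], [1.5, 0.0]))  # 0.5 -> semi-hard triplet (0 < loss < m)
```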
- Ranking loss: This name comes from the information retrieval field, where we want to train models to rank items in a specific order.
- Margin Loss: This name comes from the fact that these losses use a margin to compare sample representation distances.
- Contrastive Loss: Contrastive refers to the fact that these losses are computed by contrasting two or more data point representations. This name is often used for Pairwise Ranking Loss, but I’ve never seen it used in a setup with triplets.
- Triplet Loss: Often used as the loss name when triplet training pairs are employed.
- Hinge loss: Also known as max-margin objective. It’s used for training SVMs for classification. It has a similar formulation in the sense that it optimizes until a margin. That’s why this name is sometimes used for Ranking Losses.
Center loss helps in discriminating between features. The key is to minimize the intra-class variations while keeping the features of different classes separable; the center loss function was proposed for this purpose.
For this loss, you define a per-class center which serves as the centroid of the embeddings corresponding to that class. The gradient update is done over the mini-batch, and a hyperparameter alpha controls the learning rate of the centers. The update is given by:
Another scalar, lambda, is used to balance the two loss functions (the softmax loss and the center loss). The total loss is:
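A hedged NumPy sketch of the idea: each class keeps a center, the center loss is half the mean squared distance of each embedding to its class center, and centers are nudged after each mini-batch with learning rate alpha. This is a simplified version of the update rule, and all function and variable names are illustrative.

```python
import numpy as np

def center_loss(embeddings, labels, centers):
    """0.5 * mean ||x_i - c_{y_i}||^2 over the mini-batch."""
    diffs = embeddings - centers[labels]
    return 0.5 * np.mean(np.sum(diffs ** 2, axis=1))

def update_centers(embeddings, labels, centers, alpha=0.5):
    """Move each class center toward the mean of its mini-batch embeddings (simplified rule)."""
    new_centers = centers.copy()
    for c in np.unique(labels):
        members = embeddings[labels == c]
        new_centers[c] += alpha * (members.mean(axis=0) - centers[c])
    return new_centers

embeddings = np.array([[1.0, 1.0], [1.2, 0.8], [-1.0, -1.0]])
labels = np.array([0, 0, 1])
centers = np.zeros((2, 2))                       # one center per class
lam = 0.1                                        # balances softmax loss and center loss

print(center_loss(embeddings, labels, centers))  # ~1.01
centers = update_centers(embeddings, labels, centers)
# Total objective (sketch): total_loss = softmax_loss + lam * center_loss (softmax part omitted here)
```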
Last Updated: Sep-12



