LossFunctions

The aim of this repository is to gather all the loss functions in one place for reference. Each loss function has a short explanation followed by its mathematical formula.

Contents
1. Mean Square Error
2. Mean Square Logarithmic Error
3. Mean Absolute Error
4. Mean Absolute Percentage Error
5. Binary Cross Entropy Loss/Log Loss
6. KL Divergence
7. Cross-Entropy Loss/Logistic Loss/Multinomial Logistic Loss
8. Sparse multi-class loss function
9. Categorical Cross Entropy Loss
10. Huber loss
11. Hinge loss
12. Squared hinge loss
13. Pairwise Ranking Loss
14. Triplet loss
15. Ranking Losses with Different Names
16. Center loss
17. Exponential loss
18. Taylor Cross Entropy
19. Symmetric Cross Entropy
20. Bi-Tempered Logistic Loss
21. Bi-Tempered Loss
22. Class Balanced Loss
23. Focal Cosine Loss
24. Focal Loss
25. Label Smoothing Loss
26. Face loss function

Mean Square Error

  • The mean squared error (MSE) tells you how close a regression line is to a set of points.
  • It does this by taking the distances from the points to the regression line (these distances are the “errors”) and squaring them.
  • The squaring is necessary to remove any negative signs.
  • It also gives more weight to larger differences. It’s called the mean squared error as you’re finding the average of a set of errors.
  • The lower the MSE, the better the forecast.
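
As a reference, a minimal NumPy sketch of MSE (the example numbers are made up):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

print(mse([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # (0.25 + 0 + 2.25) / 3 = 0.8333...
```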

Mean Square Logarithmic Error

  • MSLE treats small differences between small true and predicted values approximately the same as big differences between large true and predicted values.
  • Use MSLE when doing regression, believing that your target, conditioned on the input, is normally distributed, and you don’t want large errors to be significantly more penalized than small ones, as in cases where the range of the target value is large.
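
A minimal NumPy sketch of MSLE; log1p is used so that zero values do not produce log(0) (this assumes non-negative targets and predictions):

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean squared logarithmic error: MSE computed on log(1 + value)."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)
```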

Mean Absolute Error

  • Absolute Error is the amount of error in your measurements.
  • The Mean Absolute Error (MAE) is the average of all absolute errors.
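
A minimal NumPy sketch of MAE:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))
```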

Mean Absolute Percentage Error

  • The mean absolute percentage error (MAPE) is a measure of how accurate a forecast system is.
  • It measures this accuracy as a percentage:

$$M = \frac{100\%}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right|$$

where M is the mean absolute percentage error, n is the number of terms in the summation, A_t is the actual value and F_t is the forecast value.
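
A minimal NumPy sketch of MAPE, returning a percentage (it assumes no actual value is exactly zero):

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    actual, forecast = np.asarray(actual, dtype=float), np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))
```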

Binary Cross-Entropy Loss

We will have 2 classes, positive and negative. To measure how good our predicted probabilities are, the loss should return high values for bad predictions and low values for good predictions. The formula for Binary Cross-Entropy/Log Loss is given below.

Taking the negative log heavily penalizes a prediction that assigns a low probability to the positive (true) class.
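
The standard form is BCE = −(1/N) Σ [y·log(p) + (1 − y)·log(1 − p)]; a minimal NumPy sketch, with clipping so log(0) is never taken:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """BCE = -mean(y*log(p) + (1-y)*log(1-p)) for labels in {0, 1}."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
```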

KL Divergence

The Kullback-Leibler Divergence, or “KL Divergence” for short, is a measure of dissimilarity between two distributions.

This means that the closer p(y) gets to q(y), the lower the divergence. So we need to find a good p(y) to use: the best possible p(y) is the one that minimizes the cross-entropy.
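
For discrete distributions, KL(p‖q) = Σ p(y)·log(p(y)/q(y)). A minimal NumPy sketch:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete probability distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))
```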

Cross-Entropy Loss

The cross-entropy loss is given below.

where ti and si are the ground truth and the CNN score for each class i in C. As an activation function (Sigmoid/Softmax) is usually applied to the scores before the CE loss computation, we write f(si) to refer to the activations.
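
In this notation the loss is CE = −Σ ti·log(f(si)). A minimal NumPy sketch using a softmax activation, assuming t is a one-hot (or probability) target vector and s a vector of raw scores:

```python
import numpy as np

def softmax(s):
    """Softmax activation; scores are shifted by the max for numerical stability."""
    e = np.exp(s - np.max(s))
    return e / np.sum(e)

def cross_entropy(t, s):
    """CE = -sum_i t_i * log(f(s_i))."""
    t = np.asarray(t, dtype=float)
    f = softmax(np.asarray(s, dtype=float))
    return -np.sum(t * np.log(f + 1e-12))
```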

Sparse multi-class loss function

The only difference between sparse categorical cross entropy and categorical cross entropy is the format of the true labels. In a single-label, multi-class classification problem, the labels are mutually exclusive, meaning each data entry can only belong to one class. We can then represent y_true using one-hot embeddings.

For example, y_true with 3 samples belonging to classes 2, 0, and 2 is [2, 0, 2] in the sparse format.

As one-hot embeddings it becomes:

[[0, 0, 1],
 [1, 0, 0],
 [0, 0, 1]]
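
A minimal NumPy sketch of the sparse variant, where the label is an integer class index rather than a one-hot vector:

```python
import numpy as np

def sparse_categorical_cross_entropy(class_index, probs, eps=1e-12):
    """Picks out -log of the predicted probability of the true class."""
    probs = np.asarray(probs, dtype=float)
    return -np.log(probs[class_index] + eps)

print(sparse_categorical_cross_entropy(2, [0.1, 0.2, 0.7]))  # -log(0.7) ~ 0.357
```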

Categorical Cross-Entropy loss

Also called Softmax Loss. It is a Softmax activation plus a Cross-Entropy loss. If we use this loss, we will train a CNN to output a probability over the C classes for each image. It is used for multi-class classification.


In the specific (and usual) case of multi-class classification the labels are one-hot, so only the positive class Cp keeps its term in the loss: there is only one element of the target vector t which is not zero, ti = tp. Discarding the elements of the summation which are zero due to the target labels, we can write:
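
With a softmax activation f over the scores s, this reduces to the familiar softmax-loss form:

$$CE = -\log\left(\frac{e^{s_p}}{\sum_{j}^{C} e^{s_j}}\right)$$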

Huber Loss

Huber loss is a superb combination of linear and quadratic scoring methods. It has an additional hyperparameter, delta. The loss is linear for errors above delta and quadratic below delta. This parameter is tunable according to your data, which makes the Huber loss special.

The following figure shows the change in Huber loss for different values of the delta against error.

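
A minimal NumPy sketch of Huber loss with the delta threshold:

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    quadratic = 0.5 * error ** 2
    linear = delta * (np.abs(error) - 0.5 * delta)
    return np.mean(np.where(np.abs(error) <= delta, quadratic, linear))
```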

Hinge Loss

The x-axis represents the distance from the boundary of any single instance, and the y-axis represents the loss size, or penalty, that the function will incur depending on its distance.


Hinge loss is actually quite simple to compute. The formula for hinge loss is given by the following:

with l referring to the loss of any given instance, y[i] and x[i] referring to the i-th instance in the training set, and b referring to the bias term.
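
A minimal sketch for a linear model, assuming labels y in {-1, +1}; the weight vector w is not named in the text above but is the usual linear-SVM weight:

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Average of max(0, 1 - y_i * (w . x_i + b)) over the training set."""
    scores = X @ w + b
    return np.mean(np.maximum(0.0, 1.0 - y * scores))
```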

Squared Hinge Loss

Suppose that you need to draw a very fine decision boundary. In that case, you wish to punish larger errors more significantly than smaller errors. Squared hinge loss may then be what you are looking for, especially when you already considered the hinge loss function for your machine learning problem.

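
The squared variant simply squares the per-sample hinge penalty, following the same sketch as above:

```python
import numpy as np

def squared_hinge_loss(w, b, X, y):
    """Like hinge loss, but larger margin violations are punished quadratically."""
    scores = X @ w + b
    return np.mean(np.maximum(0.0, 1.0 - y * scores) ** 2)
```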

Pairwise Ranking loss

In this setup, positive and negative pairs of training data points are used. Positive pairs are composed of an anchor sample xa and a positive sample xp, which is similar to xa in the metric we aim to learn, and negative pairs are composed of an anchor sample xa and a negative sample xn, which is dissimilar to xa in that metric.

Pairwise Ranking Loss forces representations to have a distance of 0 for positive pairs, and a distance greater than a margin for negative pairs. With ra, rp and rn being the sample representations and d a distance function, we can write:

For negative pairs, the loss will be 0 when the distance between the representations of the two pair elements is greater than the margin m. But when that distance is not bigger than m, the loss will be positive, and the net parameters will be updated to produce more distant representations for those two elements. The loss value will be at most m, when the distance between ra and rn is 0. The function of the margin is that, when the representations produced for a negative pair are distant enough, no effort is wasted on enlarging that distance, so further training can focus on more difficult pairs.

If r0 and r1 are the pair element representations, y is a binary flag equal to 0 for a negative pair and to 1 for a positive pair, and the distance d is the Euclidean distance, we can equivalently write:
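
Putting the two cases together, a minimal NumPy sketch of the combined form L(r0, r1, y) = y·d(r0, r1) + (1 − y)·max(0, m − d(r0, r1)) with Euclidean distance (some formulations use the squared distance instead):

```python
import numpy as np

def pairwise_ranking_loss(r0, r1, y, margin=1.0):
    """y = 1: pull the pair together; y = 0: push it apart beyond the margin."""
    d = np.linalg.norm(np.asarray(r0, dtype=float) - np.asarray(r1, dtype=float))
    return y * d + (1 - y) * max(0.0, margin - d)
```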

Triplet loss

This setup outperforms the former by using triplets of training data samples instead of pairs. The triplets are formed by an anchor sample xa, a positive sample xp and a negative sample xn. The objective is that the distance between the anchor and negative representations d(ra,rn) is greater than the distance between the anchor and positive representations d(ra,rp) by at least a margin m.

The three situations to analyze:

  • Easy Triplets: d(ra,rn)>d(ra,rp)+m. The negative sample is already sufficiently distant from the anchor sample with respect to the positive sample in the embedding space. The loss is 0 and the net parameters are not updated.

  • Hard Triplets: d(ra,rn)<d(ra,rp). The negative sample is closer to the anchor than the positive. The loss is positive (and greater than m).

  • Semi-Hard Triplets: d(ra,rp)<d(ra,rn)<d(ra,rp)+m. The negative sample is more distant from the anchor than the positive, but the distance is not greater than the margin, so the loss is still positive (and smaller than m).
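
All three cases are captured by L = max(0, d(ra,rp) − d(ra,rn) + m); a minimal NumPy sketch:

```python
import numpy as np

def triplet_loss(r_a, r_p, r_n, margin=1.0):
    """Zero for easy triplets, positive for semi-hard and hard triplets."""
    d_ap = np.linalg.norm(np.asarray(r_a, dtype=float) - np.asarray(r_p, dtype=float))
    d_an = np.linalg.norm(np.asarray(r_a, dtype=float) - np.asarray(r_n, dtype=float))
    return max(0.0, d_ap - d_an + margin)
```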

Ranking Losses with Different Names

  • Ranking loss: This name comes from the information retrieval field, where we want to train models to rank items in a specific order.

  • Margin Loss: This name comes from the fact that these losses use a margin to compare the distances between sample representations.

  • Contrastive Loss: Contrastive refers to the fact that these losses are computed by contrasting the representations of two or more data points. This name is often used for Pairwise Ranking Loss, but I’ve never seen it used in a setup with triplets.

  • Triplet Loss: Often used as the loss name when triplets of training samples are employed.

  • Hinge loss: Also known as max-margin objective. It’s used for training SVMs for classification. It has a similar formulation in the sense that it optimizes until a margin. That’s why this name is sometimes used for Ranking Losses.

Center loss

Center loss helps in discriminating between features. The key is to minimize the intra-class variations while keeping the features of different classes separable. To this end, the center loss function was proposed.

For this loss, you define a per-class center which serves as the centroid of the embeddings corresponding to that class. The gradient update is done over the mini-batch, and a hyperparameter alpha controls the learning rate of the centers. The update is given by:

Another scalar, lambda, is used to balance the two loss functions. The total loss is:
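
A simplified NumPy sketch, not an exact reproduction of the paper's implementation: the center term L_C = ½ Σ ||x_i − c_{y_i}||² over a mini-batch, and a mini-batch center update scaled by alpha. The total loss is then L = L_softmax + lambda·L_C.

```python
import numpy as np

def center_loss(features, labels, centers):
    """L_C = 0.5 * sum_i ||x_i - c_{y_i}||^2 over a mini-batch."""
    diffs = features - centers[labels]
    return 0.5 * np.sum(diffs ** 2)

def update_centers(features, labels, centers, alpha=0.5):
    """Move each class center toward its samples in the batch, scaled by alpha."""
    for j in np.unique(labels):
        batch_j = features[labels == j]
        delta = np.sum(centers[j] - batch_j, axis=0) / (1.0 + len(batch_j))
        centers[j] = centers[j] - alpha * delta
    return centers
```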

Last Updated: Sep-12
