Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalized dice loss for multi-class segmentation #9395

Closed
emoebel opened this issue Feb 15, 2018 · 52 comments
Closed

Generalized dice loss for multi-class segmentation #9395

emoebel opened this issue Feb 15, 2018 · 52 comments

Comments

@emoebel
Copy link

emoebel commented Feb 15, 2018

Hey guys, I just implemented the generalised dice loss (multi-class version of dice loss), as described in ref :
(my targets are defined as: (batch_size, image_dim1, image_dim2, image_dim3, nb_of_classes))

def generalized_dice_loss_w(y_true, y_pred): 
    # Compute weights: "the contribution of each label is corrected by the inverse of its volume"
    Ncl = y_pred.shape[-1]
    w = np.zeros((Ncl,))
    for l in range(0,Ncl): w[l] = np.sum( np.asarray(y_true[:,:,:,:,l]==1,np.int8) )
    w = 1/(w**2+0.00001)

    # Compute gen dice coef:
    numerator = y_true*y_pred
    numerator = w*K.sum(numerator,(0,1,2,3))
    numerator = K.sum(numerator)
    
    denominator = y_true+y_pred
    denominator = w*K.sum(denominator,(0,1,2,3))
    denominator = K.sum(denominator)
    
    gen_dice_coef = numerator/denominator
    
    return 1-2*gen_dice_coef

But something must be wrong. I'm working with 3D images that I have to segment for 4 classes (1 background class and 3 object classes, I have a imbalanced dataset). First odd thing: while my train loss and accuracy improve during training (and converge really fast), my validation loss/accuracy are constant trough epochs (see image). Second, when predicting on test data, only the background class is predicted: I get a constant volume.

I used the exact same data and script but with categorical cross-entropy loss and get plausible results (object classes are segmented). Which means something is wrong with my implementation. Any idea what it could be?

Plus I believe it would be usefull to the keras community to have a generalised dice loss implementation, as it seems to be used in most of recent semantic segmentation tasks (at least in the medical image community).

PS: it seems odd to me how the weights are defined; I get values around 10^-10. Anyone else has tried to implement this? I also tested my function without the weights but get same problems.

@xychenunc
Copy link

Hi, I came across a similar issue. The dataset is imbalanced, and the region which is small compared to the whole image cannot be well segmented. I think it has nothing to do with your loss function, maybe it is due to patch-based approach. How large you choose your patch size?

@xychenunc
Copy link

And I think the problem with your loss function is the weights are not normalized. I think a normalized weights should be what you want. And w = 1/(w**2+0.00001) maybe should be rewritten as something like w = w/(np.sum(w)+0.00001). Otherwise, the generalized loss is not 'balanced', the region which takes a larger portion of the image accounts for a relatively small part in the total loss.

@emoebel
Copy link
Author

emoebel commented Mar 6, 2018

Hey xynechunc, thanks for your answer! I tried normalizing the weights, but it didn't do any difference. While it is true that the weight values are better interpretable (instead of values around 10^-10 I have now values between 0 and 1), it seems that numerically it does not change the loss behaviour.

Why are you asking about the patch size? Aren't the weights supposed to cope with the unbalanced class problem? Anyway, my patch size is 56x56x56 voxel, and my objects have a diameter of 10 voxel. In my patches, I have in average 7% of voxels labeled as object, the rest is background.

@jpcenteno80
Copy link

I am trying something similar for a 2D semantic segmentation project with 10 categories (label 0 is background). Before trying dice, I was using sparse categorical crossentropy with very good results. However, because label 0 was being included in the loss calculation, both training and validation accuracy were artificially high (> 0.98). My implementation of dice is based on this: https://github.com/Lasagne/Recipes/issues/99.

y_true has shape (batch,m,n,1) and y_pred has shape (batch,m,n,10). Here is my version of dice:

def dice_coef_9cat(y_true, y_pred, smooth=1e-7):
    '''
    Dice coefficient for 10 categories. Ignores background pixel label 0
    Pass to model as metric during compile statement
    '''
    y_true_f = K.flatten(K.one_hot(K.cast(y_true, 'int32'), num_classes=10)[...,1:])
    y_pred_f = K.flatten(y_pred[...,1:])
    intersect = K.sum(y_true_f * y_pred_f, axis=-1)
    denom = K.sum(y_true_f + y_pred_f, axis=-1)
    return K.mean((2. * intersect / (denom + smooth)))

def dice_coef_9cat_loss(y_true, y_pred):
    '''
    Dice loss to minimize. Pass to model as loss during compile statement
    '''
    return 1 - dice_coef_9cat(y_true, y_pred)

A model trained with the above implementation of dice tends to predict 4 out of the 9 categories and the segmentation is less than ideal and much worse than I got with sparse categorical crossentropy.

However, when I convert the segmentation task into a binary decision (merge all categories into one), the segmentation is pretty good. Here is the loss function changed for a binary problem:

def dice_coef_binary(y_true, y_pred, smooth=1e-7):
    '''
    Dice coefficient for 2 categories. Ignores background pixel label 0
    Pass to model as metric during compile statement
    '''
    y_true_f = K.flatten(K.one_hot(K.cast(y_true, 'int32'), num_classes=2)[...,1:])
    y_pred_f = K.flatten(y_pred[...,1:])
    intersect = K.sum(y_true_f * y_pred_f, axis=-1)
    denom = K.sum(y_true_f + y_pred_f, axis=-1)
    return K.mean((2. * intersect / (denom + smooth)))


def dice_coef_binary_loss(y_true, y_pred):
    '''
    Dice loss to minimize. Pass to model as loss during compile statement
    '''
    return 1 - dice_coef_binary(y_true, y_pred)

Not sure if the binary results are better because it is an 'easier' task or because my dice loss function is wrong.

@xychenunc
Copy link

xychenunc commented Mar 20, 2018 via email

@emoebel
Copy link
Author

emoebel commented Mar 27, 2018

@xychenunc thanks for your answer, I realised also that the problem is class imbalance. I just dont understand why, because the weights in the loss function are supposed to cope for that.

@jpcenteno80 does sparse-categorical-crossentropy work better than normal categorical-crossentropy in the case of segmentation? I tried it out myself, but I'm getting an error concerning array shapes:
ValueError: Error when checking target: expected conv3d_15 to have shape (56, 56, 56, 1) but got array with shape (56, 56, 56, 4)

Concerning your loss function 'dice_coef_9cat_loss': I don't think it is a good idea to ignore background. Examples of "non-objects" are as important as examples of "objects", the problem is class imbalance. If you completely ignore the background, chances are that you'll get a lot of false positives.
Concerning your function 'dice_coef_binary_loss': maybe it works better because by merging all your object classes there is a better balance between "background" and "objects".

Check out the ref I cited in my original post, they describe how to implement dice loss for multiple imbalanced classes. Maybe your implementation will work, because mine doesn't and I don't understand why.

@jpcenteno80
Copy link

When using sparse categorical crossentropy, you don't need to one-hot encode your target. Here is a discussion comparing crossentropy vs sparse crossentropy in stackoverflow (google: TensorFlow: what's the difference between sparse_softmax_cross_entropy_with_logits and softmax_cross_entropy_with_logits?) <- Copying the link wasn't working for me.
So in my case (2D segmentation model) I feed in the raw mask as my target, which is an array 384x384 with pixels labeled from 0 to 9 (10 categories).

I ended up getting decent results with the 'dice_coef_9cat_loss' function after all (neglecting the background label). It just took longer training and starting with lower learning rates (Nadam, 1e-5 or 1e-4). I also tried cyclical learning rates, which I think helped: https://github.com/bckenstler/CLR. I decided to neglect background in the loss calculation because my class imbalance was pretty large and I could not figure out in Keras how to use 'sample_weight' in the fit method with 2D arrays.

I'm also transitioning into pytorch and I like that it seems more flexible in terms of setting up custom metrics or loss functions. I am using the tiramisu architecture for semantic segmentation which uses negative log likelihood as the loss (implementation here: https://github.com/bfortuner/pytorch_tiramisu). The results so far are great. I highly recommend using this architecture for semantic segmentation. Have not tried it in 3D though.

@zdryan
Copy link

zdryan commented Mar 28, 2018

I've taken the same approach as @jpcenteno80 as I am also unable to successfully implement the generalised dice loss. Would rather avoid using temporal sample weights.

@emoebel
Copy link
Author

emoebel commented Apr 6, 2018

Hey guys, I found a way to implement multi-class dice loss, I get satisfying segmentations now. I implemented the loss as explained in ref : this paper describes the Tversky loss, a generalised form of dice loss, which is identical to dice loss when alpha=beta=0.5

Here is my implementation, for 3D images:

# Ref: salehi17, "Twersky loss function for image segmentation using 3D FCDN"
# -> the score is computed for each class separately and then summed
# alpha=beta=0.5 : dice coefficient
# alpha=beta=1   : tanimoto coefficient (also known as jaccard)
# alpha+beta=1   : produces set of F*-scores
# implemented by E. Moebel, 06/04/18
def tversky_loss(y_true, y_pred):
    alpha = 0.5
    beta  = 0.5
    
    ones = K.ones(K.shape(y_true))
    p0 = y_pred      # proba that voxels are class i
    p1 = ones-y_pred # proba that voxels are not class i
    g0 = y_true
    g1 = ones-y_true
    
    num = K.sum(p0*g0, (0,1,2,3))
    den = num + alpha*K.sum(p0*g1,(0,1,2,3)) + beta*K.sum(p1*g0,(0,1,2,3))
    
    T = K.sum(num/den) # when summing over classes, T has dynamic range [0 Ncl]
    
    Ncl = K.cast(K.shape(y_true)[-1], 'float32')
    return Ncl-T

I would be curious to know if this works for your applications. To adapt from 3D images to 2D images, you should modify all "sum(...,(0,1,2,3))" to "sum(...,(0,1,2))".

@gattia
Copy link

gattia commented Apr 6, 2018

@lazyleaf, I just stumbled upon this. I am doing 3D segmentation on multiclass. I will definitely try out the proposed method and see how it works. However, I also have another solution that has worked for me in the past:

def dice_coef(y_true, y_pred):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_coef_multilabel(y_true, y_pred, numLabels=5):
    dice=0
    for index in range(numLabels):
        dice -= dice_coef(y_true[:,index,:,:,:], y_pred[:,index,:,:,:])
    return dice

This simply calculates the dice score for each individual label, and then sums them together, and includes the background. The best dice score you will ever get is equal to numLables*-1.0. When monitoring I always keep in mind that the dice for the background is almost always near 1.0.

@kroskal
Copy link

kroskal commented Apr 17, 2018

@lazyleaf, I was also struggling to implement this loss function. But with som inspiration from your code, here is my take on it (for 2D images).
Ps! I haven't tried to train a network with it yet.

def generalized_dice_coeff(y_true, y_pred):
    Ncl = y_pred.shape[-1]
    w = K.zeros(shape=(Ncl,))
    w = K.sum(y_true, axis=(0,1,2))
    w = 1/(w**2+0.000001)
    # Compute gen dice coef:
    numerator = y_true*y_pred
    numerator = w*K.sum(numerator,(0,1,2,3))
    numerator = K.sum(numerator)

    denominator = y_true+y_pred
    denominator = w*K.sum(denominator,(0,1,2,3))
    denominator = K.sum(denominator)

    gen_dice_coef = 2*numerator/denominator

    return gen_dice_coef

def generalized_dice_loss(y_true, y_pred):
    return 1 - generalized_dice_coeff(y_true, y_pred)

@lkshrsch
Copy link

lkshrsch commented May 7, 2018

@lazyleaf thank you for pointing to the tversky loss. I implemented your code (had to change K.shape --> K.int_shape ) but it still complaints that "TypeError: long() argument must be a string or a number, not 'NoneType'".

Do you know why this is happening, and do you see it in your own code?

@rakeshakkineni
Copy link

@lkshrsch To remove the compilation error i have replaced "ones = K.ones(K.shape(y_true))" with "ones = K.ones_like(y_true)".

@DaanKuppens
Copy link

@kroskal Thanks for the implementation. Did you try training a network yet with this loss?
When I use your generalized dice loss I somehow get loss values larger than 1 and dice coefficients smaller than 0. Do you know what may cause this?

@DaanKuppens
Copy link

I have the generalized_dice_coef and generalized_dice_loss now working between [0 1] for 2D images. I normalized the weights to the presence of the class in the entire dataset instead of just the batch, using the following code:

def generalized_dice_coeff(y_true, y_pred):
n_el = 1
for dim in y_train.shape: 
    n_el *= int(dim)
n_cl = y_train.shape[-1]
w = K.zeros(shape=(n_cl,))
w = (K.sum(y_true, axis=(0,1,2)))/(n_el)
w = 1/(w**2+0.000001)
numerator = y_true*y_pred
numerator = w*K.sum(numerator,(0,1,2))
numerator = K.sum(numerator)
denominator = y_true+y_pred
denominator = w*K.sum(denominator,(0,1,2))
denominator = K.sum(denominator)
return 2*numerator/denominator

@CanPhi
Copy link

CanPhi commented Jun 6, 2018

Has anyone successfully trained a network with @lazyleaf 's implementation of the tversky loss yet?

@darrell-rg
Copy link

darrell-rg commented Jun 19, 2018

I trained a RGB U-Net with 2 output classes using tversky @lazyleaf 's loss and it worked great. Got 96% accuracy in only 32 epochs. I am using Tensorflow-GPU for the backend, if you use Theano the tensors are in a different order and may work differently. It is critical that your inputs are arranged in the correct order, beware of reshapes that might not work how you expect.

@xavieryxie
Copy link

Has anyone successfully trained a network with generalized dice loss for 5-class segmentation ? It is strange that i just can segment out one of class in five classes.

@freekoy
Copy link

freekoy commented Sep 27, 2018

@lazyleaf
Can you tell me what is the activation function of the model you used for this loss function, softmax?

@freekoy
Copy link

freekoy commented Sep 27, 2018

@jpcenteno80 Can you tell me what is the activation function of the model you used for this dice_coef_binary_loss loss function, softmax?and model last layer output shape?

@bowfan94
Copy link

@lazyleaf Hi, thank you for your code. But when I 'm trying to implement this loss to my mulit-class segmentation, which has 3 classes, the outpu are just of two classes. Do you know what may cause the problem?

@zheng-xing
Copy link

@lazyleaf Hi, thank you for this excellent general code that can be dice_coef as well as IOU loss depend on the alpha and beta values you choose. Previously tried manually assigned higher weights for classes with less number of pixels. It does solve the problem that certain classes don't appear in the prediction. However the method will cause some false positives. After I tried this dice loss, it works great. All classes gets predicted and there is no false positives.

@wt-huang
Copy link

wt-huang commented Nov 7, 2018

Closing as this is resolved

@sara-eb
Copy link

sara-eb commented Feb 13, 2019

@lazyleaf is p0 soft output (values between [0,1]), probabilities obtained from softmax? Should I use argmax function to get the hard outputs first?
I am trying to implement generalized dice loss for caffe, but still I am struggling,
Thanks

@sara-eb
Copy link

sara-eb commented Feb 17, 2019

Hey guys, I found a way to implement multi-class dice loss, I get satisfying segmentations now. I implemented the loss as explained in ref : this paper describes the Tversky loss, a generalised form of dice loss, which is identical to dice loss when alpha=beta=0.5

Here is my implementation, for 3D images:

# Ref: salehi17, "Twersky loss function for image segmentation using 3D FCDN"
# -> the score is computed for each class separately and then summed
# alpha=beta=0.5 : dice coefficient
# alpha=beta=1   : tanimoto coefficient (also known as jaccard)
# alpha+beta=1   : produces set of F*-scores
# implemented by E. Moebel, 06/04/18
def tversky_loss(y_true, y_pred):
    alpha = 0.5
    beta  = 0.5
    
    ones = K.ones(K.shape(y_true))
    p0 = y_pred      # proba that voxels are class i
    p1 = ones-y_pred # proba that voxels are not class i
    g0 = y_true
    g1 = ones-y_true
    
    num = K.sum(p0*g0, (0,1,2,3))
    den = num + alpha*K.sum(p0*g1,(0,1,2,3)) + beta*K.sum(p1*g0,(0,1,2,3))
    
    T = K.sum(num/den) # when summing over classes, T has dynamic range [0 Ncl]
    
    Ncl = K.cast(K.shape(y_true)[-1], 'float32')
    return Ncl-T

I would be curious to know if this works for your applications. To adapt from 3D images to 2D images, you should modify all "sum(...,(0,1,2,3))" to "sum(...,(0,1,2))".

@lazyleaf, when I write Ncl-T the values are not between 0 and 1, when I remove this line the values are between 0 and 1 and sometimes the loss, becomes 1. Is that correct? or I am doing some mistakes somewhere?

Since I am implementing for Caffe, I have to write the gradients manually, Is that okay that I have only calculated dT/dp0? I just calculate the gradients according to equation 4 in original reference by salehi et al.
PS, according to the paper, I am sending the softmax output (probabilities between 0 and 1) to the loss layer without taking argmax of soft outputs. Is this correct?

@EmielBoss
Copy link

I am trying to perform semantic segmentation in TensorFlow 1.10 with eager execution with the generalized dice loss function:

def generalized_dice_loss(onehots_true, logits):
    onehots_true, logits = mask(onehots_true, logits) # Not all of my pixels contain ground truth, and I filter those pixels out, which results in shape [num_gt_pixels, num_classes]-shaped labels and logits.
    probabilities = tf.nn.softmax(logits)
    weights = 1.0 / (tf.reduce_sum(onehots_true, axis=0)**2)
    weights = tf.clip_by_value(weights, 1e-17, 1.0 - 1e-7) # Is this the correct way of dealing with inf values (the results of zero divisions)?
    numerator = tf.reduce_sum(onehots_true * probabilities, axis=0)
    numerator = tf.reduce_sum(weights * numerator)

    denominator = tf.reduce_sum(onehots_true + probabilities, axis=0)
    denominator = tf.reduce_sum(weights * denominator)

    loss = 1.0 - 2.0 * (numerator + 1e-17) / (denominator + 1e-17)
    return loss

However, I am struggling to get any meaningful loss which isn't always 1. What am I doing wrong here?

After the initial weights (one for each class) are calculated, they contain many inf's from zero divisions, as typically only a small subset of all classes is present in a sample image. Therefore, I clip the weights to the range [1e-17, 1-1e-17] (is this a good idea?), after which they look like this:

tf.Tensor(
[4.89021e-05 2.21410e-10 5.43187e-11 1.00000e+00 1.00000e+00 4.23855e-07
 5.87461e-09 3.13044e-09 2.95369e-07 1.00000e+00 1.00000e+00 2.22499e-05
 1.00000e+00 1.73611e-03 9.47212e-10 1.12608e-05 2.77563e-09 1.00926e-08
 7.74787e-10 1.00000e+00 1.34570e-07], shape=(21,), dtype=float32)

which seems fine to me, though they are pretty small. The numerators (tf.reduce_sum(onehots_true * probabilities, axis=0), prior to their weighting) look like this:

tf.Tensor(
[3.42069e+01 0.00000e+00 9.43506e+03 7.88478e+01 1.50554e-02 0.00000e+00
 1.22765e+01 4.36149e-01 1.75026e+02 0.00000e+00 2.33183e+02 1.81064e-01
 0.00000e+00 1.60128e+02 1.48867e+04 0.00000e+00 3.87697e+00 4.49753e+02
 5.87062e+01 0.00000e+00 0.00000e+00], shape=(21,), dtype=float32)
tf.Tensor(1.0, shape=(), dtype=float32)

which also looks reasonable, since they're basically the labels' respective sizes times the network's certainty about them (which is likely low in the beginning of training). The denominators (tf.reduce_sum(onehots_true + probabilities, axis=0), prior to weighting) also look fine:

tf.Tensor(
[ 14053.483   25004.557  250343.36    66548.234    6653.863    3470.502
   5318.3926 164206.19    19914.338    1951.0701   3559.3235   7248.4717
   5984.786    7902.9004 133984.66    41497.473   25010.273   22232.062
  26451.926   66250.39     6497.735 ], shape=(21,), dtype=float32)

These are large, but that is to be expected since the class probabilities of a pixel sum to 1, and therefore the sum of these denominators should more or less equal the amount of pixels with ground truth.

However, summing the numerators gives a very small sum (~0.001, though occasionally it's in a single digit range) while the denominator sums to very large values. This results in my final loss being exclusively 1, or something really close to that. Does anyone know how I can mitigate this effect and obtain stable gradients?

@samra-irshad
Copy link

@gattia your solution worked for me, thanks.

@gattia
Copy link

gattia commented Oct 23, 2019

@samra-irshad, glad that it worked out for you! It is a simple method of doing it, but makes sense and seems to work.

@ben-davidson-6
Copy link

Since this seems to be quite active despite being closed, and where I ended up from google I'll link my solution to the same problem on stack overflow

@mptorr
Copy link
Contributor

mptorr commented Feb 5, 2020

@gattia how would you modify your code for a 2D application? my images are 2D with channels last and have 4 classes. Thanks in advance for any insight you can provide.

For reference this your original 3D code:

def dice_coef(y_true, y_pred):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_coef_multilabel(y_true, y_pred, numLabels=5):
    dice=0
    for index in range(numLabels):
        dice -= dice_coef(y_true[:,index,:,:,:], y_pred[:,index,:,:,:])
    return dice

@gattia
Copy link

gattia commented Feb 10, 2020

@mptorr thanks for the interest.

in this line:
dice -= dice_coef(y_true[:,index,:,:,:], y_pred[:,index,:,:,:])

the indices for y_true and y_pred are:

[images, label, x, y, z]

images is the many/different images I am trying to segment and are concatenated together. The label is what label index we are using (or channel), and the x, y, z are just the 3 dimensions of the image.

So, if you want a 2d image with channels (labels) last, you would index instead using:

[image,x,y,label]

So, that indexing of y_true and y_pred becomes :

[:,:,:,index]

and the line of code that needs to change becomes:

dice -= dice_coef(y_true[:,:,:,index], y_pred[:,:,:,index])

Hope that helps.

@mptorr
Copy link
Contributor

mptorr commented Feb 11, 2020

@gattia thanks so much, that's extremely helpful!
I'll test and report back with results.
thanks again

@naivomah3
Copy link

naivomah3 commented Feb 21, 2020

@mptorr thanks for the interest.

in this line:
dice -= dice_coef(y_true[:,index,:,:,:], y_pred[:,index,:,:,:])

the indices for y_true and y_pred are:

[images, label, x, y, z]

images is the many/different images I am trying to segment and are concatenated together. The label is what label index we are using (or channel), and the x, y, z are just the 3 dimensions of the image.

So, if you want a 2d image with channels (labels) last, you would index instead using:

[image,x,y,label]

So, that indexing of y_true and y_pred becomes :

[:,:,:,index]

and the line of code that needs to change becomes:

dice -= dice_coef(y_true[:,:,:,index], y_pred[:,:,:,index])

Hope that helps.

I don't know what's wrong but I got negative loss and keeping away from 0 with this implementation despite of the dice score is keeping increasing.
something like,
136/4131 [..............................] - ETA: 19:02 - loss: -1.5337 - dice: 0.6136

@HiphonL
Copy link

HiphonL commented Feb 21, 2020

@gattia @naivomah3 I also got negative loss and keeping away from 0. However, the OA and mIoU indicated that the implementation can work. So, is the negative loss right?

@gattia
Copy link

gattia commented Feb 21, 2020

@HiphonL @naivomah3

Negative dice is correct.
Traditionally, the optimizer is trying to minimize the loss functions (lower score is better). Normal dice doesn’t fit in this paradigm. To satisfy that, you must transform dice in some way. The easiest is to just take the negative of each dice value.

This is being done in the code at:

dice -= dice_coef(y_true[:,index,:,:,:], y_pred[:,index,:,:,:])

It is purposely subtracting each of the calculated dice scores. So, a more negative number is better.

@sneh-debug
Copy link

@gattia
def dice_coefficient(y_true, y_pred):

intersection = K.sum(K.abs(y_true * y_pred), axis=[-3,-2,-1])
dn = K.sum(K.square(y_true) + K.square(y_pred), axis=[-3,-2,-1]) + 1e-8
return K.mean(2 * intersection / dn, axis=[0,1])

def loss_gt(e=1e-8):
def loss_gt_(y_true, y_pred):
return 1-dice_coefficient(y_true, y_pred)

return loss_gt_
model.compile(
adam(lr=1e-4),
[loss_gt(dice_e), loss_VAE(input_shape, z_mean, z_var, weight_L2=weight_L2, weight_KL=weight_KL)],
metrics=[dice_coefficient]
)
is it right??
it is based on Variational autoencoder BraTS 2018 Andriy Myronenko https://arxiv.org/pdf/1810.11654.pdf
for 4 images i got this output:
Epoch 7/50
1/4 [======>.......................] - ETA: 45s - loss: 1.0687 - Dec_GT_Output_loss: 0.9762 - Dec_VAE_Output_loss: 0.0926 - Dec_GT_Output_dice_coefficient: 0.022/4 [==============>...............] - ETA: 30s - loss: 1.0406 - Dec_GT_Output_loss: 0.9439 - Dec_VAE_Output_loss: 0.0967 - Dec_GT_Output_dice_coefficient: 0.053/4 [=====================>........] - ETA: 15s - loss: 1.0531 - Dec_GT_Output_loss: 0.9606 - Dec_VAE_Output_loss: 0.0925 - Dec_GT_Output_dice_coefficient: 0.034/4 [==============================] - 61s 15s/step - loss: 1.0628 - Dec_GT_Output_loss: 0.9695 - Dec_VAE_Output_loss: 0.0933 - Dec_GT_Output_dice_coefficient: 0.0305 - Dec_VAE_Output_dice_coefficient: 0.5581
Epoch 24/50
1/4 [======>.......................] - ETA: 45s - loss: 1.0252 - Dec_GT_Output_loss: 0.9991 - Dec_VAE_Output_loss: 0.0262 - Dec_GT_Output_dice_2/4 [==============>...............] - ETA: 30s - loss: 1.0249 - Dec_GT_Output_loss: 0.9986 - Dec_VAE_Output_loss: 0.0264 - Dec_GT_Output_dice_3/4 [=====================>........] - ETA: 15s - loss: 1.0237 - Dec_GT_Output_loss: 0.9981 - Dec_VAE_Output_loss: 0.0256 - Dec_GT_Output_dice_4/4 [==============================] - 62s 15s/step - loss: 1.0247 - Dec_GT_Output_loss: 0.9978 - Dec_VAE_Output_loss: 0.0269 - Dec_GT_Output_dice_coefficient: 0.0022 - Dec_VAE_Output_dice_coefficient: 0.8860

@gattia
Copy link

gattia commented Feb 25, 2020

@sneh-debug it’s hard to follow the code and other copied items. Looks like there are defs within other defs, but I can’t really tell.

If you format the code (proper indentation etc) I will try to review.

Initially, looking at the loss, from training printout something seems wrong. Looks like the code tries to do 1- dice to get the loss. So, max dice should be 1 and lower should be better (lowest possible = 0). However, you are getting Dice loss >1, so this seems like a problem.

@sneh-debug
Copy link

@gattia with indentation:
def dice_coefficient(y_true, y_pred):

y_true_f = K.flatten(y_true)

#y_pred_f = K.flatten(y_pred)
intersection = K.sum(K.abs(y_true * y_pred), axis=[-3,-2,-1])
dn = K.sum(K.square(y_true) + K.square(y_pred), axis=[-3,-2,-1]) + 1e-8
return K.mean(2 * intersection / dn, axis=[0,1])

def loss_gt(e=1e-8):
"""
loss_gt(e=1e-8)
------------------------------------------------------
Since keras does not allow custom loss functions to have arguments
other than the true and predicted labels, this function acts as a wrapper
that allows us to implement the custom loss used in the paper. This function
only calculates - L term of the following equation. (i.e. GT Decoder part loss)

L = - L<dice> + weight_L2 ∗ L<L2> + weight_KL ∗ L<KL>

Parameters
----------
`e`: Float, optional
    A small epsilon term to add in the denominator to avoid dividing by
    zero and possible gradient explosion.
    
Returns
-------
loss_gt_(y_true, y_pred): A custom keras loss function
    This function takes as input the predicted and ground labels, uses them
    to calculate the dice loss.
    
"""
def loss_gt_(y_true, y_pred):
    intersection = K.sum(K.abs(y_true * y_pred), axis=[-3,-2,-1])
    dn = K.sum(K.square(y_true) + K.square(y_pred), axis=[-3,-2,-1]) + 1e-8
    
    return -K.mean(2 * intersection / dn, axis=[0,1])

return loss_gt_

def loss_VAE(input_shape, z_mean, z_var, weight_L2=0.1, weight_KL=0.1):
"""
loss_VAE(input_shape, z_mean, z_var, weight_L2=0.1, weight_KL=0.1)
------------------------------------------------------
Since keras does not allow custom loss functions to have arguments
other than the true and predicted labels, this function acts as a wrapper
that allows us to implement the custom loss used in the paper. This function
calculates the following equation, except for -L term. (i.e. VAE decoder part loss)

L = - L<dice> + weight_L2 ∗ L<L2> + weight_KL ∗ L<KL>

Parameters
----------
 `input_shape`: A 4-tuple, required
    The shape of an image as the tuple (c, H, W, D), where c is
    the no. of channels; H, W and D is the height, width and depth of the
    input image, respectively.
`z_mean`: An keras.layers.Layer instance, required
    The vector representing values of mean for the learned distribution
    in the VAE part. Used internally.
`z_var`: An keras.layers.Layer instance, required
    The vector representing values of variance for the learned distribution
    in the VAE part. Used internally.
`weight_L2`: A real number, optional
    The weight to be given to the L2 loss term in the loss function. Adjust to get best
    results for your task. Defaults to 0.1.
`weight_KL`: A real number, optional
    The weight to be given to the KL loss term in the loss function. Adjust to get best
    results for your task. Defaults to 0.1.
    
Returns
-------
loss_VAE_(y_true, y_pred): A custom keras loss function
    This function takes as input the predicted and ground labels, uses them
    to calculate the L2 and KL loss.
    
"""
def loss_VAE_(y_true, y_pred):
    c, H, W, D = input_shape
    n = c * H * W * D
    
    loss_L2 = K.mean(K.square(y_true - y_pred), axis=(1, 2, 3, 4)) # original axis value is (1,2,3,4).

    loss_KL = (1 / n) * K.sum(
        K.exp(z_var) + K.square(z_mean) - 1. - z_var,
        axis=-1
    )

    return weight_L2 * loss_L2 + weight_KL * loss_KL

return loss_VAE_

model.compile(
adam(lr=1e-4),
[loss_gt(dice_e), loss_VAE(input_shape, z_mean, z_var, weight_L2=weight_L2, weight_KL=weight_KL)],
metrics=[dice_coefficient]
)

@xychenunc
Copy link

I am trying something similar for a 2D semantic segmentation project with 10 categories (label 0 is background). Before trying dice, I was using sparse categorical crossentropy with very good results. However, because label 0 was being included in the loss calculation, both training and validation accuracy were artificially high (> 0.98). My implementation of dice is based on this: https://github.com/Lasagne/Recipes/issues/99.

y_true has shape (batch,m,n,1) and y_pred has shape (batch,m,n,10). Here is my version of dice:

def dice_coef_9cat(y_true, y_pred, smooth=1e-7):
    '''
    Dice coefficient for 10 categories. Ignores background pixel label 0
    Pass to model as metric during compile statement
    '''
    y_true_f = K.flatten(K.one_hot(K.cast(y_true, 'int32'), num_classes=10)[...,1:])
    y_pred_f = K.flatten(y_pred[...,1:])
    intersect = K.sum(y_true_f * y_pred_f, axis=-1)
    denom = K.sum(y_true_f + y_pred_f, axis=-1)
    return K.mean((2. * intersect / (denom + smooth)))

def dice_coef_9cat_loss(y_true, y_pred):
    '''
    Dice loss to minimize. Pass to model as loss during compile statement
    '''
    return 1 - dice_coef_9cat(y_true, y_pred)

A model trained with the above implementation of dice tends to predict 4 out of the 9 categories and the segmentation is less than ideal and much worse than I got with sparse categorical crossentropy.

However, when I convert the segmentation task into a binary decision (merge all categories into one), the segmentation is pretty good. Here is the loss function changed for a binary problem:

def dice_coef_binary(y_true, y_pred, smooth=1e-7):
    '''
    Dice coefficient for 2 categories. Ignores background pixel label 0
    Pass to model as metric during compile statement
    '''
    y_true_f = K.flatten(K.one_hot(K.cast(y_true, 'int32'), num_classes=2)[...,1:])
    y_pred_f = K.flatten(y_pred[...,1:])
    intersect = K.sum(y_true_f * y_pred_f, axis=-1)
    denom = K.sum(y_true_f + y_pred_f, axis=-1)
    return K.mean((2. * intersect / (denom + smooth)))


def dice_coef_binary_loss(y_true, y_pred):
    '''
    Dice loss to minimize. Pass to model as loss during compile statement
    '''
    return 1 - dice_coef_binary(y_true, y_pred)

Not sure if the binary results are better because it is an 'easier' task or because my dice loss function is wrong.

I noticed that this implementation of multi-class dice loss could lead to sub-optimal performance in some cases (probably depending on specific dataset and/or architecture design). That is, for some minority class(es), the prediction can be nothing. Have you encountered this problem in your work and how did you solve it? Thanks

@gattia
Copy link

gattia commented Jul 4, 2020

This can sometimes be resolved by tuning other parameters (learning rate, optimizer, etc.)

If you have many classes that are imbalanced it can cause issues or not equally weight them. In this case, you can create weighting schemes that can sometimes help.

@maxvfischer
Copy link

maxvfischer commented Aug 4, 2020

I've implemented a bunch of binary and multi-class loss functions for image segmentation (one-hot encoded masks) that you might find useful: https://github.com/maxvfischer/keras-image-segmentation-loss-functions

E.g.:
Tanimoto loss: https://github.com/maxvfischer/keras-image-segmentation-loss-functions/blob/master/losses/multiclass_losses.py#L8
Dice's coefficient loss: https://github.com/maxvfischer/keras-image-segmentation-loss-functions/blob/master/losses/multiclass_losses.py#L42
Squared Dice's coefficient loss: https://github.com/maxvfischer/keras-image-segmentation-loss-functions/blob/master/losses/multiclass_losses.py#L74

@rohan19250
Copy link

Hello - I am working on 4 class segmentation problem, so I have 4 labels. I am able to get combined dice scores and losses using the functions below:

from keras import backend as K
def dice_coef(y_true, y_pred, epsilon=1e-6):
    intersection = K.sum(K.abs(y_true * y_pred), axis=-1)
    return (2. * intersection) / (K.sum(K.square(y_true),axis=-1) + K.sum(K.square(y_pred),axis=-1) + epsilon)

def dice_coef_loss(y_true, y_pred):
    return 1-dice_coef(y_true, y_pred)

How can I get dice coefficient and dice loss per label instead of a combined dice coefficient (see below)?

Epoch 1/40
100/100 [==============================] - 171s 2s/step - loss: 0.2989 - dice_coef: 0.7011 - val_loss: 0.2194 - val_dice_coef: 0.7806
Epoch 2/40
100/100 [==============================] - 74s 739ms/step - loss: 0.0272 - dice_coef: 0.9728 - val_loss: 0.0457 - val_dice_coef: 0.9543

@maxvfischer
Copy link

maxvfischer commented Dec 10, 2020

@rohan19250 I don't know what optimizer you're using during training, but presuming that you're using a gradient-based optimizer like SGD or ADAM, you want a single loss value to be able to optimize the network.

That being said, if you still want to compute dice_coef and dice_coef_loss for each label seperately, I would add them as Keras metrics.

You could probably try to add something like this (NOTE: I have not tested this code. Think of it as pseudo-code):

def dice_coef_single_label(class_idx: int, name: str, epsilon=1e-6) -> Callable[[tf.Tensor, tf.Tensor], tf.Tensor]:
    def dice_coef(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
        # Extract single class to compute dice coef
        y_true_single_class = y_true[..., class_idx]
        y_pred_single_class = y_pred[..., class_idx]

        intersection = K.sum(K.abs(y_true_single_class * y_pred_single_class))
        return (2. * intersection) / (K.sum(K.square(y_true_single_class)) + K.sum(K.square(y_pred_single_class)) + epsilon)

    dice_coef.__name__ = f"dice_coef_{name}"  # Set name used to log metric
    return dice_coef

Then compile your model in this fashion:

# The order needs to be the same as the order in your target/label tensor. 
# In this case, you need to have (None, <IMG_HEIGHT>, <IMG_WIDTH>, 4)
classes = ['dog', 'cat', 'horse', 'bird'] 
model.compile(optimizer=<YOUR_OPTIMIZER>,
    loss=<YOUR_LOSS>,
    metrics=[
        *[dice_coef_single_label(class_idx=idx, name=class_name)
        for idx, class_name in enumerate(classes)]
    ]
)

@rohan19250
Copy link

Thanks a lot @maxvfischer! This worked. So we are not summing over the last axis (excluded"axis=-1") in this function for individual labels. Could you explain this a bit?

@maxvfischer
Copy link

maxvfischer commented Dec 10, 2020

@rohan19250

What axis=-1 means is that you're referring to the last axis (without knowing the actual index of the last axis). In your case when computing the dice loss for all labels, the tensors y_true and y_pred are of shape (None, <IMAGE_HEIGHT>, <IMAGE_WIDTH>, 4), where None is the unknown batch size and 4 is the amount of classes you have. By running K.sum(..., axis=-1) you are summing over all the classes (last channel).

But in my code, we're extracting the true values and the predictions for a single class:

y_true_single_class = y_true[..., class_idx]
y_pred_single_class = y_pred[..., class_idx]

The shape will go from

y_true.shape = y_pred.shape = (None, <IMAGE_HEIGHT>, <IMAGE_WIDTH>, 4)

to

y_true_single_class.shape = y_pred_single_class.shape = (None, <IMAGE_HEIGHT>, <IMAGE_WIDTH>)

If you would keep axis=-1, that will refer to the last channel in the tensor, in our case <IMAGE_WIDTH>. By keeping it, you would've summed over the image width, something we don't want to do.

Hope it explains why I removed it.

EDIT:
I just saw that I didn't write that. When you don't supply an axis to K.sum(...), it will sum over all axises

@rohan19250
Copy link

@maxvfischer Got it! this is really helpful.

My Y_val is of shape (2880, 192, 192, 4). I am using the combined dice coefficient loss function for overall network, and just calculating individual dice coefficients.

The functions I defined based on your above code for individual classes:

from keras import backend as K
def dice_coef_necrotic(y_true, y_pred, epsilon=1e-6):
    intersection = K.sum(K.abs(y_true[:,:,:,1] * y_pred[:,:,:,1]))
    return (2. * intersection) / (K.sum(K.square(y_true[:,:,:,1])) + K.sum(K.square(y_pred[:,:,:,1])) + epsilon)

def dice_coef_edema(y_true, y_pred, epsilon=1e-6):
    intersection = K.sum(K.abs(y_true[:,:,:,2] * y_pred[:,:,:,2]))
    return (2. * intersection) / (K.sum(K.square(y_true[:,:,:,2])) + K.sum(K.square(y_pred[:,:,:,2])) + epsilon)
Epoch 1/40
408/408 [==============================] - 298s 729ms/step - loss: 0.0067 - dice_coef: 0.9933 - dice_coef_necrotic: 0.7299 - dice_coef_edema: 0.8579 - dice_coef_enhancingtumor: 0.8162 - val_loss: 0.0318 - val_dice_coef: 0.9682 - val_dice_coef_necrotic: 0.2566 - val_dice_coef_edema: 0.4313 - val_dice_coef_enhancingtumor: 0.3654
Epoch 2/40
408/408 [==============================] - 302s 740ms/step - loss: 0.0064 - dice_coef: 0.9936 - dice_coef_necrotic: 0.7478 - dice_coef_edema: 0.8655 - dice_coef_enhancingtumor: 0.8255 - val_loss: 0.0195 - val_dice_coef: 0.9805 - val_dice_coef_necrotic: 0.2540 - val_dice_coef_edema: 0.4496 - val_dice_coef_enhancingtumor: 0.3562
Epoch 3/40
408/408 [==============================] - 301s 739ms/step - loss: 0.0061 - dice_coef: 0.9939 - dice_coef_necrotic: 0.7633 - dice_coef_edema: 0.8716 - dice_coef_enhancingtumor: 0.8333 - val_loss: 0.0180 - val_dice_coef: 0.9820 - val_dice_coef_necrotic: 0.2647 - val_dice_coef_edema: 0.4883 - val_dice_coef_enhancingtumor: 0.3726

It seems the combined validation dice coefficient is good, but the individual class dice coefficients are not that good (shown first few epochs). Perhaps I should explore other loss functions and data augmentation (Y train ~13000 images)?

@maxvfischer
Copy link

maxvfischer commented Dec 10, 2020

@rohan19250 Impossible for me to answer by the information you've provided.

combined validation dice coefficient should be converging to something "good", because that's what you're optimizing. For how long did you train it? Have you plotted the individual dice coefficients over more epochs? Does it converges to something?

I would probably think more about your problem:

@rohan19250
Copy link

@maxvfischer - I have trained for 40 epochs(Adam optimizer,lr=1e-5). The individual dice coefficients converge to ~.29 for the first class, ~.46 for the second class, and ~.42 for the third class.
The combined validation dice coefficient converges to ~.98
Some of the details about the architecture of the U-net model.

image

Also, I am able to do predictions and see the ground truth and the predicted labels for some of the test images(which is a bit satisfying), but was concerned about the individual dice scores not good which definitely could be worked upon.

I will explore the other suggestions you provided.

@maxvfischer
Copy link

maxvfischer commented Dec 10, 2020

@rohan19250 For interpretability, you might want to use intersection over union/Jaccard Index (https://en.wikipedia.org/wiki/Jaccard_index) for each class as a metric instead of dice coefficient:

def iou_metric(class_idx: int, name: str = 'iou') -> Callable[[tf.Tensor, tf.Tensor], tf.Tensor]:
    def iou(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
        # Array with True where prob. is highest and False elsewhere
        y_pred_bool = (K.max(x=y_pred, axis=-1, keepdims=True) == y_pred)
        # Generate one-hot vector
        y_pred = tf.where(y_pred_bool, 1., 0.)

        # Extract single class to compute IoU over
        y_true_single_class = y_true[..., class_idx]
        y_pred_single_class = y_pred[..., class_idx]

        # Compute IoU
        intersection = K.sum(y_true_single_class * y_pred_single_class)
        union = K.sum(y_true_single_class) + K.sum(y_pred_single_class) - intersection

        # union = 0 means that we predicted nothing and the ground truth contained no labels.
        # This will be treated as a perfect IoU-score = 1.
        return K.switch(K.equal(union, 0.), 1., intersection / union)

    iou.__name__ = f"iou_{name}"  # Set name used to log metric
    return iou

Good luck!

@Tarandeep97
Copy link

Tarandeep97 commented Apr 17, 2022

@gattia

@lazyleaf, I just stumbled upon this. I am doing 3D segmentation on multiclass. I will definitely try out the proposed method and see how it works. However, I also have another solution that has worked for me in the past:

def dice_coef(y_true, y_pred):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_coef_multilabel(y_true, y_pred, numLabels=5):
    dice=0
    for index in range(numLabels):
        dice -= dice_coef(y_true[:,index,:,:,:], y_pred[:,index,:,:,:])
    return dice

This simply calculates the dice score for each individual label, and then sums them together, and includes the background. The best dice score you will ever get is equal to numLables*-1.0. When monitoring I always keep in mind that the dice for the background is almost always near 1.0.

Won't there been a weight for each label multiplied to dice_coef, weight that depicts the number of pixels for that label ?

@gattia
Copy link

gattia commented Apr 17, 2022

What you are describing is one version of the DSC that has been called generalized-dsc.

I think they inversely weighted based on the number of pixels labeled that class.

weight = 1/ sum(y_true[:,index,:,:,:])

dice -= weight * dice_coef(y_true[:,index,:,:,:], y_pred[:,index,:,:,:])

the above would replace the code in the for loop to inversely weight it based on the number of pixels labeled that particular class (in the ground truth).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests