Enhancement ? New metric for source separation, measuring separately bleed and fullness in separated audio #79

Open
jarredou opened this issue Oct 28, 2024 · 3 comments

Comments

@jarredou

jarredou commented Oct 28, 2024

Hi,

I've found a simple way to objectively measure bleed and fullness in music source separation. I think it could be useful: I haven't seen any existing objective metric that does this, and users often ask about it.

Here is the metric as code:

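import librosa
import numpy as np


def bleed_full(ref, est, sr=44100):
    # ref, est: reference and estimated waveforms; they must have the
    # same length so that the two spectrograms align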
    # STFT parameters
    n_fft = 4096
    hop_length = 1024
    n_mels = 512

    # Compute Mag STFTs
    D1 = np.abs(librosa.stft(ref, n_fft=n_fft, hop_length=hop_length))
    D2 = np.abs(librosa.stft(est, n_fft=n_fft, hop_length=hop_length))

    # Convert to mel spectrograms
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    S1_mel = np.dot(mel_basis, D1)
    S2_mel = np.dot(mel_basis, D2)

    # Convert to decibels
    S1_db = librosa.amplitude_to_db(S1_mel)
    S2_db = librosa.amplitude_to_db(S2_mel)
    
    # Calculate difference
    diff = S2_db - S1_db

    # Separate positive and negative differences
    positive_diff = diff[diff > 0]
    negative_diff = diff[diff < 0]

    # Calculate averages
    average_positive = np.mean(positive_diff) if len(positive_diff) > 0 else 0
    average_negative = np.mean(negative_diff) if len(negative_diff) > 0 else 0
    
    # Map to (0, 100]: 100 is the best score (no bleed / nothing missing)
    bleedness = 100 / (average_positive + 1)
    fullness = 100 / (-average_negative + 1)

    return bleedness, fullness
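
To help read the two scores, here is a minimal sketch of the final scaling step on its own. The helper name `deviation_to_score` is hypothetical (it is not part of the code above); it only reproduces the `100 / (x + 1)` mapping, where `x` is the mean dB deviation in one direction.

```python
def deviation_to_score(avg_db_deviation):
    """Map a mean dB deviation to a score in (0, 100], as in bleed_full.

    Hypothetical helper for illustration: the sign is dropped, which
    matches how bleed_full negates the (negative) fullness average.
    """
    return 100.0 / (abs(avg_db_deviation) + 1.0)


# A few reference points on the scale
for db in (0.0, 1.0, 3.0, 9.0):
    print(f"{db:4.1f} dB -> score {deviation_to_score(db):6.2f}")
```

So a score of 50 corresponds to a mean deviation of 1 dB, 25 to 3 dB, and 10 to 9 dB: the scale is hyperbolic, not linear, and compresses large deviations.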

I guess it could be adapted into losses, but I'm not a dev or scientist and I lack the knowledge to make it bulletproof. If it's worth it, you'd know better than me.

The same concept can be used to draw spectrograms, for example with bleed/positive values in red, missing content/negative values in blue, and perfect separation (0) in white:
image
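
One possible way to get that red/white/blue rendering, as a sketch only: map the signed dB difference to an RGB array with NumPy. The function name `diff_to_rgb` and the `max_db` display range are assumptions for illustration, not the code used for the picture above.

```python
import numpy as np


def diff_to_rgb(diff_db, max_db=20.0):
    """Map a signed dB difference map to RGB colours.

    Red = bleed (positive), blue = missing content (negative), white = 0.
    max_db is an assumed display range; larger deviations saturate.
    """
    x = np.clip(np.asarray(diff_db, dtype=float) / max_db, -1.0, 1.0)
    pos = np.clip(x, 0.0, None)    # excess energy in the estimate
    neg = np.clip(-x, 0.0, None)   # energy missing from the estimate
    rgb = np.empty(x.shape + (3,))
    rgb[..., 0] = 1.0 - neg        # red channel fades where content is missing
    rgb[..., 1] = 1.0 - pos - neg  # green channel fades in both directions
    rgb[..., 2] = 1.0 - pos        # blue channel fades where there is bleed
    return rgb


# Three bins: perfect match, strong bleed, strong loss
colours = diff_to_rgb(np.array([[0.0, 20.0, -20.0]]))
```

The resulting array has shape `(n_mels, n_frames, 3)` with values in [0, 1], so it can be handed to any image display or writer that accepts float RGB.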

@jarredou jarredou changed the title New metric for source separation, measuring separately bleed and fullness in separated audio Enhancement ? New metric for source separation, measuring separately bleed and fullness in separated audio Oct 29, 2024
@turian
Contributor

turian commented Nov 9, 2024

@jarredou I'm curious about this. So basically:

Instead of computing an L1 mel spectral distance, you separate it into two components:

  1. Bleed = anything ADDED to the target spectrogram
  2. -Fullness = anything REMOVED from the target spectrogram
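
The two-component reading above can be checked numerically. This is a sketch on random stand-in data (not real spectrograms), and `split_diff` is a hypothetical helper name: the L1 distance between two dB spectrograms is exactly the sum of the "added" part and the "removed" part.

```python
import numpy as np


def split_diff(ref_db, est_db):
    """Split est - ref into non-negative 'added' and 'removed' maps."""
    diff = est_db - ref_db
    added = np.clip(diff, 0.0, None)     # energy above the target (bleed)
    removed = np.clip(-diff, 0.0, None)  # energy below the target (missing)
    return added, removed


rng = np.random.default_rng(0)
ref_db = rng.uniform(-80.0, 0.0, size=(8, 16))              # stand-in for S1_db
est_db = ref_db + rng.normal(0.0, 3.0, size=ref_db.shape)   # stand-in for S2_db

added, removed = split_diff(ref_db, est_db)
l1 = np.abs(est_db - ref_db).sum()
assert np.isclose(l1, added.sum() + removed.sum())
```

Note that the metric in the first comment averages each part over its own non-zero bins and then rescales, so the two scores are not simply the two terms of the L1 sum.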

I see you do MSS work. I noted in the BS-Roformer paper that the authors wrote: "our model outputs gained more preference from musicians and educators than from music producers in the listening test of SDX23". To my ears, BS-Roformer models seem to have less bleed but also less fullness. I'd be curious whether you have any numbers to share. (cc @ZFTurbo)

@jarredou
Author

jarredou commented Nov 18, 2024

@turian Yeah, that's the simple idea behind the 2 metrics.

About the BS-Roformer quote: it's from the final paper of the SDX/MDX23 contest, https://arxiv.org/pdf/2308.06979

We don't have numbers comparing different neural network models. For now, the metrics have only been used to evaluate different fine-tuned versions built on top of Kimberley's Melband-Roformer model. The results are available here: https://docs.google.com/spreadsheets/d/1pPEJpu4tZjTkjPh_F5YjtIyHq8v0SxLnBydfUBUNlbI/edit and they were computed on the mvsep.com multisong eval dataset.

ZFTurbo added a torch version of the metric to his training script a few days ago.

@jarredou
Author

jarredou commented Nov 24, 2024

Little update: the metric was used as a loss to emphasize fullness on a vocals model, and it does a great job at that task, especially at extracting the reverb more fully (there is also more clarity in the vocal consonants in the high-frequency range):
The 1st pic is Kim's original model, the 2nd is the fine-tuned version emphasizing vocal fullness (at the cost of a slightly noisier separation):
kim fullness
unwa bigbeta5e fullness
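
How the loss was weighted is not stated in this thread. As a purely hypothetical sketch of "emphasizing fullness", one could penalize missing content more heavily than bleed. The function name and the weights `w_bleed` and `w_full` are assumptions, and NumPy is used here only to show the arithmetic; a real training loss would need the same operations in an autodiff framework.

```python
import numpy as np


def weighted_bleed_full_loss(ref_db, est_db, w_bleed=1.0, w_full=2.0):
    """Asymmetric dB-spectrogram loss (illustrative, lower is better).

    Unlike the bleed_full metric, both terms are averaged over ALL bins,
    which avoids empty selections. The weights are hypothetical.
    """
    diff = np.asarray(est_db, dtype=float) - np.asarray(ref_db, dtype=float)
    bleed_term = np.clip(diff, 0.0, None).mean()     # energy added to the target
    missing_term = np.clip(-diff, 0.0, None).mean()  # energy removed from the target
    return w_bleed * bleed_term + w_full * missing_term


ref = np.zeros((4, 4))
loss_when_missing = weighted_bleed_full_loss(ref, ref - 2.0)  # 2 dB too quiet
loss_when_bleeding = weighted_bleed_full_loss(ref, ref + 2.0)  # 2 dB too loud
```

With `w_full > w_bleed`, a 2 dB loss of content costs more than a 2 dB excess, which would push a model toward fuller but potentially noisier output, in line with the trade-off described above.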

(All these experiments are done inside the Audio Separation Discord community; invite: https://discord.gg/ndC4UmPZwZ)
