Questions about the data used for ML models training #5

Open
robertoimuno opened this issue Feb 23, 2024 · 0 comments

Hi,

I hope this message finds you well. My name is Roberto, and I have recently been reading your paper, which I found very interesting.

I have a few questions about how the NGS files containing antibodies categorized as binders and non-binders are used. The 'scripts/main.py' file of the repository contains the following lines of code:

```python
# Load non-binding sequences
ab_neg_files = [
    'mHER_H3_1_Ab.txt', 'mHER_H3_1_AgN.txt',
    'mHER_H3_2_Ab.txt', 'mHER_H3_2_AgN.txt',
    'mHER_H3_3_Ab.txt', 'mHER_H3_3_AgN.txt'
]
mHER_H3_AgNeg = load_input_data(ab_neg_files, Ag_class=0)
# Load binding sequences
ab_pos_files = [
    'mHER_H3_1_2Ag647.txt', 'mHER_H3_1_2Ag488.txt',
    'mHER_H3_2_2Ag647.txt', 'mHER_H3_2_2Ag488.txt',
    'mHER_H3_3_2Ag647.txt', 'mHER_H3_3_2Ag488.txt'
]
mHER_H3_AgPos = load_input_data(ab_pos_files, Ag_class=1)
```

If I understand correctly, these lines read the NGS-processed files for binders and non-binders from three distinct DMS enrichment rounds and merge them into two files: 'data/mHER_H3_AgNeg.csv' and 'data/mHER_H3_AgPos.csv'. In a later step, these files are concatenated and the class ratio is adjusted. During this adjustment a subset of non-binders is removed, and these sequences are later reintroduced incrementally into the training set.

This incremental change in class imbalance is applied only when benchmarking the different ML algorithms. In the later step, where only the CNN is trained, the training set is not incremented.
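
A minimal sketch of the procedure as described above, not the repository's actual code. It assumes the first training set is balanced and that held-out non-binders are added back in fixed-size steps. The function name `incremental_training_sets`, the `step_size` parameter, and the toy sequence names are all hypothetical.

```python
import random


def incremental_training_sets(binders, non_binders, step_size, seed=0):
    """Yield labeled training sets that gain non-binders step by step.

    Assumption: the first set is balanced (as many non-binders as binders);
    each later set adds up to `step_size` of the held-out non-binders.
    """
    if step_size <= 0:
        raise ValueError("step_size must be positive")

    rng = random.Random(seed)
    pool = list(non_binders)
    rng.shuffle(pool)  # pick the held-out subset at random

    n_neg = min(len(binders), len(pool))  # start from a balanced set
    while True:
        # Label binders as 1 and non-binders as 0, mirroring Ag_class above.
        yield [(seq, 1) for seq in binders] + [(seq, 0) for seq in pool[:n_neg]]
        if n_neg >= len(pool):
            break
        n_neg = min(n_neg + step_size, len(pool))


# Toy input: 2 binders and 5 non-binders, re-adding 2 non-binders per step.
toy_sets = list(
    incremental_training_sets(["B1", "B2"], ["N1", "N2", "N3", "N4", "N5"], step_size=2)
)
toy_sizes = [len(s) for s in toy_sets]
```

With the toy input, the non-binder count grows from 2 to 4 to 5 while the binders stay fixed, so each successive set is more imbalanced than the last.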

Looking into the incremented datasets used for model training, I noticed that every set except the one with a binders-to-non-binders ratio of 0.5 contains cases where the same CDR3H sequence appears with different labels. My guess is that this happens because an antibody can be classified as a binder in the results from one DMS enrichment round (for instance the first) but as a non-binder in another (for instance the last).

This is an image showing an example of two sequences in which this happens:

[image: two example sequences that appear with both labels]
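
The observation above can be checked with a few lines of plain Python. This is a minimal sketch that assumes the training set is available as (sequence, label) pairs. The function name `find_conflicting_labels` is hypothetical, and the toy sequences are made up for illustration, not taken from the dataset.

```python
from collections import defaultdict


def find_conflicting_labels(records):
    """Return {sequence: sorted labels} for sequences seen with more than one label."""
    labels_by_seq = defaultdict(set)
    for seq, label in records:
        labels_by_seq[seq].add(label)
    return {
        seq: sorted(labels)
        for seq, labels in labels_by_seq.items()
        if len(labels) > 1
    }


# Toy rows (made-up sequences): the first sequence carries both labels.
rows = [
    ("WGGDGFYAMD", 1),
    ("WGGDGFYAMD", 0),
    ("WSNAGFYEMD", 1),
    ("WRLDGFYTMD", 0),
]
conflicts = find_conflicting_labels(rows)
print(conflicts)
```

Running the same check on each incremented training set would show which sets contain sequences labeled as both binder and non-binder.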

- Is this observation correct? Does this labeling duality serve a specific purpose in the training process? If so, could you shed light on the benefits of adopting this approach when benchmarking the ML algorithms?

- Could you share your perspective on the advantages of incorporating data from different rounds of DMS enrichment in the training set?

Thank you in advance for your time and insights.
Kind regards,
Roberto
