Add Casanovo-DB Functionality #325

Open
wants to merge 67 commits into
base: dev
67 commits
258edb4
begin adding tests for annotate mode
VarunAnanth2003 Mar 27, 2024
30f5984
add basic test for annotate mode
VarunAnanth2003 Mar 29, 2024
186bc0f
added test case for annotate mode and modified method
VarunAnanth2003 Apr 9, 2024
a8f50f4
very rough sketch of db upgrade (untested)
VarunAnanth2003 Apr 11, 2024
dae9c8a
small upgrades to documentation
VarunAnanth2003 Apr 15, 2024
7f95ae5
better output formatting
VarunAnanth2003 Apr 15, 2024
278436b
all tests added
VarunAnanth2003 Apr 27, 2024
949ea93
remove minor debugging print statement
VarunAnanth2003 Apr 27, 2024
da5ef5e
Generate new screengrabs with rich-codex
github-actions[bot] Apr 27, 2024
53f6bec
remove excess info logs, add monkeypatch to tests
VarunAnanth2003 Apr 27, 2024
e2cbce8
Merge branch 'dev_db_search' of https://github.com/Noble-Lab/casanovo…
VarunAnanth2003 Apr 27, 2024
81aa073
mp fix
VarunAnanth2003 Apr 27, 2024
0ecbd80
fix line lengths and modify test
VarunAnanth2003 May 7, 2024
ee6638e
Generate new screengrabs with rich-codex
github-actions[bot] May 7, 2024
392ccaf
Merge branch 'dev' into dev_db_search
VarunAnanth2003 May 7, 2024
2d57513
justins requested fixes
VarunAnanth2003 May 7, 2024
3cfb795
added minor changes as requested by Wout
VarunAnanth2003 Jun 17, 2024
49f44ad
partial fixes requested by wout. Lots of subclassing removed
VarunAnanth2003 Jun 18, 2024
d967c42
documentation fixes and starting to cleanup batching code
VarunAnanth2003 Jun 18, 2024
ea1f97d
cleaned up on_predict_batch_end, TODOs for calc_mz
VarunAnanth2003 Jun 19, 2024
8825506
add proper calc_mz calculation with depthcharge
VarunAnanth2003 Jun 27, 2024
f25ace8
rough implementation
VarunAnanth2003 Jul 2, 2024
f7dfbc8
tested implementation of db search
VarunAnanth2003 Jul 3, 2024
e2ce317
fix for issue with 0 candidates
VarunAnanth2003 Jul 3, 2024
5ef27e0
minor fixes added
VarunAnanth2003 Jul 3, 2024
5f0675f
reordered and renamed variables for consistency
VarunAnanth2003 Jul 3, 2024
b4fd8ff
casanovo-db full working version with code simplification
VarunAnanth2003 Jul 4, 2024
35ba7d4
Generate new screengrabs with rich-codex
github-actions[bot] Jul 4, 2024
f8a1a89
fix batching issues
VarunAnanth2003 Jul 8, 2024
fe8794d
Merge branch 'db_search_full' of https://github.com/Noble-Lab/casanov…
VarunAnanth2003 Jul 8, 2024
7cb8e14
small fixes regarding documentation, import syntax, etc.
VarunAnanth2003 Aug 12, 2024
b2f08ac
add proteindatabase
VarunAnanth2003 Aug 20, 2024
3d0b0b9
Generate new screengrabs with rich-codex
github-actions[bot] Aug 20, 2024
812226e
finish proteindatabase
VarunAnanth2003 Aug 21, 2024
df68c1d
Merge branch 'db_search_full' of https://github.com/Noble-Lab/casanov…
VarunAnanth2003 Aug 21, 2024
cfd39e8
all comments addressed
VarunAnanth2003 Aug 23, 2024
106c4ec
new comments addressed
VarunAnanth2003 Aug 28, 2024
0dfdb2c
final adjustments added
VarunAnanth2003 Sep 3, 2024
4a5b238
minor changes regarding formatting and small efficiency boosts
VarunAnanth2003 Sep 3, 2024
4352bbd
changes before reformatting config
VarunAnanth2003 Sep 3, 2024
ddff67f
replace all occurences of "max_length" with "max_peptide_len"
VarunAnanth2003 Sep 3, 2024
a3548d0
added nonspecific digestion
VarunAnanth2003 Sep 3, 2024
e8d4682
minor comments
VarunAnanth2003 Sep 13, 2024
68b6926
full branch comments addressed
VarunAnanth2003 Sep 13, 2024
8153ffa
Merge pull request #352 from Noble-Lab/db_search_full
VarunAnanth2003 Sep 13, 2024
dc7d5f4
Merge branch 'dev' into dev_db_search
VarunAnanth2003 Sep 13, 2024
e8c9c7d
Generate new screengrabs with rich-codex
github-actions[bot] Sep 13, 2024
e474eee
updated and fixed failed tests
VarunAnanth2003 Sep 13, 2024
4e696b4
add mztab validation to dbsearch test
VarunAnanth2003 Sep 14, 2024
9a24817
Merge branch 'dev' into dev_db_search
VarunAnanth2003 Sep 17, 2024
4655452
lint fix
VarunAnanth2003 Sep 17, 2024
5e1b9d7
fix integration test
VarunAnanth2003 Sep 17, 2024
4d6b726
fix unit tests
VarunAnanth2003 Sep 17, 2024
5d82e1f
Merge branch 'dev' into dev_db_search
VarunAnanth2003 Sep 20, 2024
e7f0fdc
force fix test
VarunAnanth2003 Sep 21, 2024
813fac0
clean up test_digest_fasta_enzyme
VarunAnanth2003 Sep 21, 2024
310c3fd
adjust test_digest_fasta_mods
VarunAnanth2003 Sep 21, 2024
1651fd5
Merge branch 'dev' into dev_db_search
VarunAnanth2003 Oct 2, 2024
775def7
allows top_match filtering for casanovo-db
VarunAnanth2003 Oct 2, 2024
e35c60d
change default value for protein value in PepSpecMatch
VarunAnanth2003 Oct 2, 2024
79cba59
reverse issues with decoder
VarunAnanth2003 Oct 2, 2024
c9eb8b7
update test and remove logging statement
VarunAnanth2003 Oct 2, 2024
2b6198b
Merge branch 'dev' into dev_db_search
VarunAnanth2003 Nov 3, 2024
68e67e8
db_utils fixes
VarunAnanth2003 Nov 3, 2024
d01dd7f
updates to dataloaders, model_runner, and model.py
VarunAnanth2003 Nov 3, 2024
d581944
near final changes for all but db_utils
VarunAnanth2003 Nov 3, 2024
092fa2a
line length fixes
VarunAnanth2003 Nov 3, 2024
63 changes: 61 additions & 2 deletions casanovo/casanovo.py
@@ -62,8 +62,9 @@ def __init__(self, *args, **kwargs) -> None:
click.Option(
("-m", "--model"),
help="""
Either the model weights (.ckpt file) or a URL pointing to the model weights
file. If not provided, Casanovo will try to download the latest release automatically.
Either the model weights (.ckpt file) or a URL pointing to
the model weights file. If not provided,
Casanovo will try to download the latest release automatically.
""",
),
click.Option(
@@ -201,6 +202,64 @@ def sequence(
)


@main.command(cls=_SharedParams)
@click.argument(
    "peak_path",
    required=True,
    nargs=-1,
    type=click.Path(exists=True, dir_okay=False),
)
@click.argument(
    "fasta_path",
    required=True,
    nargs=1,
    type=click.Path(exists=True, dir_okay=False),
)
def db_search(
    peak_path: Tuple[str],
    fasta_path: str,
    model: Optional[str],
    config: Optional[str],
    output_dir: Optional[str],
    output_root: Optional[str],
    verbosity: str,
    force_overwrite: bool,
) -> None:
    """Perform a database search on MS/MS data using Casanovo-DB.

    PEAK_PATH must be one or more mzML, mzXML, or MGF files.
    FASTA_PATH must be one FASTA file.
    """
    output_path, output_root_name = _setup_output(
        output_dir, output_root, force_overwrite, verbosity
    )
    utils.check_dir_file_exists(output_path, f"{output_root}.mztab")
    config, model = setup_model(
        model, config, output_path, output_root_name, False
    )
    start_time = time.time()
    with ModelRunner(
        config,
        model,
        output_path,
        output_root_name if output_root is not None else None,
        False,
    ) as runner:
        logger.info("Performing database search on:")
        for peak_file in peak_path:
            logger.info("  %s", peak_file)

        logger.info("Using the following FASTA file:")
        logger.info("  %s", fasta_path)

        runner.db_search(
            peak_path,
            fasta_path,
            str((output_path / output_root).with_suffix(".mztab")),
        )
        utils.log_run_report(start_time=start_time, end_time=time.time())
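
One detail of the output-path construction above may be worth checking: `with_suffix` replaces an existing suffix rather than appending one, so an output root that contains a dot would lose its last component. The sketch below shows this using only the standard library. `mztab_path` and `mztab_path_append` are hypothetical helpers written for illustration; they are not part of Casanovo.

```python
from pathlib import PurePosixPath


def mztab_path(output_dir: str, output_root: str) -> PurePosixPath:
    """Mirror the pattern in the diff: join the parts, then call with_suffix."""
    return (PurePosixPath(output_dir) / output_root).with_suffix(".mztab")


def mztab_path_append(output_dir: str, output_root: str) -> PurePosixPath:
    """Hypothetical alternative: append the extension to the full root name."""
    return PurePosixPath(output_dir) / f"{output_root}.mztab"


print(mztab_path("results", "sample"))            # results/sample.mztab
print(mztab_path("results", "sample.v2"))         # results/sample.mztab
print(mztab_path_append("results", "sample.v2"))  # results/sample.v2.mztab
```

Whether this matters depends on which output root names the project intends to support; the existence check earlier in the function builds the file name with an f-string, so the two can disagree for dotted names.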


@main.command(cls=_SharedParams)
@click.argument(
"train_peak_path",
3 changes: 2 additions & 1 deletion casanovo/config.py
@@ -18,6 +18,7 @@
_config_deprecated = dict(
    every_n_train_steps="val_check_interval",
    max_iters="cosine_schedule_period_iters",
    max_length="max_peptide_len",
    save_top_k=None,
    model_save_folder_path=None,
)
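
A mapping like this is typically consulted when a user-supplied config is loaded, so that old keys keep working. The following is a minimal sketch of that idea, not the project's actual loading code: `migrate_config` is a hypothetical function, and the assumption that a `None` target means "option removed, ignore it" is inferred from the mapping alone.

```python
import warnings
from typing import Any, Dict, Optional

# Copied from the diff: old option name -> new option name (None = removed).
DEPRECATED: Dict[str, Optional[str]] = {
    "every_n_train_steps": "val_check_interval",
    "max_iters": "cosine_schedule_period_iters",
    "max_length": "max_peptide_len",
    "save_top_k": None,
    "model_save_folder_path": None,
}


def migrate_config(user_config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of user_config with deprecated keys renamed or dropped."""
    migrated: Dict[str, Any] = {}
    for key, value in user_config.items():
        if key in DEPRECATED:
            new_key = DEPRECATED[key]
            if new_key is None:
                # Assumed semantics: the option no longer exists, so skip it.
                warnings.warn(
                    f"Config option '{key}' is deprecated and ignored",
                    DeprecationWarning,
                )
                continue
            warnings.warn(
                f"Config option '{key}' is deprecated; use '{new_key}'",
                DeprecationWarning,
            )
            key = new_key
        migrated[key] = value
    return migrated
```

With this sketch, a config containing `max_length: 100` would be read as `max_peptide_len: 100`, which matches the rename made throughout this pull request.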
@@ -61,7 +62,7 @@ class Config:
n_layers=int,
dropout=float,
dim_intensity=int,
max_length=int,
max_peptide_len=int,
residues=dict,
n_log=int,
tb_summarywriter=bool,
50 changes: 43 additions & 7 deletions casanovo/config.yaml
@@ -5,20 +5,22 @@

###
# The following parameters can be modified when running inference or when
# fine-tuning an existing Casanovo model.
# fine-tuning an existing Casanovo model. They also affect database search
# parameters when running Casanovo in DB-search mode.
###

# Max absolute difference allowed with respect to observed precursor m/z.
# Predictions outside the tolerance range are assigned a negative peptide score.
# denovo: Predictions outside the tolerance range are assigned a negative peptide score.
# db-search: Select candidate peptides within the specified precursor m/z tolerance.
precursor_mass_tol: 50 # ppm
# Isotopes to consider when comparing predicted and observed precursor m/z's.
isotope_error_range: [0, 1]
# The minimum length of predicted peptides.
# The minimum length of considered peptides.
min_peptide_len: 6
# The maximum length of considered peptides.
max_peptide_len: 100
# Number of spectra in one inference batch.
predict_batch_size: 1024
# Number of beams used in beam search.
n_beams: 1
# Number of PSMs for each spectrum.
top_match: 1
# The hardware accelerator to use. Must be one of:
@@ -29,6 +31,42 @@ accelerator: "auto"
# number will be automatically selected based on the chosen accelerator.
devices:
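
To make the interaction between `precursor_mass_tol` and `isotope_error_range` concrete, here is a small sketch of a ppm tolerance check. It assumes that the isotope range is inclusive, that each isotope shifts the observed m/z by roughly 1.00335 Da divided by the charge, and that the error is normalized by the observed m/z. The function names are hypothetical and the exact formula used in this pull request is not shown in the diff.

```python
from typing import Tuple

# Approximate mass difference between 13C and 12C, in Da.
ISOTOPE_SPACING = 1.00335


def ppm_error(calc_mz: float, obs_mz: float, charge: int, isotope: int = 0) -> float:
    """Mass error in ppm after correcting the observed m/z for an isotope offset."""
    corrected_obs = obs_mz - isotope * ISOTOPE_SPACING / charge
    return (calc_mz - corrected_obs) / obs_mz * 1e6


def within_tolerance(
    calc_mz: float,
    obs_mz: float,
    charge: int,
    tol_ppm: float = 50.0,
    isotope_error_range: Tuple[int, int] = (0, 1),
) -> bool:
    """True if any isotope offset in the inclusive range brings the error within tol_ppm."""
    low, high = isotope_error_range
    return any(
        abs(ppm_error(calc_mz, obs_mz, charge, isotope)) <= tol_ppm
        for isotope in range(low, high + 1)
    )
```

Under these assumptions, a candidate peptide whose calculated m/z is 500.0 matches an observed precursor at 500.01 (about 20 ppm off) but not one at 500.1, unless an isotope offset accounts for the difference.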


###
# The following parameters are unique to Casanovo's inference/finetuning mode.
###

# Number of beams used in beam search.
n_beams: 1


###
# The following parameters are unique to Casanovo's database search mode.
###

# Enzyme for in silico digestion, used to generate candidate peptides.
# See pyteomics.parser.expasy_rules for valid enzymes.
# Can also take a regex to specify custom digestion rules.
enzyme: "trypsin"
# Digestion type for candidate peptide generation.
# full: standard digestion.
# semi: Include products of semi-specific cleavage.
# non-specific: Include products of non-specific cleavage.
digestion: "full"
# Number of allowed missed cleavages when digesting protein.
missed_cleavages: 0
# Maximum number of variable amino acid modifications per peptide.
# None generates all possible isoforms as candidates.
max_mods: 1
# Select which modifications from the vocabulary can be used in candidate creation.
# Format: Comma-separated list of "aa:mod_residue",
# where aa is a standard amino acid (or "nterm" for an N-terminal mod)
# and mod_residue is a key from the "residues" dictionary.
# Example: "M:M+15.995,nterm:+43.006"
allowed_fixed_mods: "C:C+57.021"
allowed_var_mods: "M:M+15.995,N:N+0.984,Q:Q+0.984,nterm:+42.011,nterm:+43.006,nterm:-17.027,nterm:+43.006-17.027"
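
The database search parameters above can be illustrated with a short sketch: one function parses the `"aa:mod_residue"` list format, and another performs a simplified in silico tryptic digestion with missed cleavages and length filtering. Both functions are hypothetical and use only the standard library. The cleavage rule here (cut after K or R unless followed by P) is a simplification of the trypsin rule in `pyteomics.parser.expasy_rules`, and the pull request's own implementation may differ.

```python
import re
from typing import Dict, List


def parse_mods(spec: str) -> Dict[str, List[str]]:
    """Parse 'aa:mod_residue,aa:mod_residue,...' into {aa: [mod_residue, ...]}."""
    mods: Dict[str, List[str]] = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the first colon only: "nterm:+43.006" -> ("nterm", "+43.006").
        aa, _, mod_residue = entry.partition(":")
        if not aa or not mod_residue:
            raise ValueError(f"Malformed modification entry: {entry!r}")
        mods.setdefault(aa, []).append(mod_residue)
    return mods


def digest(
    sequence: str,
    missed_cleavages: int = 0,
    min_peptide_len: int = 6,
    max_peptide_len: int = 100,
) -> List[str]:
    """Fully specific digestion with a simplified trypsin rule."""
    # Cleavage sites: positions just after K or R when the next residue is not P.
    sites = [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    bounds = [0] + [s for s in sites if s < len(sequence)] + [len(sequence)]
    peptides = set()
    for i in range(len(bounds) - 1):
        # Each extra boundary spanned corresponds to one missed cleavage.
        last = min(i + 2 + missed_cleavages, len(bounds))
        for j in range(i + 1, last):
            peptide = sequence[bounds[i]:bounds[j]]
            if min_peptide_len <= len(peptide) <= max_peptide_len:
                peptides.add(peptide)
    return sorted(peptides)
```

For example, `parse_mods("M:M+15.995,nterm:+43.006")` groups the modified residues by the amino acid they apply to, and `digest("AAKGGRPCCKDD", missed_cleavages=1, min_peptide_len=1)` yields the three fully cleaved peptides plus the two products that span one missed site.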


###
# The following parameters should only be modified if you are training a new
# Casanovo model from scratch.
@@ -77,8 +115,6 @@ dropout: 0.0
# Number of dimensions to use for encoding peak intensity.
# Projected up to `dim_model` by default and summed with the peak m/z encoding.
dim_intensity:
# Max decoded peptide length.
max_length: 100
# The number of iterations for the linear warm-up of the learning rate.
warmup_iters: 100_000
# The number of iterations for the cosine half period of the learning rate.