Support direct monomer A3M inputs for MSA pairing #313

Open
taivu1998 wants to merge 1 commit into bytedance:main from taivu1998:tdv/issue-191-monomer-msa

@taivu1998

Summary

  • Adds direct monomer A3M input support through the new monomerMsaPath JSON field.
  • Splits each provided monomer A3M into Protenix-compatible unpaired and pairable MSA streams in memory, so users can reuse ColabFold-style monomer MSAs without a separate preprocessing step.
  • Preserves existing split MSA path precedence and only uses monomerMsaPath when explicit unpairedMsaPath or pairedMsaPath inputs are absent.
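
The precedence described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `choose_msa_source`, its return labels, and the injectable `exists` check are hypothetical, and the sketch assumes that any supplied split path wins without further validation.

```python
import os
from typing import Callable, Mapping


def choose_msa_source(
    chain: Mapping[str, str],
    exists: Callable[[str], bool] = os.path.isfile,
) -> str:
    """Return "split", "monomer", or "search" for one chain entry.

    Hypothetical helper: the field names come from the PR description,
    everything else is an assumption made for illustration.
    """
    # Explicit split paths take precedence over monomerMsaPath.
    if chain.get("unpairedMsaPath") or chain.get("pairedMsaPath"):
        return "split"
    # A monomer A3M is used only if the path is given and the file is present.
    monomer = chain.get("monomerMsaPath")
    if monomer and exists(monomer):
        return "monomer"
    # Otherwise fall back to running an MSA search.
    return "search"
```

Passing `exists` as a parameter keeps the decision testable without touching the filesystem; the real runner may well check paths differently.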

Problem

Issue #191 asks for support for users who already have separate monomer A3M files and want Protenix to use them directly. Before this change, inference accepted explicit unpaired and paired MSA paths, but a single monomer A3M could not be supplied as a direct input source for both regular MSA features and species-aware pairing.

Changes

  • Adds protenix.data.msa.msa_input with a small parser/splitter for monomer A3M inputs.
  • Extracts all query-length-compatible records into the unpaired stream, while adding records with extractable species or taxonomy IDs to the paired stream.
  • Extends species ID extraction to understand existing Protenix/UniProt-style headers plus explicit taxonomy tags such as OX=, TaxID=, taxid=, and Tax=.
  • Updates inference MSA featurization to consume monomerMsaPath only when explicit split paths are not provided, with warnings for skipped invalid rows or monomer MSAs that cannot contribute pairable rows.
  • Updates runner MSA-search detection so valid monomerMsaPath inputs skip redundant MSA search, while missing monomer paths fall back to search unless split paths are supplied.
  • Documents the new input field and adds an example multi-chain JSON plus tiny example A3M files.
  • Adds focused unit coverage for parsing, taxonomy extraction, precedence, and inference feature generation.
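
For readers unfamiliar with the split, here is a minimal sketch of the idea behind the parser/splitter and the taxonomy-tag extraction. It is not the contents of protenix.data.msa.msa_input: the names `parse_a3m`, `aligned_length`, `extract_tax_id`, and `split_monomer_a3m` are hypothetical, the regular expression covers only the explicit tags listed above (not the Protenix/UniProt-style headers), and treating the first record as the query is an assumption.

```python
import re
from typing import List, Optional, Tuple

Record = Tuple[str, str]  # (header without ">", aligned sequence)

# Explicit taxonomy tags named in the PR description; the pattern is an assumption.
_TAX_TAG = re.compile(r"\b(?:OX|TaxID|taxid|Tax)=(\w+)")


def extract_tax_id(header: str) -> Optional[str]:
    """Return the value of the first taxonomy tag in the header, if any."""
    match = _TAX_TAG.search(header)
    return match.group(1) if match else None


def aligned_length(seq: str) -> int:
    """Count alignment columns: in A3M, lowercase letters are insertions."""
    return sum(1 for ch in seq if not ch.islower())


def parse_a3m(text: str) -> List[Record]:
    """Parse A3M text into (header, sequence) records, skipping '#' lines."""
    records: List[Record] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_monomer_a3m(text: str) -> Tuple[List[Record], List[Record]]:
    """Split one monomer A3M into (unpaired, paired) record lists."""
    records = parse_a3m(text)
    if not records:
        return [], []
    query_len = aligned_length(records[0][1])  # assumes the first record is the query
    unpaired: List[Record] = []
    paired: List[Record] = []
    for header, seq in records:
        if aligned_length(seq) != query_len:
            continue  # row does not match the query length, so skip it
        unpaired.append((header, seq))
        if extract_tax_id(header) is not None:
            paired.append((header, seq))
    return unpaired, paired
```

In this sketch a row without a recognizable tag still contributes to the unpaired stream, which mirrors the behavior described in the list above; how the actual implementation treats the query row in the paired stream is not stated here.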

Validation

  • uvx ruff check protenix/data/msa/msa_input.py protenix/data/msa/msa_utils.py protenix/data/msa/msa_featurizer.py runner/msa_search.py tests/test_msa_input.py tests/test_inference_monomer_msa.py
  • uvx ruff format --check protenix/data/msa/msa_input.py protenix/data/msa/msa_utils.py protenix/data/msa/msa_featurizer.py runner/msa_search.py tests/test_msa_input.py tests/test_inference_monomer_msa.py
  • python3 -m py_compile protenix/data/msa/msa_input.py protenix/data/msa/msa_utils.py protenix/data/msa/msa_featurizer.py runner/msa_search.py tests/test_msa_input.py tests/test_inference_monomer_msa.py
  • uv run --python 3.11 --with pytest --with numpy --with pandas --with rdkit --with biotite pytest tests/test_msa_encoding.py tests/test_msa_input.py tests/test_inference_monomer_msa.py
  • git diff --cached --check
  • python3 -m json.tool examples/example_monomer_msa_path.json

Addresses #191.

@taivu1998 taivu1998 marked this pull request as ready for review May 11, 2026 03:45