Merge branch 'develop'
hollenstein committed Apr 19, 2024
2 parents e9e1ff4 + c8d25e6 commit 20dfce5
Showing 19 changed files with 717 additions and 54 deletions.
32 changes: 32 additions & 0 deletions .github/workflows/python-package.yml
@@ -0,0 +1,32 @@
# This workflow will install the profasta package and its dependencies and run tests with a variety of Python versions
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Python package

on:
  push:
    branches: ["main", "develop"]
  pull_request:

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]

    steps:
    - uses: actions/[email protected]
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/[email protected]
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install .[tests]
    - name: Test with pytest
      run: |
        python -m pytest
4 changes: 4 additions & 0 deletions .gitignore
@@ -3,6 +3,7 @@ __pycache__/
*.py[cod]
*$py.class


# C extensions
*.so

@@ -104,6 +105,9 @@ venv.bak/
### VisualStudioCode ###
.vscode/

### visual studio ###
.vs/

# Local History for Visual Studio Code
.history/

22 changes: 22 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,22 @@
# Changelog

----------------------------------------------------------------------------------------

## Version [0.0.5]
Released: 2024-04-19

### Added
- Added an option to `db.ProteinDatabase.add_fasta` that allows skipping entries whose headers could not be parsed, instead of raising a `ValueError`. (Suggested by @xeniorn)
- Added `keys`, `values`, and `items` methods to `db.ProteinDatabase` to allow more convenient iteration over the database's entries.
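
The two additions above can be pictured with a small stand-in class. This is only a sketch of the described behavior, not the ProFASTA implementation: `TinyDatabase`, `parse_header`, `add_records`, and the `skip_invalid` flag name are hypothetical and chosen for illustration.

```python
def parse_header(header: str) -> str:
    """Hypothetical parser: return the identifier from 'db|identifier|entry_name ...'."""
    parts = header.split("|")
    if len(parts) < 3 or not parts[1]:
        raise ValueError(f"Header could not be parsed: {header!r}")
    return parts[1]


class TinyDatabase:
    """Hypothetical stand-in for a protein database keyed by identifier."""

    def __init__(self) -> None:
        self._entries: dict[str, str] = {}

    def add_records(
        self, records: list[tuple[str, str]], skip_invalid: bool = False
    ) -> None:
        """Add (header, sequence) pairs; optionally skip unparseable headers."""
        for header, sequence in records:
            try:
                identifier = parse_header(header)
            except ValueError:
                if skip_invalid:
                    continue  # silently drop the entry instead of failing
                raise
            self._entries[identifier] = sequence

    # Dictionary-style views for convenient iteration.
    def keys(self):
        return self._entries.keys()

    def values(self):
        return self._entries.values()

    def items(self):
        return self._entries.items()


db = TinyDatabase()
db.add_records(
    [
        ("sp|O75385|ULK1_HUMAN Serine/threonine-protein kinase ULK1", "MEPG"),
        ("unparseable header", "AAA"),
    ],
    skip_invalid=True,
)
print(list(db.keys()))  # ['O75385']
```

Without `skip_invalid=True`, the second record in this sketch would raise a `ValueError`, which mirrors the previous behavior described in the entry.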

### Changed
- Made `decoy.reverse_sequence` a private function.
- Renamed the protocol classes `HeaderParser` and `HeaderWriter` to `AbstractHeaderParser` and `AbstractHeaderWriter` to be consistent with the naming of the other abstract classes. (Suggested by @xeniorn)

### Fixed
- Parsing a FASTA file returned invalid protein sequences when the sequence contained a terminal `*` character or lowercase letters. Terminal `*` characters are now removed from the sequence and the sequence is capitalized. (Contributed by @xeniorn)
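
The normalization described in this fix can be sketched in a few lines. The function name `normalize_sequence` is hypothetical and the snippet is an illustration of the stated behavior, not the code from the ProFASTA parser.

```python
def normalize_sequence(raw: str) -> str:
    """Drop trailing '*' characters and convert the sequence to uppercase."""
    # rstrip("*") only removes asterisks at the end; internal ones are kept.
    return raw.strip().rstrip("*").upper()


print(normalize_sequence("mepgk*"))  # MEPGK
```

Only terminal `*` characters are touched in this sketch, so a `*` inside a sequence would still be visible for later validation.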

### Chores
- Added a GitHub Actions CI workflow for automated testing. (Contributed by @xeniorn)
- Minor corrections and additions to some docstrings.
- Added a Jupyter notebook containing usage examples for the ProFASTA library.
12 changes: 10 additions & 2 deletions README.md
@@ -1,5 +1,8 @@
# ProFASTA
[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip)
![Python Version from PEP 621 TOML](https://img.shields.io/python/required-version-toml?tomlFilePath=https%3A%2F%2Fraw.githubusercontent.com%2Fhollenstein%2Fprofasta%2Fmain%2Fpyproject.toml)
[![pypi](https://img.shields.io/pypi/v/profasta)](https://pypi.org/project/profasta)
[![unit-tests](https://github.com/hollenstein/profasta/actions/workflows/python-package.yml/badge.svg?branch=main)](https://github.com/hollenstein/profasta/actions/workflows/python-package.yml)

## Introduction
ProFASTA is a Python library for working with FASTA files containing protein records. Unlike other packages, ProFASTA prioritizes simplicity, while aiming to provide a set of useful features required in the field of mass spectrometry-based proteomics.
@@ -21,14 +24,16 @@ The following code snippet shows how to import a FASTA file containing UniProt p
```python
>>> import profasta
>>>
>>> fasta_path = "./example_data/uniprot_hsapiens_10entries.fasta"
>>> fasta_path = "./examples/uniprot_hsapiens_10entries.fasta"
>>> db = profasta.db.ProteinDatabase()
>>> db.add_fasta(fasta_path, header_parser="uniprot")
>>> protein_record = db["O75385"]
>>> print(protein_record.header_fields["gene_name"])
ULK1
```

For more examples of how to use the ProFASTA library, please refer to the [code snippets](examples/code_snippets.ipynb) Jupyter notebook.

## Requirements
Python >= 3.9

@@ -53,7 +58,7 @@ pip uninstall profasta
- [x] built-in parser for uniprot format
- [x] allow user defined parser
- [x] write FASTA file
-[x] allow custom FASTA header generation
- [x] allow custom FASTA header generation

**Additional features**
- [x] read multiple FASTA files and write a combined file
@@ -62,3 +67,6 @@ pip uninstall profasta
- [x] add decoy protein records to an existing FASTA file
- [ ] validate FASTA file / FASTA records

## Contributors

- Juraj Ahel - [@xeniorn](https://github.com/xeniorn)
172 changes: 172 additions & 0 deletions examples/code_snippets.ipynb
@@ -0,0 +1,172 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "273cf753",
"metadata": {},
"source": [
"# Code snippets for working with the ProFASTA library"
]
},
{
"cell_type": "markdown",
"id": "8d8a7af6",
"metadata": {},
"source": [
"## Removing invalid characters from imported protein sequences"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "830d37b9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MEPG\n"
]
}
],
"source": [
"import profasta\n",
"\n",
"\n",
"def cleanup_protein_sequences(\n",
" db: profasta.ProteinDatabase, alphabet=\"ABCDEFGHIJKLMNOPQRSTUVWXYZ\"\n",
" ) -> None:\n",
" \"\"\"Remove non-alphabet characters from protein sequences in the ProteinDatabase.\n",
" \n",
" Args:\n",
" db: A profasta.ProteinDatabase instance.\n",
" alphabet: List of characters that are allowed in the protein entry sequences.\n",
" \"\"\"\n",
" for entry in db.values(): \n",
" entry.sequence = \"\".join([aa for aa in entry.sequence if aa in alphabet])\n",
"\n",
"\n",
"fasta_path = \"./uniprot_hsapiens_10entries.fasta\"\n",
"db = profasta.db.ProteinDatabase()\n",
"db.add_fasta(fasta_path, header_parser=\"uniprot\")\n",
"db[\"O75385\"].sequence = \"MEPG_-+123\"\n",
"cleanup_protein_sequences(db)\n",
"\n",
"print(db[\"O75385\"].sequence)"
]
},
{
"cell_type": "markdown",
"id": "0c5bea99",
"metadata": {},
"source": [
"## Converting FASTA headers into a UniProt-like format"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "1996bde3",
"metadata": {},
"outputs": [],
"source": [
"import profasta\n",
"import profasta.parser\n",
"\n",
"\n",
"class CustomHeaderParser:\n",
" \"\"\"Custom header parser.\"\"\"\n",
"\n",
" @classmethod\n",
" def parse(cls, header: str) -> profasta.parser.ParsedHeader:\n",
" \"\"\"Parse a FASTA header string into a ParsedHeader object.\n",
" \n",
" Header format example:\n",
" >ProteinID hypothetical protein name\n",
" \"\"\"\n",
" split_header = header.split(maxsplit=1)\n",
" _id = split_header[0]\n",
"\n",
" fields = {\n",
" \"db\": \"xx\",\n",
" \"identifier\": _id,\n",
" \"entry_name\": f\"{_id}_CUSTOM\",\n",
" \"gene_name\": _id,\n",
" }\n",
" if len(split_header) > 1:\n",
" fields[\"protein_name\"] = split_header[1]\n",
" return profasta.parser.ParsedHeader(_id, header, fields)\n",
"\n",
"# Register the custom header parser so that it can be used by the ProteinDatabase.\n",
"profasta.parser.register_parser(\"custom_parser\", CustomHeaderParser)\n",
"\n",
"fasta_path = \"./custom_header_format.fasta\"\n",
"converted_fasta_path = \"./custom_header_format.uniprot-like.fasta\"\n",
"protein_db = profasta.ProteinDatabase()\n",
"\n",
"# Specify the custom header parser to use for adding the FASTA file.\n",
"protein_db.add_fasta(fasta_path, header_parser=\"custom_parser\")\n",
"\n",
"# Write the ProteinDatabase to a new FASTA file using the uniprot-like header writer.\n",
"protein_db.write_fasta(converted_fasta_path, header_writer=\"uniprot_like\")"
]
},
{
"cell_type": "markdown",
"id": "697f8065",
"metadata": {},
"source": [
"## Create a combined FASTA file with added decoy entries\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "bc15636a",
"metadata": {},
"outputs": [],
"source": [
"import profasta\n",
"\n",
"fasta_path = \"./uniprot_hsapiens_10entries.fasta\"\n",
"decoy_fasta_path = \"./uniprot_hsapiens_10entries_DECOY.fasta\"\n",
"\n",
"# Import the FASTA file\n",
"db = profasta.db.ProteinDatabase()\n",
"db.add_fasta(fasta_path, header_parser=\"uniprot\")\n",
"\n",
"# Create the new FASTA file and write the original entries to it.\n",
"db.write_fasta(decoy_fasta_path, header_writer=\"uniprot\")\n",
"\n",
"# Create a decoy database from the original database, containing reversed sequences.\n",
"decoy_db = profasta.create_decoy_db(db, keep_nterm_methionine=True)\n",
"\n",
"# Append the decoy entries to the new FASTA file.\n",
"decoy_db.write_fasta(decoy_fasta_path, header_writer=\"decoy\", append=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
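
The last notebook cell creates decoy entries with reversed sequences. A minimal sketch of that idea follows. It assumes that `keep_nterm_methionine=True` means a leading methionine stays in place while the rest of the sequence is reversed; that reading is an assumption, and the function name `make_reversed_decoy_sequence` is hypothetical, not part of ProFASTA.

```python
def make_reversed_decoy_sequence(
    sequence: str, keep_nterm_methionine: bool = True
) -> str:
    """Reverse a protein sequence, optionally keeping a leading 'M' in place."""
    # Assumption: only an N-terminal methionine is protected from reversal.
    if keep_nterm_methionine and sequence.startswith("M"):
        return "M" + sequence[1:][::-1]
    return sequence[::-1]


print(make_reversed_decoy_sequence("MEPGK"))  # MKGPE
print(make_reversed_decoy_sequence("MEPGK", keep_nterm_methionine=False))  # KGPEM
```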
3 changes: 3 additions & 0 deletions examples/custom_header_format.fasta
@@ -0,0 +1,3 @@
>ProteinID hypothetical protein name
MAWTPLFLFLLTCCPGSNSQAVVTQEPSLTVSPGGTVTLTCGSSTGAVTSGHYPYWFQQK
PGQAPRTLIYDTSNKHSWTPARFSGSLLGGKAALTLLGAQPEDEAEYYCLLSYSGAR
3 changes: 3 additions & 0 deletions examples/custom_header_format.uniprot-like.fasta
@@ -0,0 +1,3 @@
>xx|ProteinID|ProteinID_CUSTOM hypothetical protein name GN=ProteinID
MAWTPLFLFLLTCCPGSNSQAVVTQEPSLTVSPGGTVTLTCGSSTGAVTSGHYPYWFQQK
PGQAPRTLIYDTSNKHSWTPARFSGSLLGGKAALTLLGAQPEDEAEYYCLLSYSGAR
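
The converted header above appears to be assembled from the fields set in the notebook's `CustomHeaderParser`. The following sketch reproduces that layout. The function name `format_uniprot_like_header` is hypothetical, the layout is inferred from this single example, and the handling of missing optional fields is an assumption rather than the behavior of the library's writer.

```python
def format_uniprot_like_header(fields: dict[str, str]) -> str:
    """Build 'db|identifier|entry_name protein_name GN=gene_name' from fields."""
    header = f"{fields['db']}|{fields['identifier']}|{fields['entry_name']}"
    # Assumption: optional fields are simply left out when they are absent.
    if "protein_name" in fields:
        header += f" {fields['protein_name']}"
    if "gene_name" in fields:
        header += f" GN={fields['gene_name']}"
    return header


fields = {
    "db": "xx",
    "identifier": "ProteinID",
    "entry_name": "ProteinID_CUSTOM",
    "gene_name": "ProteinID",
    "protein_name": "hypothetical protein name",
}
print(">" + format_uniprot_like_header(fields))
# >xx|ProteinID|ProteinID_CUSTOM hypothetical protein name GN=ProteinID
```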
File renamed without changes.