Merge branch 'develop'

hollenstein · Apr 19, 2024 · 20dfce5
2 parents e9e1ff4 + c8d25e6

Showing 19 changed files with 717 additions and 54 deletions.
diff --git a/.github/workflows/python-package.yml b/.github/workflows/python-package.yml
@@ -0,0 +1,32 @@
+# This workflow will install the profasta package and its dependencies and run tests with a variety of Python versions
+# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
+
+name: Python package
+
+on:
+  push:
+    branches: ["main", "develop"]
+  pull_request:
+
+jobs:
+  build:
+
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.9", "3.10", "3.11", "3.12"]
+
+    steps:
+    - uses: actions/[email protected]
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/[email protected]
+      with:
+        python-version: ${{ matrix.python-version }}
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install .[tests]
+    - name: Test with pytest
+      run: |
+        python -m pytest
diff --git a/.gitignore b/.gitignore
@@ -3,6 +3,7 @@ __pycache__/
 *.py[cod]
 *$py.class
 
+
 # C extensions
 *.so
 
@@ -104,6 +105,9 @@ venv.bak/
 ### VisualStudioCode ###
 .vscode/
 
+### visual studio ###
+.vs/
+
 # Local History for Visual Studio Code
 .history/
 

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,22 @@
+# Changelog
+
+----------------------------------------------------------------------------------------
+
+## Version [0.0.5]
+Released: 2024-04-19
+
+### Added
+- Added an option to `db.ProteinDatabase.add_fasta` that allows skipping entries whose headers could not be parsed, instead of raising a `ValueError`. (Suggested by @xeniorn)
+- Added `keys`, `values`, and `items` methods to `db.ProteinDatabase` to allow more convenient iteration over the database's entries.
+
+### Changed
+- Made `decoy.reverse_sequence` a private function.
+- Renamed the protocol classes `HeaderParser` and `HeaderWriter` to `AbstractHeaderParser` and `AbstractHeaderWriter` to be consistent with the naming of the other abstract classes. (Suggested by @xeniorn)
+
+### Fixed
+- Parsing a FASTA file returned invalid protein sequences when the sequence contained a terminal `*` character or lowercase letters. Terminal `*` characters are now removed from the sequence and the sequence is converted to uppercase. (Contributed by @xeniorn)
+
+### Chores
+- Added a GitHub Actions CI workflow for automated testing. (Contributed by @xeniorn)
+- Minor corrections and additions to some docstrings.
+- Added a Jupyter notebook containing usage examples for the ProFASTA library.
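
The sequence fix listed under "Fixed" in the changelog above (removing terminal `*` characters and converting the sequence to uppercase) can be illustrated with a short sketch. This is not the code from the profasta package; the function name `normalize_sequence` is hypothetical and only shows the behavior the changelog entry describes.

```python
def normalize_sequence(sequence: str) -> str:
    """Hypothetical helper, not part of profasta.

    Removes trailing '*' characters and converts the sequence to uppercase,
    as described in the 0.0.5 changelog entry.
    """
    # rstrip only touches the end of the string, so an internal '*' is kept.
    return sequence.rstrip("*").upper()


print(normalize_sequence("mepgAK*"))
```

Only terminal `*` characters are stripped here; how profasta treats a `*` inside a sequence is not stated in this diff.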
diff --git a/README.md b/README.md
@@ -1,5 +1,8 @@
 # ProFASTA
 [![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip)
+![Python Version from PEP 621 TOML](https://img.shields.io/python/required-version-toml?tomlFilePath=https%3A%2F%2Fraw.githubusercontent.com%2Fhollenstein%2Fprofasta%2Fmain%2Fpyproject.toml)
+[![pypi](https://img.shields.io/pypi/v/profasta)](https://pypi.org/project/profasta)
+[![unit-tests](https://github.com/hollenstein/profasta/actions/workflows/python-package.yml/badge.svg?branch=main)](https://github.com/hollenstein/profasta/actions/workflows/python-package.yml)
 
 ## Introduction
 ProFASTA is a Python library for working with FASTA files containing protein records. Unlike other packages, ProFASTA prioritizes simplicity, while aiming to provide a set of useful features required in the field of mass spectrometry-based proteomics.
@@ -21,14 +24,16 @@ The following code snippet shows how to import a FASTA file containing UniProt p
 ```python
 >>> import profasta
 >>> 
->>> fasta_path = "./example_data/uniprot_hsapiens_10entries.fasta"
+>>> fasta_path = "./examples/uniprot_hsapiens_10entries.fasta"
 >>> db = profasta.db.ProteinDatabase()
 >>> db.add_fasta(fasta_path, header_parser="uniprot")
 >>> protein_record = db["O75385"]
 >>> print(protein_record.header_fields["gene_name"])
 ULK1
 ```
 
+For more examples of how to use the ProFASTA library, please refer to the [code snippets](examples/code_snippets.ipynb) Jupyter notebook.
+
 ## Requirements
 Python >= 3.9
 
@@ -53,7 +58,7 @@ pip uninstall profasta
     - [x] built-in parser for uniprot format
     - [x] allow user defined parser
 - [x] write FASTA file
-    -[x] allow custom FASTA header generation
+    - [x] allow custom FASTA header generation
 
 **Additional features**
 - [x] read multiple FASTA files and write a combined file
@@ -62,3 +67,6 @@ pip uninstall profasta
     - [x] add decoy protein records to an existing FASTA file
 - [ ] validate FASTA file / FASTA records
 
+## Contributors
+
+- Juraj Ahel - [@xeniorn](https://github.com/xeniorn)
diff --git a/examples/code_snippets.ipynb b/examples/code_snippets.ipynb
@@ -0,0 +1,172 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "273cf753",
+   "metadata": {},
+   "source": [
+    "# Code snippets for working with the ProFASTA library"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8d8a7af6",
+   "metadata": {},
+   "source": [
+    "## Removing invalid characters from imported protein sequences"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "830d37b9",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "MEPG\n"
+     ]
+    }
+   ],
+   "source": [
+    "import profasta\n",
+    "\n",
+    "\n",
+    "def cleanup_protein_sequences(\n",
+    "        db: profasta.ProteinDatabase, alphabet=\"ABCDEFGHIJKLMNOPQRSTUVWXYZ\"\n",
+    "    ) -> None:\n",
+    "    \"\"\"Remove non-alphabet characters from protein sequences in the ProteinDatabase.\n",
+    "    \n",
+    "    Args:\n",
+    "        db: A profasta.ProteinDatabase instance.\n",
+    "        alphabet: String of characters that are allowed in the protein entry sequences.\n",
+    "    \"\"\"\n",
+    "    for entry in db.values():\n",
+    "        entry.sequence = \"\".join([aa for aa in entry.sequence if aa in alphabet])\n",
+    "\n",
+    "\n",
+    "fasta_path = \"./uniprot_hsapiens_10entries.fasta\"\n",
+    "db = profasta.db.ProteinDatabase()\n",
+    "db.add_fasta(fasta_path, header_parser=\"uniprot\")\n",
+    "db[\"O75385\"].sequence = \"MEPG_-+123\"\n",
+    "cleanup_protein_sequences(db)\n",
+    "\n",
+    "print(db[\"O75385\"].sequence)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0c5bea99",
+   "metadata": {},
+   "source": [
+    "## Converting FASTA headers into a UniProt-like format"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "1996bde3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import profasta\n",
+    "import profasta.parser\n",
+    "\n",
+    "\n",
+    "class CustomHeaderParser:\n",
+    "    \"\"\"Custom header parser.\"\"\"\n",
+    "\n",
+    "    @classmethod\n",
+    "    def parse(cls, header: str) -> profasta.parser.ParsedHeader:\n",
+    "        \"\"\"Parse a FASTA header string into a ParsedHeader object.\n",
+    "        \n",
+    "        Header format example:\n",
+    "        >ProteinID hypothetical protein name\n",
+    "        \"\"\"\n",
+    "        split_header = header.split(maxsplit=1)\n",
+    "        _id = split_header[0]\n",
+    "\n",
+    "        fields = {\n",
+    "            \"db\": \"xx\",\n",
+    "            \"identifier\": _id,\n",
+    "            \"entry_name\": f\"{_id}_CUSTOM\",\n",
+    "            \"gene_name\": _id,\n",
+    "        }\n",
+    "        if len(split_header) > 1:\n",
+    "            fields[\"protein_name\"] = split_header[1]\n",
+    "        return profasta.parser.ParsedHeader(_id, header, fields)\n",
+    "\n",
+    "# Register the custom header parser so that it can be used by the ProteinDatabase.\n",
+    "profasta.parser.register_parser(\"custom_parser\", CustomHeaderParser)\n",
+    "\n",
+    "fasta_path = \"./custom_header_format.fasta\"\n",
+    "converted_fasta_path = \"./custom_header_format.uniprot-like.fasta\"\n",
+    "protein_db = profasta.ProteinDatabase()\n",
+    "\n",
+    "# Specify the custom header parser to use for adding the FASTA file.\n",
+    "protein_db.add_fasta(fasta_path, header_parser=\"custom_parser\")\n",
+    "\n",
+    "# Write the ProteinDatabase to a new FASTA file using the uniprot-like header writer.\n",
+    "protein_db.write_fasta(converted_fasta_path, header_writer=\"uniprot_like\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "697f8065",
+   "metadata": {},
+   "source": [
+    "## Creating a combined FASTA file with added decoy entries\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "bc15636a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import profasta\n",
+    "\n",
+    "fasta_path = \"./uniprot_hsapiens_10entries.fasta\"\n",
+    "decoy_fasta_path = \"./uniprot_hsapiens_10entries_DECOY.fasta\"\n",
+    "\n",
+    "# Import the FASTA file\n",
+    "db = profasta.db.ProteinDatabase()\n",
+    "db.add_fasta(fasta_path, header_parser=\"uniprot\")\n",
+    "\n",
+    "# Create the new FASTA file and write the original entries to it.\n",
+    "db.write_fasta(decoy_fasta_path, header_writer=\"uniprot\")\n",
+    "\n",
+    "# Create a decoy database from the original database, containing reversed sequences.\n",
+    "decoy_db = profasta.create_decoy_db(db, keep_nterm_methionine=True)\n",
+    "\n",
+    "# Append the decoy entries to the new FASTA file.\n",
+    "decoy_db.write_fasta(decoy_fasta_path, header_writer=\"decoy\", append=True)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
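
The last notebook cell above builds decoy entries with reversed sequences and passes `keep_nterm_methionine=True`. The sketch below shows one plausible reading of that flag: a leading methionine stays in place while the rest of the sequence is reversed. This is an assumption about the semantics, not the profasta implementation, and the function name `reverse_keep_nterm_met` is hypothetical.

```python
def reverse_keep_nterm_met(sequence: str, keep_nterm_methionine: bool = True) -> str:
    """Hypothetical helper, not part of profasta.

    Reverses a protein sequence. If keep_nterm_methionine is True and the
    sequence starts with 'M', the 'M' stays at the N-terminus and only the
    remainder is reversed (assumed behavior).
    """
    if keep_nterm_methionine and sequence.startswith("M"):
        # sequence[:0:-1] walks from the last residue down to index 1.
        return "M" + sequence[:0:-1]
    return sequence[::-1]


print(reverse_keep_nterm_met("MEPG"))
print(reverse_keep_nterm_met("MEPG", keep_nterm_methionine=False))
```

Keeping the initiator methionine in place is a common convention for decoy generation, but the exact rule profasta applies should be checked against its own documentation.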
diff --git a/examples/custom_header_format.fasta b/examples/custom_header_format.fasta
@@ -0,0 +1,3 @@
+>ProteinID hypothetical protein name
+MAWTPLFLFLLTCCPGSNSQAVVTQEPSLTVSPGGTVTLTCGSSTGAVTSGHYPYWFQQK
+PGQAPRTLIYDTSNKHSWTPARFSGSLLGGKAALTLLGAQPEDEAEYYCLLSYSGAR
diff --git a/examples/custom_header_format.uniprot-like.fasta b/examples/custom_header_format.uniprot-like.fasta
@@ -0,0 +1,3 @@
+>xx|ProteinID|ProteinID_CUSTOM hypothetical protein name GN=ProteinID
+MAWTPLFLFLLTCCPGSNSQAVVTQEPSLTVSPGGTVTLTCGSSTGAVTSGHYPYWFQQK
+PGQAPRTLIYDTSNKHSWTPARFSGSLLGGKAALTLLGAQPEDEAEYYCLLSYSGAR
diff --git a/...ple_data/uniprot_hsapiens_10entries.fasta → examples/uniprot_hsapiens_10entries.fasta b/...ple_data/uniprot_hsapiens_10entries.fasta → examples/uniprot_hsapiens_10entries.fasta