manulera
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
Lines changed: 5 additions & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 11 additions & 0 deletions
diff --git a/docs/notebooks/.gitignore b/docs/notebooks/.gitignore
Lines changed: 1 addition & 0 deletions
diff --git a/docs/notebooks/external_sequences.ipynb b/docs/notebooks/external_sequences.ipynb
Lines changed: 301 additions & 0 deletions
diff --git a/docs/notebooks/images/gly1_opencloning.png b/docs/notebooks/images/gly1_opencloning.png
70.6 KB
@@ -32,7 +32,11 @@ repos:
         entry: python scripts/check_httpx_imports.py
         language: system
         files: \.py$
-        exclude: tests/
+        exclude: |
+            (?x)^(
+                tests/.*|
+                scripts/.*
+            )$
   # Hook to ensure that primer3 is only imported in the primer3_functions.py file
   # This is to centralize how the settings etc. are handled.
   - repo: local
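
The multi-line `exclude` value in the hunk above is a verbose-mode (`(?x)`) regular expression. As a minimal sketch of how such a pattern can be checked locally, the snippet below applies a pattern of the same shape (written without a trailing empty alternative) to a few paths. It assumes that pre-commit evaluates `exclude` with Python's `re.search`; the helper `is_excluded` and the sample paths other than `scripts/update_catalogs.py` are hypothetical and only for illustration.

```python
import re

# Same shape as the `exclude` value in the hook: (?x) turns on verbose mode,
# so the newlines and indentation inside the group are ignored.
EXCLUDE = r"""(?x)^(
    tests/.*|
    scripts/.*
)$"""


def is_excluded(path: str) -> bool:
    """Return True if the (repo-relative) path matches the exclude pattern."""
    return re.search(EXCLUDE, path) is not None


if __name__ == "__main__":
    # Hypothetical sample paths, chosen only to exercise both branches.
    for path in ["tests/test_api.py", "scripts/update_catalogs.py", "src/opencloning/main.py"]:
        print(f"{path}: excluded={is_excluded(path)}")
```

Because the pattern is anchored with `^` and `$`, only paths that start with `tests/` or `scripts/` match; a nested directory such as `src/tests/` is still checked by the hook.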
 
@@ -153,3 +153,14 @@ RECORD_STUBS=1 uvicorn opencloning.main:app --reload --reload-exclude='.venv'
 ```
 
 This will record the stubs (requests and responses) in the `stubs` folder.
+
+
+### Catalogs
+
+Catalogs map IDs to URLs for several plasmid collections. They are stored in the `src/opencloning/catalogs` folder.
+
+To update the catalogs, run the following command:
+
+```bash
+poetry run python scripts/update_catalogs.py
+```
@@ -0,0 +1 @@
+*.json
@@ -0,0 +1,301 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "0",
+   "metadata": {},
+   "source": [
+    "# Importing external sequences"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from opencloning.ncbi_requests import get_annotations_from_query, get_genome_region_from_annotation, get_genbank_sequence\n",
+    "from pydna.opencloning_models import CloningStrategy"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2",
+   "metadata": {},
+   "source": [
+    "## Importing genomic sequences\n",
+    "\n",
+    "You can import genomic sequences in different ways"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3",
+   "metadata": {},
+   "source": [
+    "### Querying the annotation of a genome"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4",
+   "metadata": {},
+   "source": [
+    "Let's start by querying all the annotations of the _S. cerevisiae_ genome that contain the word \"aldolase\".\n",
+    "\n",
+    "We are using the assembly accession of the reference genome, which is `GCF_000146045.2`. To find the assembly accession for your genome of interest, you can use the [NCBI Datasets page](https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=460519).\n",
+    "\n",
+    "> Remember that this is an asynchronous function, so you need to use `await` to call it inside notebooks or async functions, and `asyncio.run` to call it in normal scripts."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "annotations = await get_annotations_from_query('aldolase', 'GCF_000146045.2')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6",
+   "metadata": {},
+   "source": [
+    "Now let's check what the annotation contains. It has a lot of info!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'gene_id': '851888',\n",
+       " 'symbol': 'DPL1',\n",
+       " 'name': 'sphinganine-1-phosphate aldolase DPL1',\n",
+       " 'gene_type': 'protein-coding',\n",
+       " 'locus_tag': 'YDR294C',\n",
+       " 'genomic_regions': [{'gene_range': {'accession_version': 'NC_001136.10',\n",
+       "    'range': [{'begin': '1050459',\n",
+       "      'end': '1052228',\n",
+       "      'orientation': 'minus'}]}}],\n",
+       " 'transcripts': [{'accession_version': 'NM_001180602.1',\n",
+       "   'name': 'sphinganine-1-phosphate aldolase DPL1',\n",
+       "   'cds': {'accession_version': 'NM_001180602.1'},\n",
+       "   'genomic_locations': [{'genomic_accession_version': 'NC_001136.10',\n",
+       "     'genomic_range': {'begin': '1050459',\n",
+       "      'end': '1052228',\n",
+       "      'orientation': 'minus'}}],\n",
+       "   'protein': {'accession_version': 'NP_010580.1',\n",
+       "    'name': 'sphinganine-1-phosphate aldolase DPL1',\n",
+       "    'length': 589}}],\n",
+       " 'chromosomes': ['IV'],\n",
+       " 'annotations': [{'assembly_accession': 'GCF_000146045.2'}]}"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "annotations[0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8",
+   "metadata": {},
+   "source": [
+    "Let's see what annotations we got:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0: YDR294C - sphinganine-1-phosphate aldolase DPL1\n",
+      "1: YEL046C - threonine aldolase GLY1\n",
+      "2: YER010C - bifunctional 4-hydroxy-4-methyl-2-oxoglutarate aldolase/oxaloacetate decarboxylase\n",
+      "3: YGR043C - sedoheptulose-7-phosphate:D-glyceraldehyde-3-phosphate transaldolase NQM1\n",
+      "4: YKL060C - fructose-bisphosphate aldolase FBA1\n",
+      "5: YLR354C - sedoheptulose-7-phosphate:D-glyceraldehyde-3-phosphate transaldolase TAL1\n",
+      "6: YNL256W - trifunctional dihydropteroate synthetase/dihydrohydroxymethylpterin pyrophosphokinase/dihydroneopterin aldolase FOL1\n"
+     ]
+    }
+   ],
+   "source": [
+    "for i, annotation in enumerate(annotations):\n",
+    "    print(f'{i}: {annotation[\"locus_tag\"]} - {annotation[\"name\"]}')\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "10",
+   "metadata": {},
+   "source": [
+    "Great! Now let's say that we are interested in the `threonine aldolase GLY1` gene.\n",
+    "We can get the sequence of this locus with the `get_genome_region_from_annotation` function. The two padding arguments extend the region to the left and right of the gene, so that we also get the neighbouring regions and not just the gene itself."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "11",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gly1 = await get_genome_region_from_annotation(annotations[0], 1000, 1000)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "12",
+   "metadata": {},
+   "source": [
+    "We can also see what information is stored in the source of the sequence, and display its history."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "13",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "input=[] assembly_accession='GCF_000146045.2' sequence_accession='NC_001136.10' locus_tag='YDR294C' gene_id=851888 start=1049459 end=1053228 strand=-1\n",
+      "╙── YDR294C (Dseqrecord(-3770))\n",
+      "    └─╼ GenomeCoordinatesSource\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(gly1.source)\n",
+    "print(gly1.history())\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "14",
+   "metadata": {},
+   "source": [
+    "You can also save it to a JSON file and open it in [OpenCloning](https://app.opencloning.org), where it looks like this:\n",
+    "\n",
+    "<img src=\"images/gly1_opencloning.png\" width=\"300\"/>"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "15",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cs = CloningStrategy.from_dseqrecords([gly1])\n",
+    "\n",
+    "with open(\"gly_history.json\", \"w\") as f:\n",
+    "    f.write(cs.model_dump_json())\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "16",
+   "metadata": {},
+   "source": [
+    "### Using genome coordinates\n",
+    "\n",
+    "For this, we need the sequence accession (not the assembly accession) of the chromosome of interest.\n",
+    "\n",
+    "You can find the sequence accession in the `Chromosomes` section of the [NCBI Datasets page](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000146045.2). You can also do this programmatically with the Datasets API; see the function `get_sequence_accessions_from_assembly_accession` for an example.\n",
+    "\n",
+    "For our example, we are interested in chromosome IV, and the sequence accession is `NC_001136.10`. Let's take the same coordinates as before, and get the same gly1 sequence.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "17",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "They are the same sequence: True\n"
+     ]
+    }
+   ],
+   "source": [
+    "gly1_from_coordinates = await get_genbank_sequence('NC_001136.10', start=1049459, end=1053228, strand=-1)\n",
+    "\n",
+    "print(\"They are the same sequence:\", gly1_from_coordinates.seq == gly1.seq)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "18",
+   "metadata": {},
+   "source": [
+    "Notice that in this case, the source will not contain the gene ID or locus tag."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "19",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "RepositoryIdSource(input=[], repository_id='NC_001136.10', repository_name='genbank')"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "gly1_from_coordinates.source"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
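
The notebook notes that these functions need `await` inside notebooks and `asyncio.run` in normal scripts. A minimal sketch of the script form follows. To stay runnable without network access it does not call the real `get_annotations_from_query`; `fetch_annotations_stub` is a hypothetical stand-in that returns a trimmed copy of the first record from the notebook output, and `padded_window` is a hypothetical helper that only reproduces the arithmetic visible in the printed source (gene range 1050459 to 1052228 with 1000 bases of padding on each side). How the library treats strand when the two paddings differ is not shown in this diff, so the sketch makes no claim about it.

```python
import asyncio


async def fetch_annotations_stub(query: str, assembly_accession: str) -> list[dict]:
    """Hypothetical stand-in for get_annotations_from_query (no network access)."""
    await asyncio.sleep(0)  # yield to the event loop once, as a real request would
    # Trimmed copy of the first annotation record shown in the notebook output.
    return [
        {
            "symbol": "DPL1",
            "locus_tag": "YDR294C",
            "genomic_regions": [
                {
                    "gene_range": {
                        "accession_version": "NC_001136.10",
                        "range": [
                            {"begin": "1050459", "end": "1052228", "orientation": "minus"}
                        ],
                    }
                }
            ],
        }
    ]


def padded_window(annotation: dict, padding_left: int, padding_right: int) -> tuple[str, int, int]:
    """Return (sequence accession, start, end) for the gene range plus padding."""
    gene_range = annotation["genomic_regions"][0]["gene_range"]
    span = gene_range["range"][0]
    # The coordinates are strings in the annotation record, so convert them first.
    start = int(span["begin"]) - padding_left
    end = int(span["end"]) + padding_right
    return gene_range["accession_version"], start, end


async def main() -> tuple[str, int, int]:
    # In a notebook you would write `await ...` at the top level instead.
    annotations = await fetch_annotations_stub("aldolase", "GCF_000146045.2")
    return padded_window(annotations[0], 1000, 1000)


if __name__ == "__main__":
    # In a normal script, asyncio.run drives the coroutine to completion.
    print(asyncio.run(main()))
```

The resulting window spans 3770 bases, which agrees with the `Dseqrecord(-3770)` shown in the history output and with the `start` and `end` values passed to `get_genbank_sequence` in the coordinates example.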