Commit e15020e

Use pydna classes for external imports and oligonucleotide hybridization (#378)
* wip towards simplifying import of external seqs
* update pydna
* better addgene
* add oligo hybridization and improve common function, SEVA tests weirdly failing
* update SEVA plasmids
* test SEVA plasmid that does not have url in data.js
* wip refactor snapgene
* reformat wekwikgene and snapgene
* simplify repository error handling
* add support for OpenDNA collections
* update data model version
* improve test coverage + better handling of errors when parsing files
* iGEM support and some fixes
* iGEM support and some fixes
* added euroscarf
* working on genbank endpoint
* improve error handling
* working on genome region, exception handling makes tests fail
* simplify http error handling
* fix error handling + reduce reruns to 2
* fix vulns suggested by copilot
* improve test coverage
* improve test coverage
1 parent 7137e1d commit e15020e

31 files changed, +7505 -501 lines changed

.pre-commit-config.yaml

Lines changed: 5 additions & 1 deletion
@@ -32,7 +32,11 @@ repos:
 entry: python scripts/check_httpx_imports.py
 language: system
 files: \.py$
-exclude: tests/
+exclude: |
+(?x)^(
+tests/.*|
+scripts/.*|
+)$
 # Hook to ensure that primer3 is only imported in the primer3_functions.py file
 # This is to centralize how the settings etc. are handled.
 - repo: local
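
Reviewer note: the new `exclude` value is a verbose-mode (`(?x)`) regular expression, so its alternatives can be spread over several lines. The sketch below checks which paths such a pattern would skip. It assumes the pattern is applied with `re.search` semantics against repository-relative paths; `is_excluded` and the sample paths are hypothetical names used only for illustration.

```python
import re

# Same alternatives as the hook's `exclude` value. With (?x), whitespace and
# newlines inside the pattern are ignored, so each alternative gets its own line.
EXCLUDE = re.compile(
    r"""(?x)^(
        tests/.*|
        scripts/.*|
    )$"""
)


def is_excluded(path: str) -> bool:
    """Return True if the exclude pattern matches the given path."""
    return EXCLUDE.search(path) is not None


if __name__ == "__main__":
    for candidate in ("tests/test_x.py", "scripts/check_httpx_imports.py", "src/opencloning/main.py"):
        print(candidate, is_excluded(candidate))
```

The trailing `|` before `)` adds an empty alternative. Because of the `^...$` anchors it can only match an empty path, so it should not exclude any real file.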

README.md

Lines changed: 11 additions & 0 deletions
@@ -153,3 +153,14 @@ RECORD_STUBS=1 uvicorn opencloning.main:app --reload --reload-exclude='.venv'
153153
```
154154

155155
This will record the stubs (requests and responses) in the `stubs` folder.
156+
157+
158+
### Catalogs
159+
160+
Catalogs are used to map ids to urls for several plasmid collections. They are stored in the `src/opencloning/catalogs` folder.
161+
162+
To update the catalogs, run the following command:
163+
164+
```bash
165+
poetry run python scripts/update_catalogs.py
166+
```

docs/notebooks/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+*.json
Lines changed: 301 additions & 0 deletions
@@ -0,0 +1,301 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "0",
+   "metadata": {},
+   "source": [
+    "# Importing external sequences"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from opencloning.ncbi_requests import get_annotations_from_query, get_genome_region_from_annotation, get_genbank_sequence\n",
+    "from pydna.opencloning_models import CloningStrategy"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2",
+   "metadata": {},
+   "source": [
+    "## Importing genomic sequences\n",
+    "\n",
+    "You can import genomic sequences in different ways"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3",
+   "metadata": {},
+   "source": [
+    "### Querying the annotation of a genome"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4",
+   "metadata": {},
+   "source": [
+    "Let's start by querying all the annotations of the _S. cerevisiae_ genome that contain the word \"aldolase\".\n",
+    "\n",
+    "We are using the assembly accession of the reference genome, which is `GCF_000146045.2`. If you want to find the assembly accession for your genome of interest, you can use the [NCBI Datasets page](https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=460519).\n",
+    "\n",
+    "> Remember that this is an asynchronous function, so you need to use `await` to call it inside notebooks or async functions, and `asyncio.run` to call it in normal scripts."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "annotations = await get_annotations_from_query('aldolase', 'GCF_000146045.2')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6",
+   "metadata": {},
+   "source": [
+    "Now let's check what the annotation contains. It has a lot of info!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'gene_id': '851888',\n",
+       " 'symbol': 'DPL1',\n",
+       " 'name': 'sphinganine-1-phosphate aldolase DPL1',\n",
+       " 'gene_type': 'protein-coding',\n",
+       " 'locus_tag': 'YDR294C',\n",
+       " 'genomic_regions': [{'gene_range': {'accession_version': 'NC_001136.10',\n",
+       " 'range': [{'begin': '1050459',\n",
+       " 'end': '1052228',\n",
+       " 'orientation': 'minus'}]}}],\n",
+       " 'transcripts': [{'accession_version': 'NM_001180602.1',\n",
+       " 'name': 'sphinganine-1-phosphate aldolase DPL1',\n",
+       " 'cds': {'accession_version': 'NM_001180602.1'},\n",
+       " 'genomic_locations': [{'genomic_accession_version': 'NC_001136.10',\n",
+       " 'genomic_range': {'begin': '1050459',\n",
+       " 'end': '1052228',\n",
+       " 'orientation': 'minus'}}],\n",
+       " 'protein': {'accession_version': 'NP_010580.1',\n",
+       " 'name': 'sphinganine-1-phosphate aldolase DPL1',\n",
+       " 'length': 589}}],\n",
+       " 'chromosomes': ['IV'],\n",
+       " 'annotations': [{'assembly_accession': 'GCF_000146045.2'}]}"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "annotations[0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8",
+   "metadata": {},
+   "source": [
+    "Let's see what annotations we got:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0: YDR294C - sphinganine-1-phosphate aldolase DPL1\n",
+      "1: YEL046C - threonine aldolase GLY1\n",
+      "2: YER010C - bifunctional 4-hydroxy-4-methyl-2-oxoglutarate aldolase/oxaloacetate decarboxylase\n",
+      "3: YGR043C - sedoheptulose-7-phosphate:D-glyceraldehyde-3-phosphate transaldolase NQM1\n",
+      "4: YKL060C - fructose-bisphosphate aldolase FBA1\n",
+      "5: YLR354C - sedoheptulose-7-phosphate:D-glyceraldehyde-3-phosphate transaldolase TAL1\n",
+      "6: YNL256W - trifunctional dihydropteroate synthetase/dihydrohydroxymethylpterin pyrophosphokinase/dihydroneopterin aldolase FOL1\n"
+     ]
+    }
+   ],
+   "source": [
+    "for i, annotation in enumerate(annotations):\n",
+    "    print(f'{i}: {annotation[\"locus_tag\"]} - {annotation[\"name\"]}')\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "10",
+   "metadata": {},
+   "source": [
+    "Great! Now let's say that we are interested in the first hit (`annotations[0]`), the `sphinganine-1-phosphate aldolase DPL1` gene.\n",
+    "We can get the sequence of this locus with the `get_genome_region_from_annotation` function. We can provide padding to the left and right of the gene to also get neighbouring regions and not just the gene itself."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "11",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gly1 = await get_genome_region_from_annotation(annotations[0], 1000, 1000)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "12",
+   "metadata": {},
+   "source": [
+    "We can also see what information is stored in the source of the sequence, and display its history."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "13",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "input=[] assembly_accession='GCF_000146045.2' sequence_accession='NC_001136.10' locus_tag='YDR294C' gene_id=851888 start=1049459 end=1053228 strand=-1\n",
+      "None\n",
+      "╙── YDR294C (Dseqrecord(-3770))\n",
+      " └─╼ GenomeCoordinatesSource\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(print(gly1.source))\n",
+    "print(gly1.history())\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "14",
+   "metadata": {},
+   "source": [
+    "You can also save it to a json file and open it in [OpenCloning](https://app.opencloning.org), and it would look like this:\n",
+    "\n",
+    "<img src=\"images/gly1_opencloning.png\" width=\"300\"/>"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "15",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cs = CloningStrategy.from_dseqrecords([gly1])\n",
+    "\n",
+    "with open(\"gly_history.json\", \"w\") as f:\n",
+    "    f.write(cs.model_dump_json())\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "16",
+   "metadata": {},
+   "source": [
+    "### Using genome coordinates\n",
+    "\n",
+    "For this, we need the sequence accession (not the assembly accession) of the chromosome of interest.\n",
+    "\n",
+    "You can search for the sequence accession of interest in the [NCBI Datasets page](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000146045.2), in the `Chromosomes` section. You can also do this programmatically using the Datasets API; see how it is used in the function `get_sequence_accessions_from_assembly_accession`.\n",
+    "\n",
+    "For our example, we are interested in chromosome IV, and the sequence accession is `NC_001136.10`. Let's take the same coordinates as before, and get the same sequence as `gly1`.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "17",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "They are the same sequence: True\n"
+     ]
+    }
+   ],
+   "source": [
+    "gly1_from_coordinates = await get_genbank_sequence('NC_001136.10', start=1049459, end=1053228, strand=-1)\n",
+    "\n",
+    "print(\"They are the same sequence:\", gly1_from_coordinates.seq == gly1.seq)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "18",
+   "metadata": {},
+   "source": [
+    "Notice that in this case, the source will not contain gene id or locus tag."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "19",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "RepositoryIdSource(input=[], repository_id='NC_001136.10', repository_name='genbank')"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "gly1_from_coordinates.source"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
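
Reviewer note: the notebook makes two points that are easy to check offline. Coroutines need `asyncio.run` in a plain script, and the recorded `start`/`end` values follow from the gene range plus 1000 bp of padding on each side. The following is a minimal sketch, assuming 1-based inclusive coordinates; `padded_range` and `fake_get_region` are hypothetical stand-ins, not part of `opencloning`, and no network request is made.

```python
import asyncio


def padded_range(begin: int, end: int, padding_left: int, padding_right: int) -> tuple[int, int]:
    """Widen a 1-based inclusive range, clamping the start at position 1."""
    return max(1, begin - padding_left), end + padding_right


async def fake_get_region(begin: int, end: int, padding_left: int, padding_right: int) -> dict:
    """Hypothetical stand-in for a network-backed coroutine (assumption, not the real API)."""
    await asyncio.sleep(0)  # yield to the event loop, as a real request would
    start, stop = padded_range(begin, end, padding_left, padding_right)
    return {"start": start, "end": stop, "length": stop - start + 1}


def main() -> dict:
    # A plain script has no running event loop, so asyncio.run drives the coroutine.
    # In a notebook you would write instead: region = await fake_get_region(...)
    return asyncio.run(fake_get_region(1050459, 1052228, 1000, 1000))


if __name__ == "__main__":
    print(main())
```

With the gene range from the recorded annotation (1050459 to 1052228), this gives 1049459 to 1053228 and a length of 3770, which agrees with the `start`, `end` and `Dseqrecord(-3770)` values in the notebook output.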