Updates/cleanup #34

Merged (5 commits, Mar 25, 2025)
5 changes: 3 additions & 2 deletions .gitignore
@@ -10,7 +10,6 @@ dist/

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Django stuff
staticfiles/
@@ -34,4 +33,6 @@ env/
k8s/configmap_*.yaml
prod.env

tmp/
# Temporary files
tmp/
*.log
2 changes: 1 addition & 1 deletion README.md
@@ -39,6 +39,6 @@ See the [OpenAPI specification](https://editor.swagger.io/?url=https://raw.githu
### Data updates

The `track/:track_id` REST endpoint supports `DELETE`/`POST` requests for adding/removing track entries.
For bulk/automated updates, use `./utils/submit_track_templates.py` script. See the accompanied readme for more details.
For bulk/automated updates, use the `./utils/submit_tracks.py` script. See the accompanying readme for more details.


95 changes: 0 additions & 95 deletions templates/gene-track-desc-mvp.csv

This file was deleted.

2,739 changes: 0 additions & 2,739 deletions templates/gene-track-desc.csv

This file was deleted.

49 changes: 20 additions & 29 deletions utils/README.md
@@ -1,35 +1,26 @@
## Utility scripts
## Bulk data updater for Track API

### Bulk data updater
The `submit_tracks.py` script constructs and submits track payloads to the Track API `POST` endpoint to add new track records.
It uses the yaml templates in `/templates` and some supplemental data to construct the payloads (see below for more details).
The only required parameter is `--release` (release id); `--env` (environment name) defaults to 'dev'.
`--release` determines the list of genomes to be loaded, and `--env` sets the path to the track datafiles directory and the Track API endpoint URL.
The list of submitted genomes and tracks can also be specified explicitly with the `--genomes`, `--templates` and `--files` params.
The data directory path and Track API URL values derived from `--env` can be overridden with environment variables, in which case the `--env` param can be omitted (see the example below).
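
A minimal sketch of the precedence described above (environment variables over `--env`-derived defaults); only the `TRACK_DATA_DIR` and `TRACK_API_URL` variable names come from the example below, the per-environment default values are placeholders, not the script's actual configuration:

```python
# Illustrative sketch only: resolve the data dir and API URL from --env,
# letting TRACK_DATA_DIR / TRACK_API_URL environment variables take precedence.
# The per-environment defaults below are placeholders, not real endpoints.
import os

ENV_DEFAULTS = {
    "dev": {"data_dir": "/path/to/dev/datafiles", "api_url": "http://localhost:8000"},
    "prod": {"data_dir": "/path/to/prod/datafiles", "api_url": "https://tracks.example.org"},
}


def resolve_config(env: str = "dev") -> tuple[str, str]:
    defaults = ENV_DEFAULTS[env]
    data_dir = os.environ.get("TRACK_DATA_DIR", defaults["data_dir"])
    api_url = os.environ.get("TRACK_API_URL", defaults["api_url"])
    return data_dir, api_url
```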

`submit_track_templates.py` script constructs and submits track payloads to Track API `POST` endpoint to add new tracks.
It uses the yaml templates in `/templates` to fill the payload fields. You specify the submitted track types (templates) with `--template` parameter,
or by pointing it to a data directory (by setting `TRACK_API_DIR` environment variable). In the latter case the script tries to sync the data directory
to Track API by matching each datafile (with `.bb`, `.bw` extension) with corresponding template file: a template filename must match or start with the datafile name to be submitted.
The datafile parent directory names are used to fill in the genome UUID field (override via `--genome` param).
### Track template selection and filling
The script infers the list of tracks to submit for each genome from the datafile names in the data dir, matching each datafile name (with a `.bb` or `.bw` extension) to the corresponding template filenames as follows (sketched below):
- exact match: single template per datafile (e.g. `gc` track)
- template name starts with datafile name: multiple tracks per datafile (e.g. 4 gene tracks from `transcripts.bb`)
- datafile name starts with template name: fallback template (e.g. `variant.yaml` template is used for all `variant*.bb` datafiles that don't have exact template match like `variant-eva-details`)
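
A minimal sketch of the three matching rules above (illustrative only; the helper name and structure are assumptions, not taken from `submit_tracks.py`):

```python
# Illustrative matching of datafiles to templates; names are assumptions.
from pathlib import Path


def templates_for_datafile(datafile: Path, template_names: list[str]) -> list[str]:
    """Return the template names that apply to one .bb/.bw datafile."""
    stem = datafile.stem  # e.g. "transcripts" for "transcripts.bb"
    # Rules 1 and 2: exact match, or template name starts with the datafile name
    # (e.g. several gene templates for transcripts.bb).
    matches = [t for t in template_names if t == stem or t.startswith(stem)]
    if matches:
        return matches
    # Rule 3 (fallback): datafile name starts with a template name,
    # e.g. a hypothetical variant-eva.bb falls back to the generic "variant" template.
    return [t for t in template_names if stem.startswith(t)]
```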

The datafile directory path can be replaced with an explicit list of template and/or datafile names via `--templates` or `--files`. Note that in this case the script doesn't check the presence of datafiles.

In most cases the track template (i.e. the track type, derived from the datafile name) defines all the necessary fields in the track payload. The fields/values in the template (e.g. track label, category, description etc.) can be changed by updating the template in the GitHub repo. An exception is the description field for gene and variation tracks, which varies by species and is populated at track submission time from the Metadata DB (see `get_gene_track_desc.py`) or a CSV file (`/templates/variant-track-desc.csv`).
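
A hedged sketch of how a species-specific description override might be applied on top of a YAML template; the CSV column names follow the dump produced by the old `get_analysis_descs_mvp.py`, while the payload keys are assumptions rather than the Track API schema:

```python
# Illustrative only: merge a YAML template with a per-genome description override.
import csv
from pathlib import Path

import yaml  # PyYAML, listed in utils/requirements.txt


def load_desc_overrides(csv_path: Path) -> dict[str, str]:
    """Map genome UUID -> description text from an override CSV."""
    with csv_path.open(newline="", encoding="utf-8") as fh:
        return {row["Genome_UUID"]: row["Description"] for row in csv.DictReader(fh)}


def build_payload(template_path: Path, genome_uuid: str, overrides: dict[str, str]) -> dict:
    """Start from the YAML template and patch in the species-specific description."""
    payload = yaml.safe_load(template_path.read_text())
    payload["genome_id"] = genome_uuid  # assumed field name
    if genome_uuid in overrides:
        payload["description"] = overrides[genome_uuid]
    return payload
```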

Example:
```bash
export TRACK_API_URL=http://localhost:8000 # target track API
export TRACK_DATA_DIR=/Users/Alice/datafiles # track datafiles
./utils/submit_track_templates.py #submit tracks for all datafiles in data dir
export TRACK_DATA_DIR=/Users/Alice/datafiles # override track datafiles location
export TRACK_API_URL=http://localhost:8000 # override target track API URL
./utils/submit_tracks.py --release 5 # submit all tracks for this release to the local endpoint
```
For more detailed instructions for production, refer to [ENSWEBSOPS-171](https://www.ebi.ac.uk/panda/jira/browse/ENSWEBSOPS-171).


#### Data sources
Data submission script combines input from multiple sources:
| Source | Where | Why |
|-------|-------|------|
| Templates | `/templates/*.yaml` | Track payload base |
| Overrides | `/templates/*.csv` | Species-specific values |
| Datafiles | Input data dir | Specify species/tracks to submit |

- The datafiles source can be replaced/modified with command-line params (`-g`, `-t`)
- The gene track overrides (`gene-track-desc.csv`) were generated/updated from the Metadata API with the `utils/get_analysis_descs.py` script; `variation-track-desc.csv` was converted from input JSON handed over from the Variation team.

### Legacy scripts
Kept for historical record, safe to delete.
- import_data.py: legacy data importer script; useful as an example of updating tracks database through Django datamodels
- grant_access.sh: useful when read/write database users don't have a shared db schema
- submit_tracks_241: initial track submission script
For more detailed instructions for running the track loading script, refer to [ENSWEBSOPS-171](https://www.ebi.ac.uk/panda/jira/browse/ENSWEBSOPS-171).
Empty file removed utils/__init__.py
92 changes: 31 additions & 61 deletions utils/get_analysis_descs_mvp.py → utils/get_gene_track_desc.py
@@ -3,8 +3,9 @@
out of core DBs to feed into the new genome browser (MVP/Beta).

Species scope is determined by appropriate choice of the 'metadata DB'
and the RELEASE ID. This latter one must be given by Production.
E.g. RELEASE_ID = 3 represents 'beta-3'
and the 'release ID'. The latter must be given by Automation.
E.g. release_id = 3 represents 'beta-3'
The species list can optionally be narrowed down with genome IDs.

WARNING: it contains hardcoded data. Barely tested.
USE AT YOUR OWN RISK!
@@ -14,20 +15,15 @@

from dataclasses import dataclass
from string import Template
import csv
from mysql.connector import connection
from mysql.connector.connection import MySQLConnection

# CHECK cfg below with Production!!!
# CHECK cfg below with Automation!!!
META_HOST = "mysql-ens-production-1"
META_PORT = 4721
META_DB = "ensembl_genome_metadata"
RELEASE_ID = 4 # this is a magic number known by Production

HOST = "mysql-ens-sta-6.ebi.ac.uk"
PORT = 4695
CSV_DELIMITER = ","
OUTFILENAME = "gene-track-desc-mvp.csv"


@dataclass
class SrcInfo:
@@ -38,8 +34,8 @@ class SrcInfo:
is_ensembl_anno: bool


def get_metadb_connection(username="ensro", password=""):
return connection.MySQLConnection(
def get_metadb_connection(username="ensro", password="") -> MySQLConnection:
return MySQLConnection(
user=username,
password=password,
host=META_HOST,
@@ -48,22 +44,27 @@ def get_connection(username, password, dbname=None):
)


def get_connection(username, password, dbname=None):
def get_connection(username: str, password: str, dbname: str | None = None) -> MySQLConnection:
if dbname is None:
return connection.MySQLConnection(
return MySQLConnection(
user=username, password=password, host=HOST, port=PORT
)
return connection.MySQLConnection(
return MySQLConnection(
user=username, password=password, host=HOST, port=PORT, database=dbname
)


def get_ensro_connection(dbname=None):
def get_ensro_connection(dbname: str | None = None):
return get_connection(username="ensro", password="", dbname=dbname)


def get_dbs(conx):
def get_dbs(conx, release: int, genomes: list[str] | None = None) -> list:
cursor = conx.cursor()
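# Optional filter: restrict the query to an explicit list of genome UUIDs;
# when no list is given, all genomes in the requested release are returned.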
if genomes is None:
genomes_str = ""
else:
genomes_str = ",".join([f"'{uuid}'" for uuid in genomes])
genomes_str = f"and g.genome_uuid in ({genomes_str})"
t = Template(
"""select g.production_name,g.genome_uuid,dss.name from genome g
join genome_release gr using(genome_id)
@@ -73,14 +74,15 @@ def get_dbs(conx):
where gr.release_id = $release_id
and gd.release_id = $release_id
and ds.dataset_type_id = 2
$genomes_str
"""
)
cursor.execute(t.substitute(release_id=RELEASE_ID))
cursor.execute(t.substitute(release_id=release, genomes_str=genomes_str))
dbs = cursor.fetchall()
return dbs


def get_analysis_src_info(conx, dbname) -> SrcInfo:
def get_analysis_src_info(conx: MySQLConnection, dbname: str) -> SrcInfo:
cursor = conx.cursor()

t = Template(
@@ -100,57 +102,25 @@ def get_analysis_src_info(conx, dbname) -> SrcInfo:
return SrcInfo(source_name=r[2], source_url=r[3], is_ensembl_anno=is_ensembl_anno)


def dump_data(data: list[dict[str, str]], filename: str = OUTFILENAME) -> None:
with open(filename, "w", newline="", encoding="utf-8") as csvfile:
fieldnames = [
"Species",
"Genome_UUID",
"DB_name",
"Source_name",
"Source_URL",
"Description",
]
writer = csv.DictWriter(
csvfile,
fieldnames=fieldnames,
delimiter=CSV_DELIMITER,
quoting=csv.QUOTE_MINIMAL,
dialect="unix",
)
writer.writeheader()
for item in data:
writer.writerow(item)


def main():
def main(release: int, genomes: list[str] | None = None) -> dict[str, dict]:
conx = get_metadb_connection()
dbs = get_dbs(conx)
dbs = get_dbs(conx, release=release, genomes=genomes)
conx.close()

conx = get_ensro_connection()
descriptions = {}

descriptions = []

print(f"Found {len(dbs)} genomes{f' (out of {len(genomes)} requested)' if genomes else ''} for release {release}.")
for db in dbs:
dbname = db[2]
print(f"Working on: {dbname}")
#print(f"Working on: {dbname}")
src_info = get_analysis_src_info(conx, dbname)
ensembl_imported = "Annotated" if src_info.is_ensembl_anno else "Imported"
descriptions.append(
{
"Species": db[0],
"Genome_UUID": db[1],
"DB_name": dbname,
"Source_name": src_info.source_name,
"Source_URL": src_info.source_url,
"Description": ensembl_imported,
}
)
descriptions[db[1]] = {
"source_names": [src_info.source_name],
"source_urls": [src_info.source_url],
"description": ensembl_imported,
}

conx.close()

dump_data(descriptions)


if __name__ == "__main__":
main()
return descriptions
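
Since the module no longer writes a CSV or runs as a standalone script, here is a hedged example of how the refactored `main()` might be consumed from a loader script; the import path and the genome UUID are placeholders, only the return structure (`source_names`, `source_urls`, `description`) follows the code above:

```python
# Illustrative only: consume the refactored helper from another script.
from get_gene_track_desc import main as fetch_gene_track_descs  # assumes utils/ is on sys.path

descs = fetch_gene_track_descs(release=5, genomes=["<genome-uuid>"])
for genome_uuid, info in descs.items():
    print(genome_uuid, info["description"], info["source_names"])
```
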
4 changes: 0 additions & 4 deletions utils/grant_access.sh

This file was deleted.

103 changes: 0 additions & 103 deletions utils/import_data.py

This file was deleted.

4 changes: 4 additions & 0 deletions utils/requirements.txt
@@ -0,0 +1,4 @@
mysql-connector-python>=9.0
PyYAML>=6.0
requests>=2.20
typing-extensions>=4.10