Updates/cleanup #34

Merged (5 commits, Mar 25, 2025)
5 changes: 3 additions & 2 deletions .gitignore
@@ -10,7 +10,6 @@ dist/

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Django stuff
staticfiles/
@@ -34,4 +33,6 @@ env/
k8s/configmap_*.yaml
prod.env

tmp/
# Temporary files
tmp/
*.log
2 changes: 1 addition & 1 deletion README.md
@@ -39,6 +39,6 @@ See the [OpenAPI specification](https://editor.swagger.io/?url=https://raw.githu
### Data updates

The `track/:track_id` REST endpoint supports `DELETE`/`POST` requests for adding/removing track entries.
For bulk/automated updates, use `./utils/submit_track_templates.py` script. See the accompanied readme for more details.
For bulk/automated updates, use the `./utils/submit_tracks.py` script. See the accompanying readme for more details.


95 changes: 0 additions & 95 deletions templates/gene-track-desc-mvp.csv

This file was deleted.

2,739 changes: 0 additions & 2,739 deletions templates/gene-track-desc.csv

This file was deleted.

49 changes: 20 additions & 29 deletions utils/README.md
@@ -1,35 +1,26 @@
## Utility scripts
## Bulk data updater for Track API

### Bulk data updater
The `submit_tracks.py` script constructs and submits track payloads to the Track API `POST` endpoint to add new track records.
It uses the yaml templates in `/templates` and some supplemental data to construct the payloads (see below for more details).
The only required parameter is `--release` (release id); `--env` (environment name) defaults to 'dev'.
`--release` determines the list of genomes to be loaded, and `--env` sets the path to the track datafiles directory and the Track API endpoint URL.
The list of submitted genomes and tracks can also be specified explicitly with the `--genomes`, `--templates` and `--files` params.
The data directory path and Track API URL values derived from `--env` can be overridden with environment variables, in which case the `--env` param can be omitted (see the example below).
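
A minimal sketch of the precedence described above (environment variables over `--env`-derived defaults); only the `TRACK_DATA_DIR` and `TRACK_API_URL` variable names come from the example below, the per-environment default values are placeholders, not the script's actual configuration:

```python
# Illustrative sketch only: resolve the data dir and API URL from --env,
# letting TRACK_DATA_DIR / TRACK_API_URL environment variables take precedence.
# The per-environment defaults below are placeholders, not real endpoints.
import os

ENV_DEFAULTS = {
    "dev": {"data_dir": "/path/to/dev/datafiles", "api_url": "http://localhost:8000"},
    "prod": {"data_dir": "/path/to/prod/datafiles", "api_url": "https://tracks.example.org"},
}


def resolve_config(env: str = "dev") -> tuple[str, str]:
    defaults = ENV_DEFAULTS[env]
    data_dir = os.environ.get("TRACK_DATA_DIR", defaults["data_dir"])
    api_url = os.environ.get("TRACK_API_URL", defaults["api_url"])
    return data_dir, api_url
```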

`submit_track_templates.py` script constructs and submits track payloads to Track API `POST` endpoint to add new tracks.
It uses the yaml templates in `/templates` to fill the payload fields. You specify the submitted track types (templates) with `--template` parameter,
or by pointing it to a data directory (by setting `TRACK_API_DIR` environment variable). In the latter case the script tries to sync the data directory
to Track API by matching each datafile (with `.bb`, `.bw` extension) with corresponding template file: a template filename must match or start with the datafile name to be submitted.
The datafile parent directory names are used to fill in the genome UUID field (override via `--genome` param).
### Track template selection and filling
The script infers the list of tracks to submit for each genome from the datafile names in the data dir, matching each datafile name (with a `.bb` or `.bw` extension) to the corresponding template filenames as follows (sketched below):
- exact match: single template per datafile (e.g. `gc` track)
- template name starts with datafile name: multiple tracks per datafile (e.g. 4 gene tracks from `transcripts.bb`)
- datafile name starts with template name: fallback template (e.g. `variant.yaml` template is used for all `variant*.bb` datafiles that don't have exact template match like `variant-eva-details`)
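
A minimal sketch of the three matching rules above (illustrative only; the helper name and structure are assumptions, not taken from `submit_tracks.py`):

```python
# Illustrative matching of datafiles to templates; names are assumptions.
from pathlib import Path


def templates_for_datafile(datafile: Path, template_names: list[str]) -> list[str]:
    """Return the template names that apply to one .bb/.bw datafile."""
    stem = datafile.stem  # e.g. "transcripts" for "transcripts.bb"
    # Rules 1 and 2: exact match, or template name starts with the datafile name
    # (e.g. several gene templates for transcripts.bb).
    matches = [t for t in template_names if t == stem or t.startswith(stem)]
    if matches:
        return matches
    # Rule 3 (fallback): datafile name starts with a template name,
    # e.g. a hypothetical variant-eva.bb falls back to the generic "variant" template.
    return [t for t in template_names if stem.startswith(t)]
```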

The datafile directory path can be replaced with an explicit list of template and/or datafile names via `--templates` or `--files`. Note that in this case the script doesn't check the presence of datafiles.

In most cases the track template (i.e. the track type, derived from the datafile name) defines all the necessary fields in the track payload. The fields/values in the template (e.g. track label, category, description etc.) can be changed by updating the template in the GitHub repo. An exception is the description field for gene and variation tracks, which varies by species and is populated at track submission time from the Metadata DB (see `get_gene_track_desc.py`) or a CSV file (`/templates/variant-track-desc.csv`).
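
A hedged sketch of how a species-specific description override might be applied on top of a YAML template; the CSV column names follow the dump produced by the old `get_analysis_descs_mvp.py`, while the payload keys are assumptions rather than the Track API schema:

```python
# Illustrative only: merge a YAML template with a per-genome description override.
import csv
from pathlib import Path

import yaml  # PyYAML, listed in utils/requirements.txt


def load_desc_overrides(csv_path: Path) -> dict[str, str]:
    """Map genome UUID -> description text from an override CSV."""
    with csv_path.open(newline="", encoding="utf-8") as fh:
        return {row["Genome_UUID"]: row["Description"] for row in csv.DictReader(fh)}


def build_payload(template_path: Path, genome_uuid: str, overrides: dict[str, str]) -> dict:
    """Start from the YAML template and patch in the species-specific description."""
    payload = yaml.safe_load(template_path.read_text())
    payload["genome_id"] = genome_uuid  # assumed field name
    if genome_uuid in overrides:
        payload["description"] = overrides[genome_uuid]
    return payload
```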

Example:
```bash
export TRACK_API_URL=http://localhost:8000 # target track API
export TRACK_DATA_DIR=/Users/Alice/datafiles # track datafiles
./utils/submit_track_templates.py #submit tracks for all datafiles in data dir
export TRACK_DATA_DIR=/Users/Alice/datafiles # override track datafiles location
export TRACK_API_URL=http://localhost:8000 # override target track API URL
./utils/submit_tracks.py --release 5 # submit all tracks for this release to the local endpoint
```
For more detailed instructions for production, refer to [ENSWEBSOPS-171](https://www.ebi.ac.uk/panda/jira/browse/ENSWEBSOPS-171).


#### Data sources
Data submission script combines input from multiple sources:
| Source | Where | Why |
|-------|-------|------|
| Templates | `/templates/*.yaml` | Track payload base |
| Overrides | `/templates/*.csv` | Species-specific values |
| Datafiles | Input data dir | Specify species/tracks to submit |

- The datafiles source can be replaced/modified with command-line params (`-g`, `-t`)
- The gene track overrides (`gene-track-desc.csv`) were generated/updated from the Metadata API with the `utils/get_analysis_descs.py` script; `variation-track-desc.csv` was converted from input JSON handed over from the Variation team.

### Legacy scripts
Kept for historical record, safe to delete.
- import_data.py: legacy data importer script; useful as an example of updating tracks database through Django datamodels
- grant_access.sh: useful when read/write database users don't have a shared db schema
- submit_tracks_241: initial track submission script
For more detailed instructions for running the track loading script, refer to [ENSWEBSOPS-171](https://www.ebi.ac.uk/panda/jira/browse/ENSWEBSOPS-171).
Empty file removed utils/__init__.py
92 changes: 31 additions & 61 deletions utils/get_analysis_descs_mvp.py → utils/get_gene_track_desc.py
@@ -3,8 +3,9 @@
out of core DBs to feed into the new genome browser (MVP/Beta).

Species scope is determined by appropriate choice of the 'metadata DB'
and the RELEASE ID. This latter one must be given by Production.
E.g. RELEASE_ID = 3 represents 'beta-3'
and the 'release ID'. The latter must be given by Automation.
E.g. release_id = 3 represents 'beta-3'
The species list can optionally be narrowed down with genome IDs.

WARNING: it contains hardcoded data. Barely tested.
USE AT YOUR OWN RISK!
@@ -14,20 +15,15 @@

from dataclasses import dataclass
from string import Template
import csv
from mysql.connector import connection
from mysql.connector.connection import MySQLConnection

# CHECK cfg below with Production!!!
# CHECK cfg below with Automation!!!
META_HOST = "mysql-ens-production-1"
META_PORT = 4721
META_DB = "ensembl_genome_metadata"
RELEASE_ID = 4 # this is a magic number known by Production

HOST = "mysql-ens-sta-6.ebi.ac.uk"
PORT = 4695
CSV_DELIMITER = ","
OUTFILENAME = "gene-track-desc-mvp.csv"


@dataclass
class SrcInfo:
@@ -38,8 +34,8 @@ class SrcInfo:
is_ensembl_anno: bool


def get_metadb_connection(username="ensro", password=""):
return connection.MySQLConnection(
def get_metadb_connection(username="ensro", password="") -> MySQLConnection:
return MySQLConnection(
user=username,
password=password,
host=META_HOST,
@@ -48,22 +44,27 @@ def get_connection(username, password, dbname=None):
)


def get_connection(username, password, dbname=None):
def get_connection(username: str, password: str, dbname: str | None = None) -> MySQLConnection:
if dbname is None:
return connection.MySQLConnection(
return MySQLConnection(
user=username, password=password, host=HOST, port=PORT
)
return connection.MySQLConnection(
return MySQLConnection(
user=username, password=password, host=HOST, port=PORT, database=dbname
)


def get_ensro_connection(dbname=None):
def get_ensro_connection(dbname: str | None = None):
return get_connection(username="ensro", password="", dbname=dbname)


def get_dbs(conx):
def get_dbs(conx, release: int, genomes: list[str] | None = None) -> list:
cursor = conx.cursor()
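# Optional filter: restrict the query to an explicit list of genome UUIDs;
# when no list is given, all genomes in the requested release are returned.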
if genomes is None:
genomes_str = ""
else:
genomes_str = ",".join([f"'{uuid}'" for uuid in genomes])
genomes_str = f"and g.genome_uuid in ({genomes_str})"
t = Template(
"""select g.production_name,g.genome_uuid,dss.name from genome g
join genome_release gr using(genome_id)
@@ -73,14 +74,15 @@ def get_dbs(conx):
where gr.release_id = $release_id
and gd.release_id = $release_id
and ds.dataset_type_id = 2
$genomes_str
"""
)
cursor.execute(t.substitute(release_id=RELEASE_ID))
cursor.execute(t.substitute(release_id=release, genomes_str=genomes_str))
dbs = cursor.fetchall()
return dbs


def get_analysis_src_info(conx, dbname) -> SrcInfo:
def get_analysis_src_info(conx: MySQLConnection, dbname: str) -> SrcInfo:
cursor = conx.cursor()

t = Template(
@@ -100,57 +102,25 @@ def get_analysis_src_info(conx, dbname) -> SrcInfo:
return SrcInfo(source_name=r[2], source_url=r[3], is_ensembl_anno=is_ensembl_anno)


def dump_data(data: list[dict[str, str]], filename: str = OUTFILENAME) -> None:
with open(filename, "w", newline="", encoding="utf-8") as csvfile:
fieldnames = [
"Species",
"Genome_UUID",
"DB_name",
"Source_name",
"Source_URL",
"Description",
]
writer = csv.DictWriter(
csvfile,
fieldnames=fieldnames,
delimiter=CSV_DELIMITER,
quoting=csv.QUOTE_MINIMAL,
dialect="unix",
)
writer.writeheader()
for item in data:
writer.writerow(item)


def main():
def main(release: int, genomes: list[str] | None = None) -> dict[str, dict]:
conx = get_metadb_connection()
dbs = get_dbs(conx)
dbs = get_dbs(conx, release=release, genomes=genomes)
conx.close()

conx = get_ensro_connection()
descriptions = {}

descriptions = []

print(f"Found {len(dbs)} genomes{f' (out of {len(genomes)} requested)' if genomes else ''} for release {release}.")
for db in dbs:
dbname = db[2]
print(f"Working on: {dbname}")
#print(f"Working on: {dbname}")
src_info = get_analysis_src_info(conx, dbname)
ensembl_imported = "Annotated" if src_info.is_ensembl_anno else "Imported"
descriptions.append(
{
"Species": db[0],
"Genome_UUID": db[1],
"DB_name": dbname,
"Source_name": src_info.source_name,
"Source_URL": src_info.source_url,
"Description": ensembl_imported,
}
)
descriptions[db[1]] = {
"source_names": [src_info.source_name],
"source_urls": [src_info.source_url],
"description": ensembl_imported,
}

conx.close()

dump_data(descriptions)


if __name__ == "__main__":
main()
return descriptions
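
Since the module no longer writes a CSV or runs as a standalone script, here is a hedged example of how the refactored `main()` might be consumed from a loader script; the import path and the genome UUID are placeholders, only the return structure (`source_names`, `source_urls`, `description`) follows the code above:

```python
# Illustrative only: consume the refactored helper from another script.
from get_gene_track_desc import main as fetch_gene_track_descs  # assumes utils/ is on sys.path

descs = fetch_gene_track_descs(release=5, genomes=["<genome-uuid>"])
for genome_uuid, info in descs.items():
    print(genome_uuid, info["description"], info["source_names"])
```
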
4 changes: 0 additions & 4 deletions utils/grant_access.sh

This file was deleted.

103 changes: 0 additions & 103 deletions utils/import_data.py

This file was deleted.

4 changes: 4 additions & 0 deletions utils/requirements.txt
@@ -0,0 +1,4 @@
mysql-connector-python>=9.0
PyYAML>=6.0
requests>=2.20
typing-extensions>=4.10