FordyceLab/TFDB


AlexKRotation: Notebooks and scripts for database collecting

Schema:

[Figure: Schema]

Quick Start:

This will allow you to build and clean the database completely from JSON files contained in this repo.

  1. Run database_builder.ipynb (changing the directory/name as you see fit).

Optional: These notebooks fill in missing protein sequences and Uniprot IDs, and fix some broken Uniprot IDs from JASPAR.

  1. Run missing_uniprot_ids.ipynb (changing directory)
  2. Run filling_in_sequences.ipynb (changing directory)
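The fill-in step can be pictured roughly as follows. This is only a sketch of the idea, not the notebook's actual code: the `protein` table, its `uniprot_id` and `sequence` columns, and the `fetch_sequence` callback are all hypothetical stand-ins (the real schema is in DB_setup.py), and the lookup is faked with a dictionary rather than a call to Uniprot.

```python
import sqlite3


def fill_missing_sequences(conn, fetch_sequence):
    """Fill empty sequences using fetch_sequence(uniprot_id) -> str or None.

    Returns the number of rows that were updated.
    """
    rows = conn.execute(
        "SELECT uniprot_id FROM protein "
        "WHERE uniprot_id IS NOT NULL AND (sequence IS NULL OR sequence = '')"
    ).fetchall()
    updated = 0
    for (uniprot_id,) in rows:
        seq = fetch_sequence(uniprot_id)
        if seq:  # leave the row untouched if nothing was found
            conn.execute(
                "UPDATE protein SET sequence = ? WHERE uniprot_id = ?",
                (seq, uniprot_id),
            )
            updated += 1
    conn.commit()
    return updated


# In-memory demo with a hypothetical table and made-up IDs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (uniprot_id TEXT PRIMARY KEY, sequence TEXT)")
conn.executemany(
    "INSERT INTO protein VALUES (?, ?)",
    [("ID_A", None), ("ID_B", "MKV"), ("ID_C", "")],
)

# Stand-in for a real Uniprot lookup: only ID_A can be resolved here.
fake_lookup = {"ID_A": "MSTN"}
n_updated = fill_missing_sequences(conn, fake_lookup.get)
sequences = dict(conn.execute("SELECT uniprot_id, sequence FROM protein"))
```

Passing the lookup in as a function keeps the database logic separate from the network call, so the same loop can be exercised offline before pointing it at a real service.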

Adding Data:

  1. Parse your data into a list of dictionaries, a dataframe, or some other usable Python data format, with each experiment/PWM as one record.
  2. Use SQLAlchemy to add the data to the database, checking for redundancies. If you are unfamiliar with SQLAlchemy, look at the section of database_builder.ipynb starting at the comment '#now on to uniprobe' for an example of checking for redundancies and adding new elements.
  3. Alternatively, if you know how to add data directly using SQL, R, or another language, do that. The vast majority of fields are not required. To find out which are, look at DB_setup.py to see which elements are primary keys and are not autoincremented.
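As a rough illustration of the direct-SQL route in step 3, the sketch below inserts records through Python's built-in sqlite3 module and skips any record whose key is already present. The `protein` table and its columns are hypothetical simplifications (the real tables and required fields are defined in DB_setup.py), the IDs are made up, and an in-memory database is used so nothing touches tf.db.

```python
import sqlite3

# Hypothetical, simplified table; the real schema lives in DB_setup.py.
SCHEMA = """
CREATE TABLE protein (
    uniprot_id TEXT PRIMARY KEY,
    gene_name  TEXT,
    sequence   TEXT
)
"""


def add_proteins(conn, records):
    """Insert each record unless its uniprot_id already exists.

    Returns the number of rows actually added.
    """
    added = 0
    for rec in records:
        exists = conn.execute(
            "SELECT 1 FROM protein WHERE uniprot_id = ?", (rec["uniprot_id"],)
        ).fetchone()
        if exists:
            continue  # redundant entry, skip it
        conn.execute(
            "INSERT INTO protein (uniprot_id, gene_name, sequence) VALUES (?, ?, ?)",
            (rec["uniprot_id"], rec.get("gene_name"), rec.get("sequence")),
        )
        added += 1
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# One dictionary per record, as in step 1; optional fields may be omitted.
records = [
    {"uniprot_id": "ID_A", "gene_name": "geneA"},
    {"uniprot_id": "ID_B", "gene_name": "geneB", "sequence": "MKV"},
    {"uniprot_id": "ID_A", "gene_name": "geneA"},  # duplicate of the first record
]
n_added = add_proteins(conn, records)
n_rows = conn.execute("SELECT COUNT(*) FROM protein").fetchone()[0]
```

To run this against the real database, one would connect to tf.db instead of `:memory:` and replace the table and column names with those from DB_setup.py.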

Description of Files

  • DB_setup.py: Current database schema created using SQLAlchemy's ORM.

  • Final_project_writeup.docx: Slightly outdated writeup on database, including outdated entity-relationship model and summary stats.

  • database_builder.ipynb: Jupyter notebook to generate the database from the jaspar_success.json and uniprobe_final.json files. If rebuilding from scratch, you should still use this notebook (modified to use the new JSONs), since it accurately generates the relationships.

  • filling_in_sequences.ipynb: Searches the database for proteins that have missing sequences and tries to get them from Uniprot using their Uniprot IDs, and updates the DB.

  • jaspar_db_to_dict.py: If one has JASPAR built as a MySQL-style database on their computer, this will convert it into a JSON file that can be put into database_builder.

  • jaspar_success.json: Most recent/complete JASPAR JSON file.

  • missing_uniprot_ids.ipynb: Jupyter notebook that searches for missing Uniprot IDs in the database, finds them, and updates the DB.

  • query1.sql: SQL query used in uniprobe_db_to_dict.py. Gets clone, sequence, (uniprobe internal) publication id, species, gene name, and gene mutant name (sometimes has information about partial constructs/mutations).

  • query2_pfam.sql: SQL query used in uniprobe_db_to_dict.py. Some proteins have Pfam information in Uniprobe; this query gets the DNA binding domain sequence and type for these (as well as the Pfam ID, which is not currently used).

  • query_dbds.sql: SQL query to get DNA binding domain sequences for some more proteins in Uniprobe. It only covers some of the mouse proteins.

  • query_publication_ids.sql: SQL query to get uniprobe publication IDs and the PWM folders associated with each publication.

  • search_uniprot_ids.py: Contains the get_uniprot_ids function, which searches Uniprot for a gene name and an optional species name to get Uniprot IDs.

  • tf.db: The current iteration of the database (SQLite).

  • uniprobe_db_to_dict.py: If one has Uniprobe built as a MySQL-style database on their computer, this will convert it into a JSON file that can be put into database_builder.

  • uniprobe_final.json: Most recent/complete Uniprobe JSON file.

  • uniprobe_pubmed_ids.json: Contains a dictionary mapping Uniprobe internal publication IDs to Pubmed IDs.

  • uniprobe_pwm_parser.py: Contains the function pwm_parser used in uniprobe_db_to_dict.py, which reads through Uniprobe's PWM files and gets probability matrices, preferably ones generated using BEEML.

  • uniprot_domains_sequences.py: Contains the functions get_domains and get_sequences, which take Uniprot IDs and get DNA binding domain information and protein sequences, respectively, from the EMBL-EBI Proteins API: https://www.ebi.ac.uk/proteins/api/doc/#/
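To give a feel for what reading a PWM file into a probability matrix involves, here is a minimal sketch. It is not the pwm_parser function from uniprobe_pwm_parser.py: the function name `parse_pwm` and the input layout (one row per base, written as `A:` followed by whitespace-separated numbers) are assumptions made for illustration, and real Uniprobe files may differ.

```python
BASES = ("A", "C", "G", "T")


def parse_pwm(text):
    """Parse rows like 'A: 2 0 1' and return per-base column probabilities.

    Each column is normalized so that the four bases sum to 1.
    """
    rows = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip headers, comments, and blank lines
        label, values = line.split(":", 1)
        base = label.strip().upper()
        if base in BASES:
            rows[base] = [float(v) for v in values.split()]

    missing = [b for b in BASES if b not in rows]
    if missing:
        raise ValueError(f"missing rows for bases: {missing}")
    lengths = {len(rows[b]) for b in BASES}
    if len(lengths) != 1:
        raise ValueError("rows have different lengths")

    width = lengths.pop()
    pwm = {b: [] for b in BASES}
    for i in range(width):
        total = sum(rows[b][i] for b in BASES)
        if total <= 0:
            raise ValueError(f"column {i} sums to zero")
        for b in BASES:
            pwm[b].append(rows[b][i] / total)
    return pwm


# Made-up two-position matrix given as counts.
example = """
Example motif
A: 2 0
C: 2 0
G: 0 4
T: 0 0
"""
pwm = parse_pwm(example)
```

Normalizing per column means the same function accepts either raw counts or values that are already probabilities, which leaves them unchanged when each column sums to 1.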
