Schema:
Quick Start:
These steps build and clean the database entirely from the JSON files contained in this repo.
- Run database_builder.ipynb (changing the directory/name as you see fit).
Optional: These notebooks fill in missing protein sequences and UniProt IDs, and fix some broken UniProt IDs from JASPAR.
- Run missing_uniprot_ids.ipynb (changing the directory)
- Run filling_in_sequences.ipynb (changing the directory)
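After the notebooks finish, one quick sanity check is to list the tables in the resulting SQLite file and count their rows. The sketch below uses only Python's standard sqlite3 module. The helper name summarize_tables is hypothetical (it is not part of this repo), and the path in the usage comment is just an example, so point it at whatever directory/name you chose.

```python
import sqlite3


def summarize_tables(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Names come straight from sqlite_master, so quoting them is enough here.
            counts[name] = conn.execute(
                f'SELECT COUNT(*) FROM "{name}"'
            ).fetchone()[0]
        return counts
    finally:
        conn.close()


# Example usage (path is an assumption; use your own):
# for table, n in summarize_tables("tf.db").items():
#     print(f"{table}: {n} rows")
```

An empty result, or a table with zero rows, usually means the builder notebook was pointed at the wrong directory.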
Adding Data:
- Parse your data into a list of dictionaries, a dataframe, or some other usable data format in Python, with one record per experiment/PWM.
- Use SQLAlchemy to add the data to the database, checking for redundancies. If you are unfamiliar with SQLAlchemy, see the section of database_builder.ipynb starting at the comment '#now on to uniprobe' for an example of checking for redundancies and adding new elements.
- Alternatively, if you know how to add data directly using SQL, R, or another language, do that. The vast majority of fields are not required. To find out which are, look in DB_setup.py for the elements that are primary keys and are not autoincremented.
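For the direct-SQL route, the sketch below shows one way to find the required columns and to insert records without creating duplicates. It runs against an in-memory database with a made-up protein table, because the real table and column names live in DB_setup.py. The helpers required_columns and add_if_new, the column names, and the accession-like IDs are all hypothetical placeholders.

```python
import sqlite3


def required_columns(conn, table):
    """Columns that are NOT NULL or part of the primary key."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    # Note: a single INTEGER PRIMARY KEY column is filled in automatically by
    # SQLite, so it can show up here without actually being required.
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [name for _, name, _, notnull, _, pk in rows if notnull or pk]


def add_if_new(conn, record):
    """Insert a record unless its uniprot_id is already present."""
    exists = conn.execute(
        "SELECT 1 FROM protein WHERE uniprot_id = ?", (record["uniprot_id"],)
    ).fetchone()
    if exists:
        return False
    conn.execute(
        "INSERT INTO protein (uniprot_id, gene_name, sequence) VALUES (?, ?, ?)",
        (record["uniprot_id"], record.get("gene_name"), record.get("sequence")),
    )
    return True


# Hypothetical table standing in for one defined in DB_setup.py.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein (uniprot_id TEXT PRIMARY KEY, gene_name TEXT, sequence TEXT)"
)

records = [
    {"uniprot_id": "X00001", "gene_name": "geneA"},
    {"uniprot_id": "X00002", "gene_name": "geneB"},
    {"uniprot_id": "X00001", "gene_name": "geneA"},  # redundant, should be skipped
]
added = [add_if_new(conn, r) for r in records]
conn.commit()
```

Checking before inserting, rather than relying on a constraint error, mirrors the check-then-add approach described above and lets optional fields stay empty.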
- DB_setup.py: Current database schema, created using SQLAlchemy's ORM.
- Final_project_writeup.docx: Slightly outdated writeup on the database, including an outdated entity-relationship model and summary stats.
- database_builder.ipynb: Jupyter notebook that generates the database from the jaspar_success.json and uniprobe_final.json files. If rebuilding from scratch, still use this notebook (modified to use the new JSONs), since it accurately generates the relationships.
- filling_in_sequences.ipynb: Searches the database for proteins with missing sequences, tries to get them from UniProt using their UniProt IDs, and updates the DB.
- jaspar_db_to_dict.py: If you have JASPAR built as a MySQL-style database on your computer, this converts it into a JSON file that can be fed into database_builder.ipynb.
- jaspar_success.json: Most recent/complete JASPAR JSON file.
- missing_uniprot_ids.ipynb: Jupyter notebook that searches the database for missing UniProt IDs, finds them, and updates the DB.
- query1.sql: SQL query used in uniprobe_db_to_dict.py. Gets clone, sequence, (UniPROBE-internal) publication ID, species, gene name, and gene mutant name (which sometimes has information about partial constructs/mutations).
- query2_pfam.sql: SQL query used in uniprobe_db_to_dict.py. Some proteins have Pfam information in UniPROBE; this query gets the DNA binding domain sequence and type for these (as well as the Pfam ID, which is not currently used).
- query_dbds.sql: SQL query to get DNA binding domain sequences for some more proteins in UniPROBE. It only covers some mouse proteins, though.
- query_publication_ids.sql: SQL query to get UniPROBE publication IDs and the PWM folders associated with each publication.
- search_uniprot_ids.py: Contains the get_uniprot_ids function, which searches UniProt for a gene name and an optional species name to get UniProt IDs.
- tf.db: The current iteration of the database (SQLite).
- uniprobe_db_to_dict.py: If you have UniPROBE built as a MySQL-style database on your computer, this converts it into a JSON file that can be fed into database_builder.ipynb.
- uniprobe_final.json: Most recent/complete UniPROBE JSON file.
- uniprobe_pubmed_ids.json: Contains a dictionary mapping UniPROBE-internal publication IDs to PubMed IDs.
- uniprobe_pwm_parser.py: Contains the function pwm_parser used in uniprobe_db_to_dict.py, which reads through UniPROBE's PWM files and gets probability matrices, preferably those generated using BEEML.
- uniprot_domains_sequences.py: Contains the functions get_domains and get_sequences, which use UniProt IDs to get DNA binding domain information and protein sequences, respectively, from EMBL's Proteins API: https://www.ebi.ac.uk/proteins/api/doc/#/
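
To give a feel for what a Proteins API lookup involves, here is a minimal sketch that is independent of the repo's own get_domains and get_sequences. The function names are hypothetical, and the JSON layout it relies on (a nested sequence.sequence string, and features entries with type, begin, and end) is an assumption that should be checked against the API documentation linked above.

```python
import json
import urllib.request

# Assumed entry endpoint of the Proteins API; verify against the API docs.
API_ROOT = "https://www.ebi.ac.uk/proteins/api/proteins/"


def fetch_entry(uniprot_id):
    """Download one protein entry as a dict (requires network access)."""
    request = urllib.request.Request(
        API_ROOT + uniprot_id, headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_sequence(entry):
    """Return the amino-acid sequence string, or None if it is absent."""
    return entry.get("sequence", {}).get("sequence")


def extract_dna_binding(entry):
    """Return (description, subsequence) pairs for DNA_BIND features."""
    sequence = extract_sequence(entry) or ""
    regions = []
    for feature in entry.get("features", []):
        if feature.get("type") != "DNA_BIND":
            continue
        try:
            # Positions are 1-based and inclusive; skip uncertain/missing ones.
            begin, end = int(feature["begin"]), int(feature["end"])
        except (KeyError, TypeError, ValueError):
            continue
        regions.append((feature.get("description", ""), sequence[begin - 1:end]))
    return regions


# Example usage (needs network; replace the placeholder with a real UniProt ID):
# entry = fetch_entry("<uniprot_id>")
# print(extract_sequence(entry), extract_dna_binding(entry))
```

Keeping the download separate from the parsing means the parsing functions can be tried on a saved JSON response without hitting the API.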
