AlexKRotation: Notebooks and scripts for database collecting

Schema: see schema.png in this repo.

Quick Start:

This will allow you to build and clean the database completely from JSON files contained in this repo.

Run database_builder.ipynb (changing the directory/name as you see fit).

Optional: These notebooks fill in missing protein sequences and Uniprot IDs, and fix some broken Uniprot IDs from JASPAR.

Run missing_uniprot_IDs.ipynb (changing the directory as needed).
Run filling_in_sequences.ipynb (changing the directory as needed).

Adding Data:

Parse your data into a list of dictionaries, a dataframe, or some other usable Python data format, with each experiment/PWM as one record.
Use SQLAlchemy to add the data to the database, checking for redundancies. If you are unfamiliar with SQLAlchemy, look at the section of database_builder.ipynb starting at the comment '#now on to uniprobe' for an example of checking for redundancies and adding new elements.
Alternatively, if you know how to add data directly using SQL or R or another language, do that. The vast majority of fields are not required. If you want to know which are, look at DB_setup.py to see which elements are primary keys and are not autoincremented.
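
The redundancy check described above can be sketched as follows. This is a minimal illustration that uses Python's standard-library sqlite3 module rather than SQLAlchemy, and it runs against an in-memory database instead of tf.db. The table name protein, its columns, and the helper add_proteins are hypothetical; check DB_setup.py for the real table and column names before adapting it.

```python
import sqlite3


def add_proteins(conn, records):
    """Insert each record unless its uniprot_id already exists; return the number added."""
    added = 0
    for record in records:
        # Redundancy check: skip anything already stored under this key.
        exists = conn.execute(
            "SELECT 1 FROM protein WHERE uniprot_id = ?",
            (record["uniprot_id"],),
        ).fetchone()
        if exists:
            continue
        # Optional fields default to NULL when the record lacks them.
        conn.execute(
            "INSERT INTO protein (uniprot_id, gene_name, species) VALUES (?, ?, ?)",
            (record["uniprot_id"], record.get("gene_name"), record.get("species")),
        )
        added += 1
    conn.commit()
    return added


# Hypothetical stand-in schema; the real one is defined in DB_setup.py.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein (uniprot_id TEXT PRIMARY KEY, gene_name TEXT, species TEXT)"
)

# Parsed input: one dictionary per record, with a deliberate duplicate.
records = [
    {"uniprot_id": "P04637", "gene_name": "TP53", "species": "Homo sapiens"},
    {"uniprot_id": "P04637", "gene_name": "TP53", "species": "Homo sapiens"},
    {"uniprot_id": "P01106", "gene_name": "MYC"},
]

first_pass = add_proteins(conn, records)   # duplicate within the list is skipped
second_pass = add_proteins(conn, records)  # everything is already present
```

Running the same list twice adds nothing the second time, which is the behavior you want when re-importing a source that overlaps with existing data.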

Description of Files

DB_setup.py: Current database schema created using SQLAlchemy's ORM.
Final_project_writeup.docx: Slightly outdated writeup on database, including outdated entity-relationship model and summary stats.
database_builder.ipynb: Jupyter notebook to generate the database from the jaspar_success.json and uniprobe_final.json files. If rebuilding from scratch, you should still use this notebook (modified to use the new JSONs), since it accurately generates the relationships.
filling_in_sequences.ipynb: Searches the database for proteins that have missing sequences and tries to get them from Uniprot using their Uniprot IDs, and updates the DB.
jaspar_db_to_dict.py: If one has JASPAR built as a MySQL-style database on their computer, this will convert it into a JSON file that can be put into database_builder.
jaspar_success.json: Most recent/complete JASPAR JSON file.
missing_uniprot_IDs.ipynb: Jupyter notebook that searches for missing Uniprot IDs in the database, finds them, and updates the DB.
query1.sql: SQL query used in uniprobe_db_to_dict.py. Gets clone, sequence, (uniprobe internal) publication id, species, gene name, and gene mutant name (sometimes has information about partial constructs/mutations).
query2_pfam.sql: SQL query used in uniprobe_db_to_dict.py. Some proteins have Pfam information in Uniprobe; this query gets the DNA binding domain sequence and type for these (as well as the Pfam ID, which is not currently used).
query_dbds.sql: SQL query to get DNA binding domain sequences for some more proteins in Uniprobe. It only covers some of the mouse proteins.
query_publication_ids.sql: SQL query to get uniprobe publication IDs and the PWM folders associated with each publication.
search_uniprot_ids.py: Contains the get_uniprot_ids function, which searches Uniprot for a gene name and an optional species name to get Uniprot IDs.
tf.db: The current iteration of the database (SQLite).
uniprobe_db_to_dict.py: If one has Uniprobe built as a MySQL-style database on their computer, this will convert it into a JSON file that can be put into database_builder.
uniprobe_final.json: Most recent/complete Uniprobe JSON file.
uniprobe_pubmed_ids.json: Contains a dictionary mapping Uniprobe internal publication IDs to Pubmed IDs.
uniprobe_pwm_parser.py: Contains the function pwm_parser used in uniprobe_db_to_dict.py, which is used to read through Uniprobe's PWM files and get probability matrices, preferably generated using BEEML.
uniprot_domains_sequences.py: Contains the functions get_domains and get_sequences, which take Uniprot IDs and return DNA binding domain information and protein sequences, respectively, via the Proteins API from EMBL: https://www.ebi.ac.uk/proteins/api/doc/#/
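
For orientation, a sequence lookup against the Proteins API might look roughly like the sketch below. It is not the repo's get_sequences implementation. The function names fetch_sequence and extract_sequence are hypothetical, and both the /proteins/{accession} endpoint path and the response layout (the sequence nested under a "sequence" key) are assumptions to verify against the API documentation linked above.

```python
import json
import urllib.request

# Assumed endpoint layout; confirm against the Proteins API documentation.
PROTEINS_API = "https://www.ebi.ac.uk/proteins/api/proteins/{accession}"


def extract_sequence(record):
    """Return the amino-acid string from one decoded API record, or None if absent.

    Assumes the record has the shape {"sequence": {"sequence": "MKT..."}}.
    """
    return record.get("sequence", {}).get("sequence")


def fetch_sequence(accession, timeout=30):
    """Request one entry as JSON and return its sequence (needs network access)."""
    request = urllib.request.Request(
        PROTEINS_API.format(accession=accession),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_sequence(json.load(response))
```

Keeping the parsing step in its own function means it can be checked against a saved response without calling the service.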
