Repository for the development of the DNA Sequence Extension for Postgres as part of the 2024 INFO-H417 Database System Architecture project. This repository contains the extension template and scripts used to create the extension as well as the datasets and sql functions to test it out.
Check the final presentation
Group Members:
Professor: Mahmoud Sakr
The extension's template. Contains both a c and an sql file with the extension's data types, functions and operators as well as the Makefile necesarry for its installation.
- dna_sequence--1.0.sql: SQL script used to create the extension.
- dna_sequence.h: Header file for the extension (used by both C files).
- dna_sequence.c: C source code for data types and functions used to create the extension.
- spg_kmer.c: C source code for index implementation.
- Makefile: Makefile used to compile the extension.
- dna_sequence.control: Control file for the extension.
Collection of Python scripts used for data generation.
- dna_db_generator.py: Python script used to parse through a collection of .fastq files downloaded using the NCBI SRA Toolkit and create usable collections of real world data for testing the extension. Returns both .csv and .sql files for ease of data import.
- unique_dna_kmers_generation.py: Python script for generating additional K-mer sequences based on the DNA sequences obtained from the .fastq files. This was used because we needed a large amount of K-mer objects to properly test index functionality and the real world data did not naturally provide enough K-mer sequences by itself. Requires the existence of a DNA.csv file.
- files:
- fastqs: Directory to store .fastq files obtained from the NCBI database containing real world sequences. The link to the .zip file containing a small collection of these files for testing is found in the data generation instructions.
- csv_files: Directory for .csv files created by the generator script.
- sql_inserts: Directory for .sql files that can be run to create, and populate a database of each file type.
- dna_databases.sql: .sql file for creating the three empty tables that can be populated using the .csv or .sql files created by the script.
Collection of queries used for validating the proper functionality of the extension as per the project requirement.
This project relies on the user having already installed PostgreSQL and being able to use PLQL commands.
Follow these steps to properly install the extension:
-
Clone the github directory to the desired location.
-
Open a terminal and go to the dna-sequence-postgres-extension directory.
-
Run
make
command to compile the extension.
⚠️ Attention: In the latest versions of Postgres, when running make, the following error will be shown:spg_kmer.c:123:20: error: ‘VARHDRSZ_SHORT’ undeclared (first use in this function) if (datalen + VARHDRSZ_SHORT <= VARATT_SHORT_MAX) ...
In this case, modify the
spg_kmer.c
by adding#include "varatt.h"
at the beginning of the file. -
Run
sudo make install
command to install the extension.
- Connect to the desired PostgreSQL server and run:
CREATE EXTENSION dna_sequence;
to create the extension.
If the steps above are followed correctly the PLSQL terminal should show a CREATE EXTENSION message.
The data generation tool uses .fastq files stored in the fastqs directory to generate the proper .csv and .sql files. A small collection of these files can be found in the here. The script also requires a working Python installation with the Pandas library installed.
- Unzip the downloaded file and place its contents in the dna_db_generator/files/fastqs directory.
- Run dna_db_generator.py to create three distinct .csv files in the csv_files directory and three distinct .sql files in the sql_files director.
- Depending on the user the resulting data can be imported into the desired schema in the following manners:
- CSV Files: Execute the dna_databases.sql script to create the required tables and then use a DBMS like pgAdmin to import the three files created in the csv_files directory.
- SQL Files: Run each file found in the sql_files directory. Each .sql file creates the proper table and then populates the table through a series of value inserts.
- In case that more K-mers are needed for index testing, run unique_dna_kmers_generation.py which will utilize the dna.csv file to generate new unique K-mers in a .csv file.
test_queries.sql contains a series of different queries that can be executed in order to test the proper functionality of the extension. These queries range from simple data type tests to validate length and character constraints to function tests to validate proper implementation of the required functions. Each one shows the expected output based on what was generated when running the queries with our test database. Additionally the file contains a test of the index functionality with a large collection of K-mer values.