qtr-fingerprint

Project Structure

Core components

  • cpp/preprocessing - Code for data preprocessing.
  • cpp/build_db - Code for building search index.
  • cpp/run_db - Code for executing queries on the search index.
  • cpp/qtrlib - Main library containing essential components for the Qtr algorithm for substructure search.

Additional Resources

In addition to the core components, we provide supplementary folders for experimental purposes:

Console Applications

This project provides several console applications for working with a molecular dataset. There are three types of databases:

  • QtrRam - A Qtr search index that loads data into RAM when working with the database.
  • BingoNoSQL - A database that uses the Bingo NoSQL search index.
  • QtrDrive - (Not implemented yet) A Qtr search index that stores data on a hard drive using memory mapping, allowing the algorithm to work under limited memory conditions.

The Qtr databases also support molecular property-based search. The following molecular properties are supported:

  • PUBCHEM_COMPONENT_COUNT
  • PUBCHEM_XLOGP3
  • PUBCHEM_ATOM_UDEF_STEREO_COUNT
  • PUBCHEM_HEAVY_ATOM_COUNT
  • PUBCHEM_CACTVS_TAUTO_COUNT
  • PUBCHEM_ISOTOPIC_ATOM_COUNT
  • PUBCHEM_CACTVS_HBOND_DONOR
  • PUBCHEM_CACTVS_ROTATABLE_BOND
  • PUBCHEM_MONOISOTOPIC_WEIGHT
  • PUBCHEM_CACTVS_HBOND_ACCEPTOR
  • PUBCHEM_ATOM_DEF_STEREO_COUNT
  • PUBCHEM_COMPOUND_CID
  • PUBCHEM_MOLECULAR_WEIGHT
  • PUBCHEM_BOND_DEF_STEREO_COUNT
  • PUBCHEM_TOTAL_CHARGE
  • PUBCHEM_EXACT_MASS
  • PUBCHEM_CACTVS_COMPLEXITY
  • PUBCHEM_BOND_UDEF_STEREO_COUNT
  • PUBCHEM_CACTVS_TPSA
  • PUBCHEM_COMPOUND_CANONICALIZED

preprocessing

Data Preprocessing: This component is responsible for generating fingerprints, molecule numbering, and more.

  • --properties - Set to 1 if properties should be considered during preprocessing; otherwise, set to 0.
  • --sourceDir - Path to the directory where source files are stored.
  • --destDir - Destination directory for storing preprocessed files.
  • --preprocessingType - Source file type (SDF or CSV).
    • CSV - If this preprocessing type is selected, tables in .csv format are expected in the sourceDir. The first column should contain the molecule ID, and the second column should contain the molecule in SMILES format. If the --properties flag is set, columns 2-22 should contain molecule property values.
    • SDF - Preprocessing of molecules in .sdf format (does not support --properties).
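As a rough illustration, a preprocessing run over SDF input might be launched as sketched below. The flag names come from the list above, the `--flag=value` spelling follows the arguments shown in Benchmarking and Research, and the binary location follows Build and run. Both directory paths are hypothetical placeholders.

```shell
# Hypothetical paths: adjust them to your checkout and data layout.
BIN_DIR=./cpp/cmake-build-release/bin   # where Build and run places executables
SOURCE_DIR=./raw_sdf                    # assumed: directory holding .sdf files
DEST_DIR=./preprocessed                 # assumed: output directory

# SDF input does not support --properties, so it is set to 0.
PREPROCESS_CMD="$BIN_DIR/preprocessing --preprocessingType=SDF --properties=0 --sourceDir=$SOURCE_DIR --destDir=$DEST_DIR"

# Show the command, then run it only if the binary has been built.
# (Relies on word splitting, so the paths must not contain spaces.)
echo "$PREPROCESS_CMD"
if [ -x "$BIN_DIR/preprocessing" ]; then
  $PREPROCESS_CMD
fi
```

The directory passed as --destDir is what build_db later expects as its --sourceDir.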

build_db

Search Index Construction: This component is responsible for building the search index.

  • --dbType - QtrDrive/QtrRam/BingoNoSQL.
  • --properties - Set to 1 if properties should be considered during preprocessing (only for QtrRam).
  • --dbName - Name of the search index.
  • --sourceDir - Folder with preprocessed files generated by the preprocessing program.
  • --destDirs - Folders where search index files should be saved. BingoNoSQL and QtrDrive use only the first path; for QtrRam, specifying several folders located on different hard drives can improve parallelism.
  • --otherDataDir - Folder where search index files that cannot be stored in parallel are saved (only for QtrRam).
  • --parallelizeDepth - Depth at which sub-trees of the search index should be built in parallel (affects construction speed, only for QtrRam and QtrDrive).
  • --treeDepth - Depth to which the tree should be built (relevant only for QtrRam and QtrDrive).
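A minimal sketch of building a QtrRam index from preprocessed files could look like the following. The database name and all directories are hypothetical, and the depth values are simply borrowed from the Benchmarking and Research section. Only one folder is passed to --destDirs because the separator for multiple folders is not documented here.

```shell
# Hypothetical names and paths: adjust them to your setup.
BIN_DIR=./cpp/cmake-build-release/bin
PREPROCESSED_DIR=./preprocessed   # assumed: output of the preprocessing program
INDEX_DIR=./index                 # assumed: single folder for index files
OTHER_DIR=./index_other           # assumed: folder for non-parallel index files
DB_NAME=example_db                # hypothetical index name

BUILD_CMD="$BIN_DIR/build_db --dbType=QtrRam --dbName=$DB_NAME --properties=0 --sourceDir=$PREPROCESSED_DIR --destDirs=$INDEX_DIR --otherDataDir=$OTHER_DIR --treeDepth=21 --parallelizeDepth=4"

# Show the command, then run it only if the binary has been built.
# (Relies on word splitting, so the paths must not contain spaces.)
echo "$BUILD_CMD"
if [ -x "$BIN_DIR/build_db" ]; then
  $BUILD_CMD
fi
```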

run_db

Working with the Search Index: This component is responsible for querying the search index.

  • --dbType, --properties, --dbName - See build_db.
  • --dataDirs - Folders where the generated search index files are stored.
  • --otherDataDir - Folder where search index files that cannot be stored in parallel are saved (only for QtrRam).
  • --threads - Number of threads for query execution.
  • --mode - Interactive / FromFile / Web
    • Interactive - Application for manual querying of the database through the console, suitable for simple manual tests.
    • FromFile - Application for benchmarking using molecules in SMILES format read from the file given by --queriesFile.
    • Web - Application for querying the database through a REST API.
  • --queriesFile - File with SMILES molecules for benchmarking (only for FromFile).
  • --ansCount - Maximum number of answers the application can return for a single query (if there are more answers than ansCount, some will be discarded).
  • --timeLimit - Time limit in seconds for executing a single query.
  • --verificationStage - True if the verification stage should be executed after screening; False if only the screening stage should be executed.
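The FromFile mode can be sketched as below. The queries file layout (one SMILES string per line) is an assumption, as are the file name, database name, and directories. The numeric limits are the ones listed in Benchmarking and Research.

```shell
# Hypothetical names and paths: adjust them to your setup.
BIN_DIR=./cpp/cmake-build-release/bin
INDEX_DIR=./index             # assumed: folder passed to build_db as --destDirs
OTHER_DIR=./index_other       # assumed: folder passed to build_db as --otherDataDir
DB_NAME=example_db            # hypothetical index name
QUERIES_FILE=./queries.txt    # hypothetical file name

# Assumed format: one SMILES query per line.
cat > "$QUERIES_FILE" <<'EOF'
c1ccccc1
CC(=O)O
C1CCCCC1
EOF

RUN_CMD="$BIN_DIR/run_db --dbType=QtrRam --dbName=$DB_NAME --properties=0 --dataDirs=$INDEX_DIR --otherDataDir=$OTHER_DIR --mode=FromFile --queriesFile=$QUERIES_FILE --threads=1 --ansCount=10000 --timeLimit=60 --verificationStage=True"

# Show the command, then run it only if the binary has been built.
# (Relies on word splitting, so the paths must not contain spaces.)
echo "$RUN_CMD"
if [ -x "$BIN_DIR/run_db" ]; then
  $RUN_CMD
fi
```

Switching --mode to Interactive and dropping --queriesFile should give a console session for manual queries against the same index.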

Benchmarking and Research

The results described in this article were obtained using this dataset. The set of queries can be found in this file. The comparisons in the article are between QtrRam and BingoNoSQL. The runs were performed with the following arguments:

  • --threads=1
  • --ansCount=10000
  • --timeLimit=60
  • --treeDepth=21
  • --parallelizeDepth=4
  • --properties=0

Requirements

  • CMake 3.13 or higher
  • ninja 1.7.2 or higher
  • Libraries: libfreetype6-dev, libfontconfig1-dev, libasio-dev, libgflags-dev. To install them all, run: apt-get install libfreetype6-dev libfontconfig1-dev libasio-dev libgflags-dev
  • g++ 9.4 or higher

Build and run

  1. git clone https://github.com/quantori/qtr-fingerprint.git
  2. cd ./qtr-fingerprint
  3. git submodule update --init --recursive
  4. cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_MAKE_PROGRAM=ninja -G Ninja -S ./cpp -B ./cpp/cmake-build-release
  5. cmake --build ./cpp/cmake-build-release --target preprocessing -j 16 (Possible targets: preprocessing, build_db, run_db, tests)
  6. Executables are located in ./cpp/cmake-build-release/bin

Testing

The tests target accepts the following arguments:

  • --data_dir_path - Directory with test data. Test data is located in data.
  • --big_data_dir_path - Directory with big test data. Big test data is located in data.
  • --tmp_data_dir_path - Directory where temporary data is stored during testing. You can use, for example, data.

Log configuration

To configure the logger, set environment variables. The most common ones are listed below with their default values:

  1. GLOG_log_dir=
  2. GLOG_alsologtostderr=false
  3. GLOG_logtostdout=false
  4. GLOG_minloglevel=0 (levels in increasing order of severity: INFO, WARNING, ERROR, FATAL)
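For example, the variables might be exported in the shell before launching any of the applications. This is only a sketch: the log directory is a hypothetical path, and the numeric levels follow the usual glog convention (0 for INFO up to 3 for FATAL).

```shell
# Hypothetical log directory; log files are written there when it is set.
export GLOG_log_dir=./logs
mkdir -p "$GLOG_log_dir"          # create it up front in case the logger does not

export GLOG_alsologtostderr=true  # also copy log messages to stderr
export GLOG_minloglevel=1         # 0=INFO, 1=WARNING, 2=ERROR, 3=FATAL
```

With these settings, INFO messages are suppressed, and WARNING and above go both to files in the log directory and to stderr.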
