- cpp/preprocessing - Code for data preprocessing.
- cpp/build_db - Code for building search index.
- cpp/run_db - Code for executing queries on the search index.
- cpp/qtrlib - Main library containing essential components for the Qtr algorithm for substructure search.
In addition to the core components, we provide supplementary folders for experimental purposes:
- python - Experimental Python code.
- cpp/playground - Experimental C++ code.
- notebooks - Jupyter notebooks used for research purposes.
This project provides several console applications for working with a molecular dataset. There are three types of databases:
- `QtrRam` - A Qtr search index that loads data into RAM when working with the database.
- `BingoNoSQL` - A database that uses the Bingo NoSQL search index.
- `QtrDrive` - (Not implemented yet) A Qtr search index that stores data on a hard drive using memory mapping, allowing the algorithm to work under limited memory conditions.
The Qtr databases also support molecular property-based search. The following molecular properties are supported:
- `PUBCHEM_COMPONENT_COUNT`
- `PUBCHEM_XLOGP3`
- `PUBCHEM_ATOM_UDEF_STEREO_COUNT`
- `PUBCHEM_HEAVY_ATOM_COUNT`
- `PUBCHEM_CACTVS_TAUTO_COUNT`
- `PUBCHEM_ISOTOPIC_ATOM_COUNT`
- `PUBCHEM_CACTVS_HBOND_DONOR`
- `PUBCHEM_CACTVS_ROTATABLE_BOND`
- `PUBCHEM_MONOISOTOPIC_WEIGHT`
- `PUBCHEM_CACTVS_HBOND_ACCEPTOR`
- `PUBCHEM_ATOM_DEF_STEREO_COUNT`
- `PUBCHEM_COMPOUND_CID`
- `PUBCHEM_MOLECULAR_WEIGHT`
- `PUBCHEM_BOND_DEF_STEREO_COUNT`
- `PUBCHEM_TOTAL_CHARGE`
- `PUBCHEM_EXACT_MASS`
- `PUBCHEM_CACTVS_COMPLEXITY`
- `PUBCHEM_BOND_UDEF_STEREO_COUNT`
- `PUBCHEM_CACTVS_TPSA`
- `PUBCHEM_COMPOUND_CANONICALIZED`
Data Preprocessing: This component is responsible for generating fingerprints, molecule numbering, and more.
- `--properties` - Set to 1 if properties should be considered during preprocessing; otherwise, set to 0.
- `--sourceDir` - Path to the directory where source files are stored.
- `--destDir` - Destination directory for storing preprocessed files.
- `--preprocessingType` - Source file type (`SDF` or `CSV`).
  - `CSV` - If this preprocessing type is selected, tables in `.csv` format are expected in the `sourceDir`. The first column should contain the molecule ID, and the second column should contain the molecule in `SMILES` format. If the `--properties` flag is set, the remaining columns (3-22) should contain the molecule property values.
  - `SDF` - Preprocessing of molecules in `.sdf` format (does not support `--properties`).
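The CSV layout described above can be checked before running the preprocessing step. The following is a minimal Python sketch and not part of the project: the helper name `check_csv_row`, the comma delimiter, and the absence of a header row are assumptions, and the count of 20 property columns is taken from the property list above. The check is deliberately strict about the column count, which the tool itself may not be.

```python
import csv
import io

# Assumption: 20 property columns, one per property listed above.
PROPERTY_COUNT = 20


def check_csv_row(row, with_properties):
    """Return True if a parsed CSV row matches the layout described above.

    Hypothetical helper: column 1 is the molecule ID, column 2 is the SMILES
    string, and, when properties are enabled, the remaining columns hold
    property values.
    """
    expected = 2 + (PROPERTY_COUNT if with_properties else 0)
    if len(row) != expected:
        return False
    molecule_id, smiles = row[0], row[1]
    # Both the ID and the SMILES string must be non-empty.
    return bool(molecule_id.strip()) and bool(smiles.strip())


# Assumption: comma-separated values without a header row.
sample = io.StringIO("1,CCO\n2,c1ccccc1\n3\n")
results = [check_csv_row(row, with_properties=False) for row in csv.reader(sample)]
print(results)  # [True, True, False]
```

The third sample row has no SMILES column, so it is rejected.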
Search Index Construction: This component is responsible for building the search index.
- `--dbType` - `QtrDrive`/`QtrRam`/`BingoNoSQL`.
- `--properties` - Set to 1 if properties should be taken into account when building the index (only for `QtrRam`).
- `--dbName` - Name of the search index.
- `--sourceDir` - Folder with preprocessed files generated by the preprocessing program.
- `--destDirs` - Folders where search index files should be saved (for `BingoNoSQL` and `QtrDrive`, only the first path is used; for `QtrRam`, it makes sense to specify more than one folder if they are located on different hard drives to improve parallelism).
- `--otherDataDir` - Folder where search index files that cannot be stored in parallel are saved (only for `QtrRam`).
- `--parallelizeDepth` - Depth at which sub-trees of the search index should be built in parallel (affects construction speed, only for `QtrRam` and `QtrDrive`).
- `--treeDepth` - Depth to which the tree should be built (relevant only for `QtrRam` and `QtrDrive`).
Working with the Search Index: This component is responsible for querying the search index.
- `--dbType`, `--properties`, `--dbName` - See build_db.
- `--dataDirs` - Folders where the generated search index files are stored.
- `--otherDataDir` - Folder where search index files that cannot be stored in parallel are saved (only for `QtrRam`).
- `--threads` - Number of threads for query execution.
- `--mode` - `Interactive`/`FromFile`/`Web`
  - `Interactive` - Application for manual querying of the database through the console, suitable for simple manual tests.
  - `FromFile` - Application for benchmarking using molecules in SMILES format stored in the `queriesFile` file.
  - `Web` - Application for querying the database through a REST API.
- `--queriesFile` - File with SMILES molecules for benchmarking (only for `FromFile`).
- `--ansCount` - Maximum number of answers the application can return for a single query (if there are more answers than `ansCount`, some will be discarded).
- `--timeLimit` - Time limit in seconds for executing a single query.
- `--verificationStage` - True if the verification stage should be executed after screening; False if only the screening stage should be executed.
The results described in this article were obtained using this dataset. The set of queries can be found in this file. The comparisons in the article are between `QtrRam` and `BingoNoSQL`. The runs were performed with the following arguments:
- `--threads=1`
- `--ansCount=10000`
- `--timeLimit=60`
- `--treeDepth=21`
- `--parallelizeDepth=4`
- `--properties=0`
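
As a rough illustration, the arguments above can be split between `build_db` and `run_db` according to the option lists earlier in this document. The Python sketch below only assembles the command lines and does not run anything. The helper `make_command`, the database name, and all directory and file paths are placeholders, and `--mode=FromFile` is an assumption about how the benchmark was launched.

```python
import shlex

BIN_DIR = "./cpp/cmake-build-release/bin"  # location of the built executables


def make_command(program, **options):
    """Build an argument list in the `--name=value` form used above."""
    return [f"{BIN_DIR}/{program}"] + [f"--{k}={v}" for k, v in options.items()]


# Placeholder names and paths; replace them with real ones.
build_cmd = make_command(
    "build_db",
    dbType="QtrRam",
    dbName="example_db",
    sourceDir="/path/to/preprocessed",
    destDirs="/path/to/index",
    otherDataDir="/path/to/other",
    treeDepth=21,
    parallelizeDepth=4,
    properties=0,
)
run_cmd = make_command(
    "run_db",
    dbType="QtrRam",
    dbName="example_db",
    dataDirs="/path/to/index",
    otherDataDir="/path/to/other",
    mode="FromFile",
    queriesFile="/path/to/queries.txt",
    threads=1,
    ansCount=10000,
    timeLimit=60,
    properties=0,
)

# Print shell-ready command lines for inspection.
print(shlex.join(build_cmd))
print(shlex.join(run_cmd))
```

Keeping the commands as argument lists makes it straightforward to pass them to `subprocess.run` later without shell quoting issues.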
- CMake 3.13 or higher
- ninja 1.7.2 or higher
- `libfreetype6-dev`, `libfontconfig1-dev`, `libasio-dev`, `libgflags-dev` libs
  - `apt-get install libfreetype6-dev libfontconfig1-dev libasio-dev libgflags-dev` to install them all
- g++ 9.4 or higher
- `git clone https://github.com/quantori/qtr-fingerprint.git`
- `cd ./qtr-fingerprint`
- `git submodule update --init --recursive`
- `cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_MAKE_PROGRAM=ninja -G Ninja -S ./cpp -B ./cpp/cmake-build-release`
- `cmake --build ./cpp/cmake-build-release --target preprocessing -j 16` (Possible targets: `preprocessing`, `build_db`, `run_db`, `tests`)
- Executables are located in `./cpp/cmake-build-release/bin`
The `tests` arguments are:
- `--data_dir_path` - Directory with test data. Test data is located in data.
- `--big_data_dir_path` - Directory with big test data. Big test data is located in data.
- `--tmp_data_dir_path` - Directory where temporary data is stored during testing. You can use, for example, data.
To configure the logger, set environment variables. The most common ones are listed below with their default values:
- `GLOG_log_dir=`
- `GLOG_alsologtostderr=false`
- `GLOG_logtostdout=false`
- `GLOG_minloglevel=0` (the order of levels is `INFO`, `WARNING`, `ERROR`, `FATAL`)
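
For instance, these variables can be set for a single run without changing the shell environment. The Python sketch below builds such an environment. The helper `glog_env` and the log directory are hypothetical, and the numeric levels follow the order given above (`INFO` is 0, `FATAL` is 3).

```python
import os

# glog severity levels in increasing order; the index is the numeric value.
LEVELS = ["INFO", "WARNING", "ERROR", "FATAL"]


def glog_env(min_level="INFO", also_stderr=False, log_dir=""):
    """Return a copy of the current environment with glog settings applied."""
    env = dict(os.environ)
    env["GLOG_minloglevel"] = str(LEVELS.index(min_level))
    env["GLOG_alsologtostderr"] = "true" if also_stderr else "false"
    if log_dir:
        # Hypothetical directory; it is only set when explicitly requested.
        env["GLOG_log_dir"] = log_dir
    return env


env = glog_env(min_level="WARNING", also_stderr=True, log_dir="/tmp/qtr-logs")
print(env["GLOG_minloglevel"])      # 1
print(env["GLOG_alsologtostderr"])  # true
```

The resulting dictionary can be passed as `env=env` to `subprocess.run` when launching one of the executables.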