Implement bloom filter #95

Open
donaldcampbelljr opened this issue Feb 19, 2025 · 1 comment

Comments

@donaldcampbelljr
Member

Work has successfully begun in this PR: #94
Currently bare bones.

@donaldcampbelljr
Member Author

CLI

Added CLI yesterday for basic usage.

CREATE COMMAND EXAMPLE

./target/release/gtars igd bloom --action create \
  --universe "/home/drc/Downloads/bloom_testing/real_data/data/universe.merged.pruned.filtered100k.bed" \
  --bedfilesuniverse "/home/drc/Downloads/bloom_testing/test1/two_real_bed_files/" \
  --bloomdirectory "/home/drc/Downloads/bloom_testing/test1/" \
  --bloomname "test" \
  --numitems 10000 \
  --falsepositive 0.001
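
For a rough sense of what `--numitems 10000` and `--falsepositive 0.001` imply, here is a minimal sketch of the textbook Bloom filter sizing formulas. Whether gtars sizes its filters this way is an assumption, and the function names `optimal_num_bits` and `optimal_num_hashes` are hypothetical, not part of the gtars API.

```rust
/// Textbook optimal bit count: m = ceil(-n * ln(p) / (ln 2)^2).
/// Assumption: illustrative only, not taken from the gtars source.
fn optimal_num_bits(num_items: u64, false_positive: f64) -> u64 {
    let ln2 = std::f64::consts::LN_2;
    (-(num_items as f64) * false_positive.ln() / (ln2 * ln2)).ceil() as u64
}

/// Textbook optimal hash count: k = round((m / n) * ln 2), at least 1.
fn optimal_num_hashes(num_bits: u64, num_items: u64) -> u64 {
    let k = (num_bits as f64 / num_items as f64) * std::f64::consts::LN_2;
    (k.round() as u64).max(1)
}

fn main() {
    // Values taken from the create command example.
    let (n, p) = (10_000_u64, 0.001_f64);
    let m = optimal_num_bits(n, p);
    let k = optimal_num_hashes(m, n);
    // Round the bit count up to whole bytes for the on-disk estimate.
    println!("bits = {m}, bytes = {}, hashes = {k}", (m + 7) / 8);
}
```

Under these formulas, a single filter with these parameters needs about 144 thousand bits (roughly 18 KB) and 10 hash functions.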

SEARCH COMMAND EXAMPLE

./target/release/gtars igd bloom --action search \
  --universe "/home/drc/Downloads/bloom_testing/real_data/data/universe.merged.pruned.filtered100k.bed" \
  --bloomdirectory "/home/drc/Downloads/bloom_testing/test1/" \
  --bloomname "test" \
  --querybed "/home/drc/Downloads/bloom_testing/test1/query2.bed"
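
As background on what a search against a Bloom filter can and cannot tell you, here is a toy filter using only the Rust standard library. It is a sketch, not the implementation in the PR: the `Bloom` type, the double-hashing scheme, and the `chr:start-end` string keys are all assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Bloom filter; one bool per bit for readability, not compactness.
struct Bloom {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl Bloom {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        Bloom { bits: vec![false; num_bits], num_hashes }
    }

    /// Double hashing: index_i = (h1 + i * h2) mod m.
    fn indexes(&self, key: &str) -> Vec<usize> {
        let mut first = DefaultHasher::new();
        key.hash(&mut first);
        let h1 = first.finish();
        // Second hash: same key plus a fixed salt, forced odd so it is never zero.
        let mut second = DefaultHasher::new();
        (key, 0x9e37_79b9_u64).hash(&mut second);
        let h2 = second.finish() | 1;
        let m = self.bits.len() as u64;
        (0..self.num_hashes)
            .map(|i| (h1.wrapping_add(i.wrapping_mul(h2)) % m) as usize)
            .collect()
    }

    fn insert(&mut self, key: &str) {
        for i in self.indexes(key) {
            self.bits[i] = true;
        }
    }

    /// false means definitely absent; true means possibly present.
    fn might_contain(&self, key: &str) -> bool {
        self.indexes(key).into_iter().all(|i| self.bits[i])
    }
}

fn main() {
    let mut filter = Bloom::new(143_776, 10);
    // Hypothetical region keys; the real tool tokenizes against a universe.
    filter.insert("chr1:100-200");
    filter.insert("chr2:500-900");
    println!("inserted region: {}", filter.might_contain("chr1:100-200"));
    println!("other region:    {}", filter.might_contain("chr3:1-50"));
}
```

An inserted key always reports `true`. A key that was never inserted almost always reports `false`, but can report `true` at roughly the configured false positive rate, so a hit should be read as "possibly present".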

Performance Testing

Did some manual performance testing. It would be nice to use Criterion (https://crates.io/crates/criterion) to benchmark permutations of the variables, including:

  • size of input files (# of chroms, # of regions), size of universe for tokenization, size of query bedfile, number of hash functions, size of bloom filter, false positive rate

Input files: approximately 8.5 million regions in total
Universe: 100k regions

Creating using Bloom filters:
~16 seconds
~4.4 MB
Creating using IGD:
~8 seconds
~140 MB

Querying a file with 24 regions:
Bloom method:
0.071 seconds
IGD search:
0.009 seconds

Querying a file with 7.1 million regions:
Bloom method:
11 seconds
IGD search:
16 seconds

Bloom Filter Tree

Began work on implementing bloom filter tree with this commit: c749f21
