$ git clone https://github.com/cloudozer/BWT.git
$ cd BWT
$ git checkout master
$ ./rebar get-deps
# Setup domain config files (bwtm.dom and bwtw.dom)
$ make
- Download an archive: https://docs.google.com/uc?id=0B2DPaltm6IwpYVFHOEZYSGpldHc&export=download
- Extract it to the BWT folder
$ ./scripts/start_local.sh SRR770176_1.fastq GL000193.1 2
- Friendly Linux
- Xen
- Erlang OTP 17
- Git
- Internet access
Edit domain config file 'bwtm.dom', setup expected number of workers, ssh port, etc.
$ make
$ sudo xl create -c bwtm.dom
Edit domain config file 'bwtm.dom', setup master ip address, etc.
$ make
$ sudo xl create -c bwtw.dom
$ ssh %NODE_HOST% -p %PORT% # (password: 1)
[Disregards Info below this line]
These are small tools to help process large files in Erlang. In general, the strategy is to read in the file as an array of possibly overlapping Erlang binary "chunks". These can then be processed in parallel/concurrently.
- download bio_pfile.erl
- launch Erlang shell
- compile: c(bio_pfil)
- run: bio_pfile:read(Filename,NumerOfChunks) or bio_pfile:read(FileName,NumberOfChunks,SizeOfOverlap) both of which return an array of chunk elements: {{StartPos,Length},BinaryData}
1> Data = bio_pfile:read("../data/GCA_000001405.15_GRCh38_full_analysis_set.fna",10000).
[{{0,32553715},<<">chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS"...>>},
{{32543715,32563715},<<"ACCTCATAGATTGGTCATCTTTTTCTC\nCTATATTTCTCTAATATTTAATCTCTCTCTCTCTCTCTCTTTGTATGTGCATTGCCTTTGGAGAGATTTC\nC"...>>},
{{65097430,32563715},<<"AATCAAGAAAATATGTTTACCAAAA\nTGCATTGCAATTTTCCCAAACCTGAGTCTTCAAATAACAAACATGAACTTATAGGTACTGTGAACTAGAA"...>>},
{{97651145,32563715},<<"CAAGAATTGAGGTTTGGGAAACT\nCCATCTAGATTTCAGAGGATGTATGGAAATACCTGGATGTCCAGGCAGTAGTTTGCTGCAAGGGTGTG"...>>},
{{813832875,32563715},<<"TT\nT"...>>},
{{846386590,...},<<...>>},
{{...},...},
{...}|...]
2> length(Data).
10001
3> lists:nth(1,Data).
{{0,32553715},<<">chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRC"...>>}
4>
- run: bio_pfile:spawn_find_pattern(ChunkArray,BinaryPattern) which returns an array of all the stat positions where the pattern was found as {StartPosition,LengthOfPattern}.
4> bio_pfile:spawn_find_pattern(Data,<<"TATATTCAGTCTTTCTAACACCATTTATTGAAGAGACTGTAG">>).
[{162758595,42}]