integration #2

colindaven · 2022-01-25T15:47:39Z

This is still a private repository, but I have been using it to prototype some ideas on how to integrate the data from raspir, growth_rates, krakenuniq and metaphlan into haybaler.

To do the data integration we need to improve the speed of raspir, which is very slow at present. I'll update and handle this. Otherwise the pipeline will be very slow on big datasets, and it needs to be fast. This is why we currently generate haybaler first, and raspir, growth_rates etc. later.

The files from raspir and reporting (the step before haybaler) are very similarly named (see README.md in this repo). Maybe it would be easiest to integrate at this point.
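As a rough illustration of integrating at that point, here is a minimal Python sketch that pairs raspir and reporting files by their shared sample prefix. The suffixes `.raspir.csv` and `.report.csv` and the helper `pair_by_sample` are made up for the example; the real naming scheme is the one in README.md, so the suffix table would need adjusting.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical suffixes, for illustration only.
# The real file names are described in README.md.
SUFFIXES = {
    "raspir": ".raspir.csv",
    "reporting": ".report.csv",
}


def pair_by_sample(filenames, suffixes=SUFFIXES):
    """Group file names by the sample prefix left after stripping a known suffix."""
    pairs = defaultdict(dict)
    for name in filenames:
        base = Path(name).name
        for tool, suffix in suffixes.items():
            if base.endswith(suffix):
                sample = base[: -len(suffix)]
                pairs[sample][tool] = name
                break  # a file belongs to at most one tool
    return dict(pairs)


if __name__ == "__main__":
    files = ["s1.raspir.csv", "s1.report.csv", "s2.report.csv", "notes.txt"]
    for sample, found in pair_by_sample(files).items():
        print(sample, found)
```

Samples that only have one of the two files still show up, with a single entry, so missing raspir output is easy to spot before any merge is attempted.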

Can you have a look at this too? It's what Burkhard Tuemmler wants you to do before April, so it would be nice to do some further integration, either within haybaler or outside of it.

I have been looking at trying Nextflow for this, so that we don't have to manually manage all the input and output files, but I have only been working on it today. It could, theoretically, be a lot simpler, but with a steeper learning curve at the start.

Nextflow can start scripts like Haybaler too, so it may just replace the "runbatch_x" scripts rather than the pandas stuff.

cheers
Colin
