feat(docs): add information on how to speed up large pipelines #1599

Merged
merged 1 commit into from
Jan 13, 2025

Conversation

rileyhgrant
Contributor

@rileyhgrant rileyhgrant commented Jul 23, 2024

Adds one small paragraph about adding more workers to a dataproc cluster to run the computational portion of the variants pipeline in a shorter timeframe (~2 hours).

I had meant to add this the last time I ran the browser pipeline, and found this information in my Slack messages. I figure it's preferable to have it in the docs, as they currently only document how to speed up the Hail table -> Elasticsearch portion, not the portion that generates the Hail table.


Edited for clarity

@rileyhgrant rileyhgrant self-assigned this Jul 23, 2024
Contributor

@sjahl sjahl left a comment


I think the added information here is already present in the doc... but, perhaps not finding it is a good indicator that we need to adjust things. See the discussion/comment inline

Comment on lines 1 to +9
# Loading large datasets

## Running a large pipeline

To speed up the execution of large pipelines (such as the Short Variants pipeline), add additional worker nodes to the Dataproc cluster you create.

```
./deployctl dataproc-cluster start variants --num-workers 32
```
Contributor


This doc is kind of already about running large pipelines, so this feels a little redundant. Adding additional workers is documented in Step 4 -- however, that step instructs using preemptible workers instead of persistent ones. Is there a big difference here in your experience?

That said, things are still a little confusing; perhaps we could have an intro paragraph here that more clearly lays out how to select the number of Dataproc nodes and loading pods, and the relationship between the two (namely, that they should be equal)? Or maybe a tl;dr that gives an overview of the process to make the doc easier to follow?

Contributor Author

@rileyhgrant rileyhgrant Jul 30, 2024


Ah, I think these are two similar, but different steps of the overall pipeline process that each need more resources to allow large datasets to be processed in a reasonable amount of time.

The added part about including non-preemptible workers speeds up the execution of the pipeline that produces the final Hail table (e.g. `./deployctl data-pipeline run --cluster v4p1 gnomad_v4_variants`). The last time I ran the full variants pipeline, neither secondary workers nor preemptible workers did anything to speed up this computational portion of the pipeline.

The part documented in step 4 speeds up the loading of the final Hail table into Elasticsearch (e.g. `./deployctl elasticsearch load-datasets --dataproc-cluster es gnomad_v4_variants`). This section is certainly better documented than the little bit I added.

I do think it would be nice to have all the information on how to run a computationally expensive pipeline and then load the resulting Hail table into Elasticsearch in a single doc.
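
As a rough sketch of how the two stages fit together, the commands quoted in this thread could be sequenced as below. The cluster name `variants`, the worker count, and the `run` dry-run wrapper are assumptions for illustration only; the sketch also assumes the `es` cluster used for loading has already been started as described in Step 4 of the doc.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative values only; substitute your own cluster and pipeline names.
COMPUTE_CLUSTER="variants"
PIPELINE="gnomad_v4_variants"
NUM_WORKERS=32

# Hypothetical helper: prints each command unless DRY_RUN=0 is set,
# so the sequence can be reviewed before anything is launched.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Stage 1: compute the final Hail table on a cluster with extra
# (non-preemptible) workers.
run ./deployctl dataproc-cluster start "$COMPUTE_CLUSTER" --num-workers "$NUM_WORKERS"
run ./deployctl data-pipeline run --cluster "$COMPUTE_CLUSTER" "$PIPELINE"

# Stage 2: load the resulting Hail table into Elasticsearch.
# Assumes a separate cluster named "es" was created per Step 4.
run ./deployctl elasticsearch load-datasets --dataproc-cluster es "$PIPELINE"
```

By default this only prints the three commands in order; running it with `DRY_RUN=0` from the repository root would execute them.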

@rileyhgrant rileyhgrant merged commit e9acd8d into main Jan 13, 2025
3 checks passed
@rileyhgrant rileyhgrant deleted the update-big-data-docs branch January 15, 2025 15:58