Episode on parallel raster computations #90
Conversation
Hi @rbavery, sorry for taking so long with this episode. Would you be willing to give it a look and let me know your general impression? I struggled a bit to find an example that would be fast enough to run locally (especially with the data downloading), but also intensive enough to show some effects of parallelisation. Ultimately, I have thought about having a comparison between a serial calculation and its parallel version, where one sees how parallelisation can help in some parts and leave timings unaffected in others. I think this could still be a relevant element to teach. I have also dropped the memory profiling, as it seemed a bit tricky to monitor, and I thought the episode was already quite dense in content. I might still add a final exercise on …
Thanks @fnattino , reviewing this now.
A colleague of mine, @srmsoumya, is working on a stackstac lesson #102, so maybe we can introduce stackstac there instead of including it in this episode? And then that stackstac episode could build off of this episode?
thanks a bunch for this lesson @fnattino ! Left some comments and requests, I think this is close to ready to merge.
- "Recognize good practices in selecting proper chunk sizes."
- "Setup raster calculations that take advantage of parallelization."
keypoints:
- "TODO"
Here's a start:
- "chunks" are the unit of parallelization
- Specifying the `chunks` argument with `rioxarray.open_rasterio()` opens a raster as a chunked array
- the chunks tuple can be adapted to parallelize calculations across different raster dimensions
- python's `time` module or the jupyter `%%time` magic command can be used to profile calculations
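The profiling keypoint above can be sketched with Python's `time` module alone, outside Jupyter. This is only a minimal illustration: the array here is a hypothetical stand-in for a raster band, not the lesson's actual data.

```python
import time

import numpy as np

# Hypothetical stand-in for a single raster band: a 2048 x 2048 uint16 array.
data = np.ones((2048, 2048), dtype="uint16")

# time.perf_counter() is the plain-script equivalent of the %%time cell magic.
start = time.perf_counter()
total = int((data * 2).sum())  # the computation being profiled
elapsed = time.perf_counter() - start

print(f"sum = {total}, computed in {elapsed:.4f} s")
```

In a notebook, putting `%%time` on the first line of the cell reports the same wall time without this boilerplate.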
Wonderful, thanks! I have included these with some slight changes and added one more.
> ## More Resources on Dask
>
> TODO: Dask and Dask Arrays, with links
should we just link to this here? https://docs.dask.org/en/stable/array.html
Added an extra link to Xarray & Dask as well!
> > {: .output}
> >
> > Ideal values are thus multiples of 1024. An element to consider is the number of resulting chunks and their size.
> > Chunks should not be too big nor too small (i.e. too many). Recommended chunk sizes are of the order of 100 MB.
where does this recommendation come from?
> Recommended chunk sizes are of the order of 100 MB.
Is this because of the typical size of a raster file and typical block size? Or is this a Dask recommendation that applies more generally to Dask arrays?
From what I know, it was a rule of thumb for Dask arrays in general. Maybe "recommendation" is too strong; I have softened it a bit by saying that this is typically a good value (and linked to the following Dask blog post: https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes). How do you see it now?
> > ~~~
> > {: .language-python}
> >
> > which leads to chunks 72 MB large. Also, we can let `rioxarray` and Dask figure out appropriate chunk shapes by
Is my math wrong? I get 36 MB: (6144 x 6144 / 1024) / 1024 = 36. This could be because of the numeric type of the values in the array. Can we include a python calc of the chunk size?
Indeed, the number of bytes per array element is missing (the calculation is missing a "x 2", since they are `uint16`). Good point, adding the multiplication will make it much clearer.
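The python calculation asked for above can be done in a couple of lines; the shape and dtype below are the ones discussed in the thread (a 1 x 6144 x 6144 `uint16` chunk), but the helper function is just an illustrative sketch.

```python
import numpy as np

def chunk_size_mb(shape, dtype):
    """Return the size in MB (2**20 bytes) of one chunk of the given shape and dtype."""
    n_elements = int(np.prod(shape))
    return n_elements * np.dtype(dtype).itemsize / 2**20

# 6144 x 6144 elements, 2 bytes each (uint16): 72 MB, not 36 MB.
print(chunk_size_mb((1, 6144, 6144), "uint16"))  # 72.0
```

Dropping the `itemsize` factor yields the 36 figure from the comment above, which counts mega-elements rather than megabytes.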
> ## Serial vs parallel
>
> Can you speculate why the parallel calculation actually took longer than its serial version?
Could this be because of the size of the Dask task graph, that it's too large so the overhead of computing the task graph is more than the gains from parallelizing?
I'm not sure if we should make this a challenge, it might be too non-intuitive to those who haven't encountered Dask before. it's already non-intuitive for me!
Another cause could be that dask isn't parallelizing the writing step at all?
Indeed the task graph is relatively large, but also the chunk sizes that I have used here are small (4 and 32 MB), so the overhead is indeed larger than the gain.
But you are probably right in this being too non-intuitive for a challenge; I will add these arguments just as regular teaching material.
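A quick way to see why small chunks inflate the task graph is to compare the number of blocks Dask creates for the same array under different chunkings. The shapes below are illustrative, sized like the raster discussed in this thread, not taken from the lesson itself.

```python
import dask.array as da

shape = (6144, 6144)

# Small chunks: 512 x 512 x 2 bytes = 0.5 MB each, so even a trivial
# operation produces a task per block -- 144 tasks here.
small = da.zeros(shape, dtype="uint16", chunks=(512, 512))

# One big chunk: 72 MB, a single block, and a tiny task graph.
big = da.zeros(shape, dtype="uint16", chunks=shape)

print(small.npartitions)  # 144
print(big.npartitions)    # 1
# Each task carries fixed scheduling overhead; when the per-chunk work is
# tiny, that overhead can exceed the gain from parallelisation.
```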
Force-pushed from 165a87e to ee7de64
@fnattino thanks for the nice work 👍 I left some minor comments and suggestions on the changed files. Please let me know if something is unclear. I noticed that this lesson is very technical and some parts are explained in detail. In my opinion, some details can be skipped as they are not explicitly related to the objectives of the lesson, here are examples:
- we aim to show the usage of `%%time` in the section `Time profiling in Jupyter`. I don't think it is needed to load two datasets for this purpose. Can we load only one dataset?
- Explanations about the advantage of using the cloud-optimized GeoTIFF (COG) format over `reproject_match` can be skipped as they are not related to the usage of `%%time`. Or can we split this section into two subsections?
- we can skip the line "which leads to chunks 72 MB large: (1 x 6144 x 6144) elements, 2 bytes per element (the data type is unsigned integer `uint16`), i.e., 6144 x 6144 x 2 / 2^20 = 72 MB"
- we can skip this line "By default, Dask parallelizes operations on the CPUs that are available on the same machine, but it can be configured to dispatch tasks on large compute clusters."
- perhaps we can skip the explanations about the Dask graph or move them to a callout as a piece of additional information.
I might miss some points here. So please feel free to ignore my comment about the technical details. :-)
Co-authored-by: SarahAlidoost <[email protected]>
Thanks a lot @SarahAlidoost for the detailed feedback, I have included basically all your suggestions. Just let me know if you think the header questions are now clearer. Also @rbavery, do you want to have a second look, or do you feel your points have been addressed? No rush at all, I just want to make sure I do not merge before you have time for a final check!
Force-pushed from dd86be6 to cf11c74
@fnattino Thanks for addressing my comments. I approved it.
@fnattino I'm in favor of moving all the following info identified by @SarahAlidoost to a callout. I find it useful and important but agree it could distract from the main objectives of the lesson if included in the main text.
I think these items identified by @SarahAlidoost should all be moved to a callout as they contain useful info
shows folks how array size affects memory usage, gives them a point of reference that can help them when working with large rasters/arrays in memory
shows folks that Dask can still help them for cluster-sized jobs. more universities and orgs are offering Dask clusters via jupyterhubs or other services so I think this is good to highlight.
I think the task graph can be a good, quick proxy for whether a set of tasks is too complex for Dask to provide benefit, and it helps illustrate how Dask organizes lazy computation. I'm in favor of a callout; in any case @fnattino feel free to merge, looks great!
Thanks again for your feedback @SarahAlidoost and @rbavery!
Whoops, looks like I provided duplicate feedback here; I thought the first comment didn't make it. Great work!
This episode will address the second part of #82, including the initial feedback from @rbavery on #86 .