
Episode on parallel raster computations #90

Conversation

fnattino
Contributor

This episode will address the second part of #82, including the initial feedback from @rbavery on #86 .

@fnattino fnattino marked this pull request as ready for review February 25, 2022 19:57
@fnattino
Contributor Author

Hi @rbavery, sorry for taking so long with this episode. Would you be willing to give it a look and let me know your general impression?

I struggled a bit to find an example that is fast enough to run locally (especially considering the data downloading), but also intensive enough to show some effects of parallelisation. Ultimately, I settled on a comparison between a serial calculation and its parallel version, where one sees how parallelisation can speed up some parts while leaving the timings of others unaffected. I think this is still a relevant element to teach.

I have also dropped the memory profiling, as it seemed a bit tricky to monitor and I thought the episode was already quite dense in content.

I might still add a final exercise on stackstac for participants to apply what they have learned with chunking/lazy execution in a slightly different context.

@fnattino fnattino requested a review from rbavery February 25, 2022 23:43
@rbavery
Collaborator

rbavery commented Feb 27, 2022

Thanks @fnattino, reviewing this now.

> I might still add a final exercise on stackstac for participants to apply what they have learned with chunking/lazy execution in a slightly different context.

A colleague of mine, @srmsoumya, is working on a stackstac lesson (#102), so maybe we can introduce stackstac there instead of including it in this episode? And then that stackstac episode could build off of this episode?

Collaborator

@rbavery rbavery left a comment


Thanks a bunch for this lesson @fnattino! I left some comments and requests; I think this is close to ready to merge.

- "Recognize good practices in selecting proper chunk sizes."
- "Setup raster calculations that take advantage of parallelization."
keypoints:
- "TODO"
Collaborator


Here's a start:

  • "chunks" are the unit of parallelization
  • Specifying the chunks argument with rioxarray.open_rasterio() opens a raster as a chunked array
  • the chunks tuple can be adapted to parallelize calculations across different raster dimensions
  • python's time module or the jupyter %%time magic command can be used to profile calculations
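For reference, a minimal sketch of what these keypoints could look like in code (the file name and chunk sizes below are placeholders, not taken from the episode):

```python
import time

import rioxarray

# Passing `chunks` makes rioxarray return a lazy, Dask-backed array;
# here each chunk spans one band and a 2048 x 2048 spatial tile.
raster = rioxarray.open_rasterio(
    "example-raster.tif",   # hypothetical file, not from the episode
    chunks=(1, 2048, 2048),
)

# Profiling: in a plain script the time module works; in Jupyter the
# %%time cell magic gives the same information with less typing.
start = time.perf_counter()
result = (raster * 2).compute()   # .compute() triggers the parallel evaluation
print(f"Elapsed: {time.perf_counter() - start:.2f} s")
```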

Contributor Author


Wonderful, thanks! I have included these with some slight changes and added one more.


> ## More Resources on Dask
>
> TODO: Dask and Dask Arrays, with links
Collaborator


Should we just link to this here? https://docs.dask.org/en/stable/array.html

Contributor Author


Added an extra link to Xarray & Dask as well!

> > {: .output}
> >
> > Ideal values are thus multiples of 1024. An element to consider is the number of resulting chunks and their size.
> > Chunks should not be too big nor too small (i.e. too many). Recommended chunk sizes are of the order of 100 MB.
Collaborator


Where does this recommendation come from?

> Recommended chunk sizes are of the order of 100 MB.

Is this because of the typical size of a raster file and typical block size? Or is this a Dask recommendation that applies more generally to Dask arrays?

Contributor Author


As far as I know, it was a rule of thumb for Dask arrays in general. Maybe "recommendation" is too strong; I have softened it a bit by saying that this is typically a good value (and linked to the following Dask blog post: https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes). How does it look now?
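For illustration, a rough sketch of how one could aim for chunks of that order when opening a raster (the file path and chunk shape are just examples, not the ones used in the episode):

```python
import rioxarray

# For uint16 data (2 bytes per element), a 7000 x 7000 chunk is roughly
# 7000 * 7000 * 2 / 2**20 ≈ 93 MB, i.e. close to the ~100 MB rule of thumb.
raster = rioxarray.open_rasterio(
    "example-raster.tif",                       # hypothetical file path
    chunks={"band": 1, "x": 7000, "y": 7000},
)
print(raster.chunks)  # inspect the chunk layout actually used
```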

> > ~~~
> > {: .language-python}
> >
> > which leads to chunks 72 MB large. Also, we can let `rioxarray` and Dask figure out appropriate chunk shapes by
Collaborator


Is my math wrong? I get 36 MB: (6144 x 6144 / 1024) / 1024 = 36 MB. This could be because of the numeric type of the values in the array. Can we include a Python calculation of the chunk size?

Contributor Author


Indeed, the number of bytes per array element is missing (the calculation is missing a "x 2", since the values are uint16). Good point, adding the multiplication will make it much clearer.
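For instance, a quick sketch of the calculation that could go in the episode (values taken from the discussion above):

```python
# Chunk shape (1, 6144, 6144), data type uint16 -> 2 bytes per element
n_elements = 1 * 6144 * 6144
bytes_per_element = 2
chunk_size_mb = n_elements * bytes_per_element / 2**20
print(chunk_size_mb)  # 72.0
```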


> ## Serial vs parallel
>
> Can you speculate why the parallel calculation actually took longer than its serial version?
Collaborator


Could this be because the Dask task graph is too large, so that the overhead of computing the task graph is more than the gains from parallelizing?

I'm not sure if we should make this a challenge, it might be too non-intuitive to those who haven't encountered Dask before. It's already non-intuitive for me!

Another cause could be that Dask isn't parallelizing the writing step at all?

Contributor Author


Indeed the task graph is relatively large, but also the chunk sizes that I have used here are small (4 and 32 MB), so the overhead is indeed larger than the gain.

But you are probably right in this being too non-intuitive for a challenge. I will add these arguments just as regular teaching material.
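As an aside, one way to make that overhead visible (assuming graphviz is installed) is to render the task graph of the lazy result before computing it:

```python
import dask

# `result` stands for the lazily evaluated, Dask-backed DataArray built in
# the episode (the name is a placeholder). A very large or deeply nested
# graph suggests that scheduling overhead may outweigh the parallel gains.
dask.visualize(result.data, filename="task-graph.png")
```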

@fnattino fnattino force-pushed the parallel-raster-computations branch from 165a87e to ee7de64 Compare March 3, 2022 21:44
@fnattino
Contributor Author

fnattino commented Mar 3, 2022

Thanks a lot for the feedback @rbavery!

> A colleague of mine, srmsoumya is working on a stackstac lesson #102 so maybe we can introduce stackstac there instead of including it in this episode? And then that stackstac episode could build off of this episode?

Great, sounds like a plan then!

@fnattino fnattino requested a review from rbavery March 9, 2022 08:39
Contributor

@SarahAlidoost SarahAlidoost left a comment


@fnattino thanks for the nice work 👍 I left some minor comments and suggestions on the changed files. Please let me know if something is unclear. I noticed that this lesson is very technical and some parts are explained in detail. In my opinion, some details can be skipped as they are not explicitly related to the objectives of the lesson; here are some examples:

  • we aim to show the usage of %%time in the section Time profiling in Jupyter. I don't think we need to load two datasets for this purpose. Can we load only one dataset?
  • Explanations about the advantage of using the cloud-optimized GeoTIFF (COG) format over reproject_match can be skipped, as they are not related to the usage of %%time. Or can we split this section into two subsections?
  • we can skip the line which leads to chunks 72 MB large: (1 x 6144 x 6144) elements, 2 bytes per element (the data type is unsigned integer uint16), i.e., 6144 x 6144 x 2 / 2^20 = 72 MB
  • we can skip this line By default, Dask parallelizes operations on the CPUs that are available on the same machine, but it can be configured to dispatch tasks on large compute clusters.
  • perhaps we can skip the explanations about Dask graph or move it to a callout as a piece of additional information.

I might be missing some points here, so please feel free to ignore my comment about the technical details. :-)

@fnattino
Contributor Author

fnattino commented Mar 16, 2022

Thanks a lot @SarahAlidoost for the detailed feedback; I have included basically all your suggestions. Just let me know if you think the header questions are now clearer.

Also @rbavery, do you want to have a second look, or do you feel your points have been addressed? No rush at all, I just want to make sure I do not merge before you have time for a final check!

@fnattino fnattino force-pushed the parallel-raster-computations branch from dd86be6 to cf11c74 Compare March 16, 2022 15:13
Contributor

@SarahAlidoost SarahAlidoost left a comment


@fnattino Thanks for addressing my comments. I approved it.

@rbavery
Collaborator

rbavery commented Mar 17, 2022

@fnattino I'm in favor of moving all the following info identified by @SarahAlidoost to a callout. I find it useful and important but agree it could distract from the main objectives of the lesson if included in the main text.

> we can skip the line which leads to chunks 72 MB large: (1 x 6144 x 6144) elements, 2 bytes per element (the data type is unsigned integer uint16), i.e., 6144 x 6144 x 2 / 2^20 = 72 MB

In particular, I found this calculation useful to show folks how array size substantially affects memory use.

> we can skip this line By default, Dask parallelizes operations on the CPUs that are available on the same machine, but it can be configured to dispatch tasks on large compute clusters.

I think it's useful to highlight this in a callout. More campuses and other orgs are providing access to Dask on clusters, so it'd be good to let people know that Dask is an option that can stay with them even for jobs requiring clusters.
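A minimal sketch of what that configuration looks like with dask.distributed (the scheduler address below is a placeholder):

```python
from dask.distributed import Client

# Default: start a local cluster that uses the CPUs of this machine.
client = Client()

# Alternatively, connect to the scheduler of a remote cluster, e.g. one
# provided by an institutional JupyterHub (the address is hypothetical):
# client = Client("tcp://scheduler-address:8786")
```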

> perhaps we can skip the explanations about Dask graph or move it to a callout as a piece of additional information.

I'm in favor of the callout. Looking at the complexity of the Dask task graph can be a good, quick proxy for understanding why Dask doesn't lead to performance boosts.

@rbavery
Collaborator

rbavery commented Mar 17, 2022

I think these items identified by @SarahAlidoost should all be moved to a callout as they contain useful info

> we can skip the line which leads to chunks 72 MB large: (1 x 6144 x 6144) elements, 2 bytes per element (the data type is unsigned integer uint16), i.e., 6144 x 6144 x 2 / 2^20 = 72 MB

This shows folks how array size affects memory usage and gives them a point of reference that can help them when working with large rasters/arrays in memory.

> we can skip this line By default, Dask parallelizes operations on the CPUs that are available on the same machine, but it can be configured to dispatch tasks on large compute clusters.

This shows folks that Dask can still help them for cluster-sized jobs. More universities and orgs are offering Dask clusters via JupyterHubs or other services, so I think this is good to highlight.

> perhaps we can skip the explanations about Dask graph or move it to a callout as a piece of additional information.

I think the task graph can be a good, quick proxy for whether a set of tasks is too complex for Dask to provide benefit, and it helps illustrate how Dask organizes lazy computation. In favor of the callout.

In any case @fnattino feel free to merge, it looks great!

@fnattino
Contributor Author

fnattino commented Apr 1, 2022

Thanks again for your feedback @SarahAlidoost and @rbavery !

@fnattino fnattino merged commit 0942795 into carpentries-incubator:gh-pages Apr 1, 2022
@fnattino fnattino deleted the parallel-raster-computations branch April 1, 2022 15:08
@fnattino fnattino mentioned this pull request Apr 1, 2022
@rbavery
Collaborator

rbavery commented Apr 4, 2022

Whoops, looks like I provided duplicate feedback here; I thought the first comment didn't make it. Great work!
