
Episode on parallel raster computations #90

Conversation

fnattino
Contributor

This episode will address the second part of #82, including the initial feedback from @rbavery on #86 .

@fnattino fnattino marked this pull request as ready for review February 25, 2022 19:57
@fnattino
Contributor Author

Hi @rbavery, sorry for taking so long with this episode. Would you be willing to give it a look and let me know your general impression?

I struggled a bit to find an example that is fast enough to run locally (especially considering the data downloading), but also intensive enough to show some effects of parallelisation. Ultimately, I settled on a comparison between a serial calculation and its parallel version, where one sees how parallelisation can speed up some parts while leaving the timings of others unaffected. I think this is still a relevant element to teach.

I have also dropped the memory profiling, as it seemed a bit tricky to monitor and I thought the episode was already quite dense in content.

I might still add a final exercise on stackstac for participants to apply what they have learned with chunking/lazy execution in a slightly different context.

@fnattino fnattino requested a review from rbavery February 25, 2022 23:43
@rbavery
Collaborator

rbavery commented Feb 27, 2022

Thanks @fnattino, reviewing this now.

> I might still add a final exercise on stackstac for participants to apply what they have learned with chunking/lazy execution in a slightly different context.

A colleague of mine, @srmsoumya, is working on a stackstac lesson (#102), so maybe we can introduce stackstac there instead of including it in this episode? And then that stackstac episode could build off of this episode?

Collaborator

@rbavery rbavery left a comment


Thanks a bunch for this lesson @fnattino! I left some comments and requests; I think this is close to ready to merge.

- "Recognize good practices in selecting proper chunk sizes."
- "Setup raster calculations that take advantage of parallelization."
keypoints:
- "TODO"
Collaborator


Here's a start:

  • "chunks" are the unit of parallelization
  • Specifying the chunks argument with rioxarray.open_rasterio() opens a raster as a chunked array
  • the chunks tuple can be adapted to parallelize calculations across different raster dimensions
  • python's time module or the jupyter %%time magic command can be used to profile calculations
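For reference, a minimal sketch of what these keypoints could look like in code (the file name and chunk sizes below are placeholders, not taken from the episode):

```python
import time

import rioxarray

# Passing `chunks` makes rioxarray return a lazy, Dask-backed array;
# here each chunk spans one band and a 2048 x 2048 spatial tile.
raster = rioxarray.open_rasterio(
    "example-raster.tif",   # hypothetical file, not from the episode
    chunks=(1, 2048, 2048),
)

# Profiling: in a plain script the time module works; in Jupyter the
# %%time cell magic gives the same information with less typing.
start = time.perf_counter()
result = (raster * 2).compute()   # .compute() triggers the parallel evaluation
print(f"Elapsed: {time.perf_counter() - start:.2f} s")
```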

Contributor Author


Wonderful, thanks! I have included these with some slight changes and added one more.


> ## More Resources on Dask
>
> TODO: Dask and Dask Arrays, with links
Collaborator


Should we just link to this here? https://docs.dask.org/en/stable/array.html

Contributor Author


Added an extra link to Xarray & Dask as well!

> > {: .output}
> >
> > Ideal values are thus multiples of 1024. An element to consider is the number of resulting chunks and their size.
> > Chunks should not be too big nor too small (i.e. too many). Recommended chunk sizes are of the order of 100 MB.
Collaborator


Where does this recommendation come from?

> Recommended chunk sizes are of the order of 100 MB.

Is this because of the typical size of a raster file and typical block size? Or is this a Dask recommendation that applies more generally to Dask arrays?

Contributor Author


As far as I know, it was a rule of thumb for Dask arrays in general. Maybe "recommendation" is too strong; I have softened it a bit by saying that this is typically a good value (and linked to the following Dask blog post: https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes). How does it look now?
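For illustration, a rough sketch of how one could aim for chunks of that order when opening a raster (the file path and chunk shape are just examples, not the ones used in the episode):

```python
import rioxarray

# For uint16 data (2 bytes per element), a 7000 x 7000 chunk is roughly
# 7000 * 7000 * 2 / 2**20 ≈ 93 MB, i.e. close to the ~100 MB rule of thumb.
raster = rioxarray.open_rasterio(
    "example-raster.tif",                       # hypothetical file path
    chunks={"band": 1, "x": 7000, "y": 7000},
)
print(raster.chunks)  # inspect the chunk layout actually used
```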

> > ~~~
> > {: .language-python}
> >
> > which leads to chunks 72 MB large. Also, we can let `rioxarray` and Dask figure out appropriate chunk shapes by
Collaborator


Is my math wrong? I get 36 MB: (6144 x 6144 / 1024) / 1024 = 36 MB. This could be because of the numeric type of the values in the array. Can we include a Python calculation of the chunk size?

Contributor Author


Indeed, the number of bytes per array element is missing (the calculation is missing a "x 2", since the values are uint16). Good point, adding the multiplication will make it much clearer.
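For instance, a quick sketch of the calculation that could go in the episode (values taken from the discussion above):

```python
# Chunk shape (1, 6144, 6144), data type uint16 -> 2 bytes per element
n_elements = 1 * 6144 * 6144
bytes_per_element = 2
chunk_size_mb = n_elements * bytes_per_element / 2**20
print(chunk_size_mb)  # 72.0
```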


> ## Serial vs parallel
>
> Can you speculate why the parallel calculation actually took longer than its serial version?
Collaborator


Could this be because the Dask task graph is too large, so that the overhead of computing the task graph is more than the gains from parallelizing?

I'm not sure if we should make this a challenge, it might be too non-intuitive to those who haven't encountered Dask before. It's already non-intuitive for me!

Another cause could be that Dask isn't parallelizing the writing step at all?

Contributor Author


Indeed the task graph is relatively large, but also the chunk sizes that I have used here are small (4 and 32 MB), so the overhead is indeed larger than the gain.

But you are probably right in this being too non-intuitive for a challenge. I will add these arguments just as regular teaching material.
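As an aside, one way to make that overhead visible (assuming graphviz is installed) is to render the task graph of the lazy result before computing it:

```python
import dask

# `result` stands for the lazily evaluated, Dask-backed DataArray built in
# the episode (the name is a placeholder). A very large or deeply nested
# graph suggests that scheduling overhead may outweigh the parallel gains.
dask.visualize(result.data, filename="task-graph.png")
```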

@fnattino fnattino force-pushed the parallel-raster-computations branch from 165a87e to ee7de64 Compare March 3, 2022 21:44
@fnattino
Contributor Author

fnattino commented Mar 3, 2022

Thanks a lot for the feedback @rbavery!

> A colleague of mine, srmsoumya is working on a stackstac lesson #102 so maybe we can introduce stackstac there instead of including it in this episode? And then that stackstac episode could build off of this episode?

Great, sounds like a plan then!

@fnattino fnattino requested a review from rbavery March 9, 2022 08:39
Contributor

@SarahAlidoost SarahAlidoost left a comment


@fnattino thanks for the nice work 👍 I left some minor comments and suggestions on the changed files. Please let me know if something is unclear. I noticed that this lesson is very technical and some parts are explained in detail. In my opinion, some details can be skipped as they are not explicitly related to the objectives of the lesson; here are some examples:

  • we aim to show the usage of %%time in the section Time profiling in Jupyter. I don't think we need to load two datasets for this purpose. Can we load only one dataset?
  • Explanations about the advantage of using the cloud-optimized GeoTIFF (COG) format over reproject_match can be skipped, as they are not related to the usage of %%time. Or can we split this section into two subsections?
  • we can skip the line which leads to chunks 72 MB large: (1 x 6144 x 6144) elements, 2 bytes per element (the data type is unsigned integer uint16), i.e., 6144 x 6144 x 2 / 2^20 = 72 MB
  • we can skip this line By default, Dask parallelizes operations on the CPUs that are available on the same machine, but it can be configured to dispatch tasks on large compute clusters.
  • perhaps we can skip the explanations about Dask graph or move it to a callout as a piece of additional information.

I might be missing some points here, so please feel free to ignore my comment about the technical details. :-)

@fnattino
Contributor Author

fnattino commented Mar 16, 2022

Thanks a lot @SarahAlidoost for the detailed feedback; I have included basically all your suggestions. Just let me know if you think the header questions are now clearer.

Also @rbavery, do you want to have a second look, or do you feel your points have been addressed? No rush at all, I just want to make sure I do not merge before you have time for a final check!

@fnattino fnattino force-pushed the parallel-raster-computations branch from dd86be6 to cf11c74 Compare March 16, 2022 15:13
Contributor

@SarahAlidoost SarahAlidoost left a comment


@fnattino Thanks for addressing my comments. I approved it.

@rbavery
Collaborator

rbavery commented Mar 17, 2022

@fnattino I'm in favor of moving all the following info identified by @SarahAlidoost to a callout. I find it useful and important but agree it could distract from the main objectives of the lesson if included in the main text.

> we can skip the line which leads to chunks 72 MB large: (1 x 6144 x 6144) elements, 2 bytes per element (the data type is unsigned integer uint16), i.e., 6144 x 6144 x 2 / 2^20 = 72 MB

In particular, I found this calculation useful to show folks how array size substantially affects memory use.

> we can skip this line By default, Dask parallelizes operations on the CPUs that are available on the same machine, but it can be configured to dispatch tasks on large compute clusters.

I think it's useful to highlight this in a callout. More campuses and other orgs are providing access to Dask on clusters, so it'd be good to let people know that Dask is an option that can stay with them even for jobs requiring clusters.
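A minimal sketch of what that configuration looks like with dask.distributed (the scheduler address below is a placeholder):

```python
from dask.distributed import Client

# Default: start a local cluster that uses the CPUs of this machine.
client = Client()

# Alternatively, connect to the scheduler of a remote cluster, e.g. one
# provided by an institutional JupyterHub (the address is hypothetical):
# client = Client("tcp://scheduler-address:8786")
```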

> perhaps we can skip the explanations about Dask graph or move it to a callout as a piece of additional information.

I'm in favor of the callout. Looking at the complexity of the Dask task graph can be a good, quick proxy for understanding why Dask doesn't lead to performance boosts.

@rbavery
Collaborator

rbavery commented Mar 17, 2022

I think these items identified by @SarahAlidoost should all be moved to a callout as they contain useful info

> we can skip the line which leads to chunks 72 MB large: (1 x 6144 x 6144) elements, 2 bytes per element (the data type is unsigned integer uint16), i.e., 6144 x 6144 x 2 / 2^20 = 72 MB

This shows folks how array size affects memory usage and gives them a point of reference that can help them when working with large rasters/arrays in memory.

> we can skip this line By default, Dask parallelizes operations on the CPUs that are available on the same machine, but it can be configured to dispatch tasks on large compute clusters.

This shows folks that Dask can still help them for cluster-sized jobs. More universities and orgs are offering Dask clusters via JupyterHubs or other services, so I think this is good to highlight.

> perhaps we can skip the explanations about Dask graph or move it to a callout as a piece of additional information.

I think the task graph can be a good, quick proxy for whether a set of tasks is too complex for Dask to provide benefit, and it helps illustrate how Dask organizes lazy computation. In favor of the callout.

In any case @fnattino feel free to merge, it looks great!

@fnattino
Contributor Author

fnattino commented Apr 1, 2022

Thanks again for your feedback @SarahAlidoost and @rbavery !

@fnattino fnattino merged commit 0942795 into carpentries-incubator:gh-pages Apr 1, 2022
@fnattino fnattino deleted the parallel-raster-computations branch April 1, 2022 15:08
@fnattino fnattino mentioned this pull request Apr 1, 2022
@rbavery
Collaborator

rbavery commented Apr 4, 2022

Whoops, looks like I provided duplicate feedback here; I thought the first comment didn't make it. Great work!
