4 changes: 4 additions & 0 deletions .pre-commit-config.yaml
@@ -39,3 +39,7 @@ repos:
           - --select=DOC103 # TODO: Expand coverage to other codes
         additional_dependencies:
           - pydoclint[flake8]
+  - repo: https://github.com/kynan/nbstripout
+    rev: 0.8.1
+    hooks:
+      - id: nbstripout
136 changes: 32 additions & 104 deletions docs/examples/documentation_LargeRunsOutput.ipynb
@@ -3,7 +3,7 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "e2d8b054",
"id": "0",
"metadata": {},
"source": [
"# Dealing with large output files\n"
@@ -12,7 +12,7 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "eb5792d0",
"id": "1",
"metadata": {},
"source": [
"You might imagine that if you followed the instructions [on the making of parallel runs](https://docs.oceanparcels.org/en/latest/examples/documentation_MPI.html) and the loading of the resulting dataset, you could just use the `dataset.to_zarr()` function to save the data to a single `zarr` datastore. This is true for small enough datasets. However, there is a bug in `xarray` which makes this inefficient for large data sets.\n",
@@ -23,7 +23,7 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "4010c8d0",
"id": "2",
"metadata": {},
"source": [
"## Why are we doing this? And what chunk sizes should we choose?\n"
@@ -32,7 +32,7 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "46cb22d5",
"id": "3",
"metadata": {},
"source": [
"If you are running a relatively small case (perhaps 1/10 the size of the memory of your machine), nearly anything you do will work. However, as your problems get larger, it can help to write the data into a single zarr datastore, and to chunk that store appropriately.\n",
@@ -57,7 +57,7 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "0d5d3385",
"id": "4",
"metadata": {},
"source": [
"## How to save the output of an MPI ocean parcels run to a single zarr dataset\n"
@@ -66,7 +66,7 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "b5001615",
"id": "5",
"metadata": {},
"source": [
"First, we need to import the necessary modules, specify the directory `inputDir` which contains the output of the parcels run (the directory that has proc01, proc02 and so forth), the location of the output zarr file `outputDir` and a dictionary giving the chunk size for the `trajectory` and `obs` coordinates, `chunksize`.\n"
@@ -75,7 +75,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "2622a91d",
"id": "6",
"metadata": {},
"outputs": [],
"source": [
@@ -105,7 +105,7 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "33383cbe",
"id": "7",
"metadata": {},
"source": [
"Now for large datasets, this code can take a while to run; for 36 million trajectories and 250 observations, it can take an hour and a half. I prefer not to accidentally destroy data that takes more than an hour to create, so I put in a safety check and only let the code run if the output directory does not exist.\n"
@@ -114,7 +114,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "51a1414d",
"id": "8",
"metadata": {},
"outputs": [],
"source": [
@@ -128,7 +128,7 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "b8818397",
"id": "9",
"metadata": {},
"source": [
"It will often be useful to change the [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html) of the output data. Doing so can save a great deal of disk space. For example, the input data for this example is 88Gb in size, but by changing lat, lon and z to single precision, I can make the file about half as big.\n",
@@ -142,8 +142,8 @@
},
{
"cell_type": "code",
"execution_count": 14,
"id": "9ca2bb15",
"execution_count": null,
"id": "10",
"metadata": {},
"outputs": [],
"source": [
@@ -158,27 +158,18 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "c26e9ba7",
"id": "11",
"metadata": {},
"source": [
"Now we need to read in the data as discussed in the section on making an MPI run. However, note that `xr.open_zarr()` is given the `decode_times=False` option, which prevents the time variable from being converted into a datetime64[ns] object. This is necessary due to a bug in xarray. As discussed above, when the data set is read back in, time will again be interpreted as a datetime.\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "e7dd9f61",
"execution_count": null,
"id": "12",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"opening data from multiple process files\n",
" done opening in 6.37\n"
]
}
],
"outputs": [],
"source": [
"print(\"opening data from multiple process files\")\n",
"tic = time.time()\n",
@@ -195,16 +186,16 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "f93a60ff",
"id": "13",
"metadata": {},
"source": [
"Now we can take advantage of the `.astype` operator to change the type of the variables. This is a lazy operator, and it will only be applied to the data when the data values are requested below, when the data is written to a new zarr store.\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "6819cd84",
"execution_count": null,
"id": "14",
"metadata": {},
"outputs": [],
"source": [
@@ -215,27 +206,18 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "f8410589",
"id": "15",
"metadata": {},
"source": [
"The dataset is then rechunked to our desired shape. This does not actually do anything right now, but will when the data is written below. Before doing this, it is useful to remove the per-variable chunking metadata, because of inconsistencies which arise due to (I think) each MPI process output having a different chunking. This is explained in more detail in https://github.com/dcs4cop/xcube/issues/347\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "9a56c3cc",
"execution_count": null,
"id": "16",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"re-chunking\n",
" done in 9.15590238571167\n"
]
}
],
"outputs": [],
"source": [
"print(\"re-chunking\")\n",
"tic = time.time()\n",
@@ -249,26 +231,18 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "6f59018b",
"id": "17",
"metadata": {},
"source": [
"The dataset `dataIn` is now ready to be written back out with dataIn.to_zarr(). Because this can take a while, it is nice to delay computation and then compute() the resulting object with a progress bar, so we know how long we have to get a cup of coffee or tea.\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "de5415ed",
"execution_count": null,
"id": "18",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[########################################] | 100% Completed | 33m 9ss\n"
]
}
],
"outputs": [],
"source": [
"delayedObj = dataIn.to_zarr(outputDir, compute=False)\n",
"with ProgressBar():\n",
@@ -278,64 +252,18 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "9080025f",
"id": "19",
"metadata": {},
"source": [
"We can now load the zarr data set we have created, and see what is in it, compared to what was in the input dataset. Note that since we have not used \"decode_times=False\", the time coordinate appears as a datetime object.\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "3157592c",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The original data\n",
" <xarray.Dataset>\n",
"Dimensions: (trajectory: 39363539, obs: 250)\n",
"Coordinates:\n",
" * obs (obs) int32 0 1 2 3 4 5 6 7 ... 242 243 244 245 246 247 248 249\n",
" * trajectory (trajectory) int64 0 22 32 40 ... 39363210 39363255 39363379\n",
"Data variables:\n",
" age (trajectory, obs) float32 dask.array<chunksize=(9863, 10), meta=np.ndarray>\n",
" lat (trajectory, obs) float64 dask.array<chunksize=(9863, 10), meta=np.ndarray>\n",
" lon (trajectory, obs) float64 dask.array<chunksize=(9863, 10), meta=np.ndarray>\n",
" time (trajectory, obs) datetime64[ns] dask.array<chunksize=(9863, 10), meta=np.ndarray>\n",
" z (trajectory, obs) float64 dask.array<chunksize=(9863, 10), meta=np.ndarray>\n",
"Attributes:\n",
" Conventions: CF-1.6/CF-1.7\n",
" feature_type: trajectory\n",
" ncei_template_version: NCEI_NetCDF_Trajectory_Template_v2.0\n",
" parcels_mesh: spherical\n",
" parcels_version: 2.3.2.dev137 \n",
"\n",
"The new dataSet\n",
" <xarray.Dataset>\n",
"Dimensions: (trajectory: 39363539, obs: 250)\n",
"Coordinates:\n",
" * obs (obs) int32 0 1 2 3 4 5 6 7 ... 242 243 244 245 246 247 248 249\n",
" * trajectory (trajectory) int64 0 22 32 40 ... 39363210 39363255 39363379\n",
"Data variables:\n",
" age (trajectory, obs) float32 dask.array<chunksize=(50000, 10), meta=np.ndarray>\n",
" lat (trajectory, obs) float32 dask.array<chunksize=(50000, 10), meta=np.ndarray>\n",
" lon (trajectory, obs) float32 dask.array<chunksize=(50000, 10), meta=np.ndarray>\n",
" time (trajectory, obs) datetime64[ns] dask.array<chunksize=(50000, 10), meta=np.ndarray>\n",
" z (trajectory, obs) float32 dask.array<chunksize=(50000, 10), meta=np.ndarray>\n",
"Attributes:\n",
" Conventions: CF-1.6/CF-1.7\n",
" feature_type: trajectory\n",
" ncei_template_version: NCEI_NetCDF_Trajectory_Template_v2.0\n",
" parcels_mesh: spherical\n",
" parcels_version: 2.3.2.dev137\n"
]
}
],
"execution_count": null,
"id": "20",
"metadata": {},
"outputs": [],
"source": [
"dataOriginal = xr.concat(\n",
" [xr.open_zarr(f) for f in files],\n",
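A sketch of that comparison, assuming `outputDir` points at the newly written store:

```python
# Open both versions and print their dataset summaries side by side; the
# new store should show float32 variables and the requested chunking.
dataProcessed = xr.open_zarr(outputDir)
print("The original data\n", dataOriginal)
print("\nThe new dataSet\n", dataProcessed)
```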
71 changes: 8 additions & 63 deletions docs/examples/documentation_MPI.ipynb

Large diffs are not rendered by default.
