4 changes: 4 additions & 0 deletions .pre-commit-config.yaml
@@ -39,3 +39,7 @@ repos:
           - --select=DOC103 # TODO: Expand coverage to other codes
         additional_dependencies:
           - pydoclint[flake8]
+  - repo: https://github.com/kynan/nbstripout
+    rev: 0.8.1
+    hooks:
+      - id: nbstripout
136 changes: 32 additions & 104 deletions docs/examples/documentation_LargeRunsOutput.ipynb
@@ -3,7 +3,7 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "e2d8b054",
"id": "0",
"metadata": {},
"source": [
"# Dealing with large output files\n"
@@ -12,7 +12,7 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "eb5792d0",
"id": "1",
"metadata": {},
"source": [
"You might imagine that if you followed the instructions [on the making of parallel runs](https://docs.oceanparcels.org/en/latest/examples/documentation_MPI.html) and the loading of the resulting dataset, you could just use the `dataset.to_zarr()` function to save the data to a single `zarr` datastore. This is true for small enough datasets. However, there is a bug in `xarray` which makes this inefficient for large data sets.\n",
@@ -23,7 +23,7 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "4010c8d0",
"id": "2",
"metadata": {},
"source": [
"## Why are we doing this? And what chunk sizes should we choose?\n"
@@ -32,7 +32,7 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "46cb22d5",
"id": "3",
"metadata": {},
"source": [
"If you are running a relatively small case (perhaps 1/10 the size of the memory of your machine), nearly anything you do will work. However, as your problems get larger, it can help to write the data into a single zarr datastore, and to chunk that store appropriately.\n",
@@ -57,7 +57,7 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "0d5d3385",
"id": "4",
"metadata": {},
"source": [
"## How to save the output of an MPI ocean parcels run to a single zarr dataset\n"
@@ -66,7 +66,7 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "b5001615",
"id": "5",
"metadata": {},
"source": [
"First, we need to import the necessary modules, specify the directory `inputDir` which contains the output of the parcels run (the directory that has proc01, proc02 and so forth), the location of the output zarr file `outputDir` and a dictionary giving the chunk size for the `trajectory` and `obs` coordinates, `chunksize`.\n"
@@ -75,7 +75,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "2622a91d",
"id": "6",
"metadata": {},
"outputs": [],
"source": [
@@ -105,7 +105,7 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "33383cbe",
"id": "7",
"metadata": {},
"source": [
"Now for large datasets, this code can take a while to run; for 36 million trajectories and 250 observations, it can take an hour and a half. I prefer not to accidentally destroy data that takes more than an hour to create, so I put in a safety check and only let the code run if the output directory does not exist.\n"
@@ -114,7 +114,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "51a1414d",
"id": "8",
"metadata": {},
"outputs": [],
"source": [
@@ -128,7 +128,7 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "b8818397",
"id": "9",
"metadata": {},
"source": [
"It will often be useful to change the [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html) of the output data. Doing so can save a great deal of disk space. For example, the input data for this example is 88Gb in size, but by changing lat, lon and z to single precision, I can make the file about half as big.\n",
@@ -142,8 +142,8 @@
},
{
"cell_type": "code",
"execution_count": 14,
"id": "9ca2bb15",
"execution_count": null,
"id": "10",
"metadata": {},
"outputs": [],
"source": [
@@ -158,27 +158,18 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "c26e9ba7",
"id": "11",
"metadata": {},
"source": [
"Now we need to read in the data as discussed in the section on making an MPI run. However, note that `xr.open_zarr()` is given the `decode_times=False` option, which prevents the time variable from being converted into a datetime64[ns] object. This is necessary due to a bug in xarray. As discussed above, when the data set is read back in, time will again be interpreted as a datetime.\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "e7dd9f61",
"execution_count": null,
"id": "12",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"opening data from multiple process files\n",
" done opening in 6.37\n"
]
}
],
"outputs": [],
"source": [
"print(\"opening data from multiple process files\")\n",
"tic = time.time()\n",
@@ -195,16 +186,16 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "f93a60ff",
"id": "13",
"metadata": {},
"source": [
"Now we can take advantage of the `.astype` operator to change the type of the variables. This is a lazy operator, and it will only be applied to the data when the data values are requested below, when the data is written to a new zarr store.\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "6819cd84",
"execution_count": null,
"id": "14",
"metadata": {},
"outputs": [],
"source": [
@@ -215,27 +206,18 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "f8410589",
"id": "15",
"metadata": {},
"source": [
"The dataset is then rechunked to our desired shape. This does not actually do anything right now, but will when the data is written below. Before doing this, it is useful to remove the per-variable chunking metadata, because of inconsistencies which arise due to (I think) each MPI process output having a different chunking. This is explained in more detail in https://github.com/dcs4cop/xcube/issues/347\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "9a56c3cc",
"execution_count": null,
"id": "16",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"re-chunking\n",
" done in 9.15590238571167\n"
]
}
],
"outputs": [],
"source": [
"print(\"re-chunking\")\n",
"tic = time.time()\n",
@@ -249,26 +231,18 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "6f59018b",
"id": "17",
"metadata": {},
"source": [
"The dataset `dataIn` is now ready to be written back out with dataIn.to_zarr(). Because this can take a while, it is nice to delay computation and then compute() the resulting object with a progress bar, so we know how long we have to get a cup of coffee or tea.\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "de5415ed",
"execution_count": null,
"id": "18",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[########################################] | 100% Completed | 33m 9ss\n"
]
}
],
"outputs": [],
"source": [
"delayedObj = dataIn.to_zarr(outputDir, compute=False)\n",
"with ProgressBar():\n",
@@ -278,64 +252,18 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "9080025f",
"id": "19",
"metadata": {},
"source": [
"We can now load the zarr data set we have created, and see what is in it, compared to what was in the input dataset. Note that since we have not used \"decode_times=False\", the time coordinate appears as a datetime object.\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "3157592c",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The original data\n",
" <xarray.Dataset>\n",
"Dimensions: (trajectory: 39363539, obs: 250)\n",
"Coordinates:\n",
" * obs (obs) int32 0 1 2 3 4 5 6 7 ... 242 243 244 245 246 247 248 249\n",
" * trajectory (trajectory) int64 0 22 32 40 ... 39363210 39363255 39363379\n",
"Data variables:\n",
" age (trajectory, obs) float32 dask.array<chunksize=(9863, 10), meta=np.ndarray>\n",
" lat (trajectory, obs) float64 dask.array<chunksize=(9863, 10), meta=np.ndarray>\n",
" lon (trajectory, obs) float64 dask.array<chunksize=(9863, 10), meta=np.ndarray>\n",
" time (trajectory, obs) datetime64[ns] dask.array<chunksize=(9863, 10), meta=np.ndarray>\n",
" z (trajectory, obs) float64 dask.array<chunksize=(9863, 10), meta=np.ndarray>\n",
"Attributes:\n",
" Conventions: CF-1.6/CF-1.7\n",
" feature_type: trajectory\n",
" ncei_template_version: NCEI_NetCDF_Trajectory_Template_v2.0\n",
" parcels_mesh: spherical\n",
" parcels_version: 2.3.2.dev137 \n",
"\n",
"The new dataSet\n",
" <xarray.Dataset>\n",
"Dimensions: (trajectory: 39363539, obs: 250)\n",
"Coordinates:\n",
" * obs (obs) int32 0 1 2 3 4 5 6 7 ... 242 243 244 245 246 247 248 249\n",
" * trajectory (trajectory) int64 0 22 32 40 ... 39363210 39363255 39363379\n",
"Data variables:\n",
" age (trajectory, obs) float32 dask.array<chunksize=(50000, 10), meta=np.ndarray>\n",
" lat (trajectory, obs) float32 dask.array<chunksize=(50000, 10), meta=np.ndarray>\n",
" lon (trajectory, obs) float32 dask.array<chunksize=(50000, 10), meta=np.ndarray>\n",
" time (trajectory, obs) datetime64[ns] dask.array<chunksize=(50000, 10), meta=np.ndarray>\n",
" z (trajectory, obs) float32 dask.array<chunksize=(50000, 10), meta=np.ndarray>\n",
"Attributes:\n",
" Conventions: CF-1.6/CF-1.7\n",
" feature_type: trajectory\n",
" ncei_template_version: NCEI_NetCDF_Trajectory_Template_v2.0\n",
" parcels_mesh: spherical\n",
" parcels_version: 2.3.2.dev137\n"
]
}
],
"execution_count": null,
"id": "20",
"metadata": {},
"outputs": [],
"source": [
"dataOriginal = xr.concat(\n",
" [xr.open_zarr(f) for f in files],\n",
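A sketch of that comparison, assuming `outputDir` points at the newly written store:

```python
# Open both versions and print their dataset summaries side by side; the
# new store should show float32 variables and the requested chunking.
dataProcessed = xr.open_zarr(outputDir)
print("The original data\n", dataOriginal)
print("\nThe new dataSet\n", dataProcessed)
```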
71 changes: 8 additions & 63 deletions docs/examples/documentation_MPI.ipynb

Large diffs are not rendered by default.
