facebookresearch · se-yi · Mar 11, 2026 · Mar 10, 2026 · Mar 11, 2026
diff --git a/README.md b/README.md
@@ -829,6 +829,8 @@ chmv2_model = torch.hub.load(
 
 Refer to this [notebook](notebooks/chmv2_inference.ipynb) for an example of how to use the DINOv3 + CHMv2 model.
 
+This [notebook](notebooks/chmv2_dataset_exploration.ipynb) shows how to download inference data from the existing global dataset stored on AWS.
+
 ## License
 
 DINOv3 code and model weights are released under the DINOv3 License. See [LICENSE.md](LICENSE.md) for additional details.

diff --git a/notebooks/chmv2_dataset_exploration.ipynb b/notebooks/chmv2_dataset_exploration.ipynb
@@ -0,0 +1,246 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "9ed3e340-17fd-4b71-a98e-c776aa45d053",
+   "metadata": {},
+   "source": [
+    "# Get to Know a Dataset: Version 2 High Resolution Canopy Height Maps by WRI and Meta\n",
+    "\n",
+    "This notebook serves as a guided tour of the [Version 2 High Resolution Canopy Height Maps by WRI and Meta](https://registry.opendata.aws/dataforgood-fb-forestsv2/) dataset. More usage examples, tutorials, and documentation for this dataset and others can be found at the [Registry of Open Data on AWS](https://registry.opendata.aws/)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a3779654-eeee-4708-83cf-245e03303475",
+   "metadata": {},
+   "source": [
+    "\n",
+    "\n",
+    "### Q: How have you organized your dataset? Help us understand the key prefix structure of your S3 bucket.\n",
+    "\n",
+    "\n",
+    "\n",
+    "In our S3 bucket (\"dataforgood-fb-data\"), the prefix \"forests/v2/global/dinov3_global_chm_v2_ml3\" contains:\n",
+    "\n",
+    " 1. \"chm\", containing canopy height maps as Cloud Optimized GeoTIFFs.\n",
+    " 2. \"metadata\", containing GeoJSON files with the observation dates across the dataset.\n",
+    " 3. \"tiles.geojson\", a GeoJSON file containing the extent of each tile and its associated quadkey name.\n",
+    " \n",
+    " Full documentation for this dataset can be found at: https://arxiv.org/abs/2603.06382\n",
+    "\n",
+    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "29b47b69",
+   "metadata": {},
+   "source": [
+    "First we will import the Python libraries required throughout this notebook."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e65803f0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# This notebook requires the following additional libraries\n",
+    "# (please install using the preferred method for your environment, e.g. pip, conda):\n",
+    "#\n",
+    "# boto3 >= 1.38.23\n",
+    "# matplotlib >= 3.10.3 \n",
+    "# rasterio >= 1.5.0\n",
+    "\n",
+    "# Import the libraries required for this notebook: built-ins first\n",
+    "import json\n",
+    "from pprint import pprint\n",
+    "import tempfile\n",
+    "import os\n",
+    "# Installed libraries\n",
+    "import boto3\n",
+    "from botocore import UNSIGNED\n",
+    "from botocore.config import Config\n",
+    "import matplotlib.pyplot as plt\n",
+    "import rasterio\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5b14ae10",
+   "metadata": {},
+   "source": [
+    "Next, we will define the location of our dataset, create our boto3 S3 client, and list the top-level prefixes under our S3 path:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "be33d211",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Location of the S3 bucket for this dataset\n",
+    "bucket = \"dataforgood-fb-data\"\n",
+    "path = \"forests/v2/global/dinov3_global_chm_v2_ml3/\"\n",
+    "\n",
+    "# Create the boto3 S3 client. Because this is a public bucket, requests do not need to be signed,\n",
+    "# so we set the signature version to UNSIGNED, which allows anonymous access without AWS credentials.\n",
+    "s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))\n",
+    "\n",
+    "# Print the top-level prefixes under our path\n",
+    "for item in s3.list_objects_v2(Bucket=bucket, Prefix=path, Delimiter='/')['CommonPrefixes']:\n",
+    "    print(item['Prefix'])\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "efb9fa4d",
+   "metadata": {},
+   "source": [
+    "\n",
+    "\n",
+    "Looking into the GeoTIFF prefix (\"chm\") of our dataset, we see a list of .tif files, with names corresponding to quadkey tiles at zoom_level=10.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c582a4ce",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "path = \"forests/v2/global/dinov3_global_chm_v2_ml3/chm/\"\n",
+    "\n",
+    "\n",
+    "# Each page returned by the paginator holds at most 1000 keys\n",
+    "paginator = s3.get_paginator(\"list_objects_v2\")\n",
+    "pages = paginator.paginate(Bucket=bucket, Prefix=path)\n",
+    "\n",
+    "outlist = []\n",
+    "# Only read the first page that has contents\n",
+    "for page in pages:\n",
+    "    if \"Contents\" in page:\n",
+    "        objlist = [obj[\"Key\"] for obj in page[\"Contents\"]]\n",
+    "        outlist.extend(objlist)\n",
+    "        break\n",
+    "# Print only the last 10 keys of that page\n",
+    "pprint(outlist[-10:])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dd7f4bcf-ec40-432f-a31f-4477efa205ee",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "\n",
+    "\n",
+    "### Q: What data formats are present in your dataset? What kinds of data are stored using these formats? Can you give any advice for how you work with these data formats?\n",
+    "\n",
+    "\n",
+    "Our dataset comes as a set of Cloud Optimized GeoTIFFs:\n",
+    "\n",
+    "-  The extent of each GeoTIFF is a zoom_level=10 [Web Mercator tile](https://en.wikipedia.org/wiki/Web_Mercator_projection).\n",
+    "-  The filenames are the quadkeys of the containing tiles.\n",
+    "-  Each GeoTIFF contains a single data band, which represents the top-of-canopy height above the ground in meters.\n",
+    "-  The mask band of the GeoTIFF is a boolean representing whether or not the input imagery has been flagged as containing a cloud.\n",
+    "-  The CRS is EPSG:3857.\n",
+    "\n",
+    "\n",
+    "The GeoJSON files contain a set of polygons for a given tile.\n",
+    "- Each polygon has a single attribute: a string with the observation date of the input imagery.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7362bd15",
+   "metadata": {},
+   "source": [
+    "### Q: Can you show us an example of downloading and loading data from your dataset?\n",
+    "\n",
+    "As an example, let us download and look at one GeoTIFF.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1fd6c00b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Download one CHM tile into a temporary directory and read it with rasterio\n",
+    "s3file = \"forests/v2/global/dinov3_global_chm_v2_ml3/chm/0022222122.tif\"\n",
+    "with tempfile.TemporaryDirectory() as tmpdir:\n",
+    "    local_path = os.path.join(tmpdir, os.path.basename(s3file))\n",
+    "    s3.download_file(bucket, s3file, local_path)\n",
+    "    with rasterio.open(local_path) as src:\n",
+    "        chm, meta = src.read().squeeze(), src.meta\n",
+    "print(meta)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2856b30d-63a9-4725-a296-1af794d9d3db",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.imshow(chm[:1000, :1000])  # show the top-left 1000 x 1000 pixels of the tile"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "183c8b85-ed1c-4f2c-bd0e-fbfbc67c4723",
+   "metadata": {},
+   "source": [
+    "\n",
+    "\n",
+    "### Q: What is one question that you have answered using these data? Can you show us how you came to that answer?\n",
+    "\n",
+    "We have used the data to compare the relative canopy height of two nearby areas. When evaluating forest restoration and carbon storage potential, it is useful to compare the existing canopy volume (i.e., integrated canopy height) of a given area with the canopy volume of a mature forest nearby.\n",
+    "\n",
+    "This example highlights the strengths of the dataset (high-resolution canopy height estimates, available globally) while minimizing some weaknesses (errors related to view angle, data available for a single point in time) by making relative (rather than absolute) measurements.\n",
+    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cf645724-3108-4ada-a832-10b3431eb8e2",
+   "metadata": {},
+   "source": [
+    "### Q: What is one unanswered question that you think could be answered using these data? Do you have any recommendations or advice for someone wanting to answer this question?\n",
+    "\n",
+    "\n",
+    "The connection between canopy height maps and biomass is a challenging but important link for carbon markets. Solving this problem would be valuable not just for this type of dataset, but for aerial lidar datasets as well.\n",
+    "\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.14.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}