add PMTiles notebook (#91)

furqaankhan · web-flow · commit 61f53af54919 · 2025-10-15T11:28:10.000-07:00
* add notebook

* make linter happy

* add blog link

* fix capitalization
diff --git a/Analyzing_Data/PMTiles-railroad.ipynb b/Analyzing_Data/PMTiles-railroad.ipynb
@@ -0,0 +1,350 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "7de6fe5f-ec16-47f8-94d1-16aa6ca43ac4",
+   "metadata": {},
+   "source": [
+    "![Wherobots Logo](https://raw.githubusercontent.com/wherobots/wherobots-examples/refs/heads/main/assets/img/header-logo.png)\n",
+    "\n",
+    "# Generate PMTiles using Wherobots\n",
+    "\n",
+    "This notebook demonstrates how to generate a PMTiles file from the U.S. Census Bureau's TIGER railroad dataset using Wherobots.\n",
+    "\n",
+    "This notebook is part of a hands-on project that shows you how to generate and visualize PMTiles. The project consists of three parts:\n",
+    "\n",
+    "1.  [**Blog Post:**](https://wherobots.com/pmtiles-rendered-in-esri-maps-api/) A quick post that introduces and showcases this capability.\n",
+    "2.  **Jupyter Notebook (This file):** The practical, step-by-step code for generating the PMTiles file.\n",
+    "3.  [**Web Visualization Repo:**](https://github.com/wherobots/pmtiles-esri-tile-layer) A tile server and the client-side code that uses the **Esri JavaScript SDK** to render your PMTiles on a basemap.\n",
+    "\n",
+    "---\n",
+    "### What You'll Do in This Notebook:\n",
+    "\n",
+    "In the following cells, you will:\n",
+    "* Download and prepare the TIGER railroad shapefile, uploading it to your Wherobots Managed Storage.\n",
+    "* Filter the nationwide data for a specific region (Texas) using spatial SQL with Sedona.\n",
+    "* Generate a PMTiles file with a single command using the Wherobots `vtiles` library.\n",
+    "* Visualize the resulting map tiles directly within the notebook.\n",
+    "\n",
+    "### Cost to generate PMTiles over Texas\n",
+    "\n",
+    "* Time taken: **1m 18s**\n",
+    "* Cost: **$0.16**\n",
+    "* Runtime size: **Tiny**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "237e2a97-07ee-4926-9af4-2ba55d1bac22",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import requests\n",
+    "import zipfile\n",
+    "import io\n",
+    "import boto3\n",
+    "import wkls\n",
+    "from wherobots import vtiles\n",
+    "from urllib.parse import urlparse\n",
+    "from sedona.spark import *\n",
+    "from pyspark.sql.functions import *"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d2f5eae3-6273-4ccb-a65c-461eda3ec589",
+   "metadata": {},
+   "source": [
+    "# Download the railroad dataset from TIGER\n",
+    "\n",
+    "The helper function below downloads the zipped shapefile, extracts it in memory, and uploads each extracted file to your Managed Storage (S3 bucket).\n",
+    "\n",
+    "If the Census Bureau's TIGER download server is unavailable, a mirror of the data is available in our public S3 bucket:\n",
+    "\n",
+    "`s3://wherobots-examples/data/pmtiles-blog/tl_2024_us_rails/`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e4229550-ed59-43b8-a2c2-092dd30f17b8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def parse_s3_uri(s3_uri):\n",
+    "    \"\"\"\n",
+    "    Parses an S3 URI (e.g., 's3://bucket-name/folder/path')\n",
+    "    and returns the bucket name and the path.\n",
+    "    \n",
+    "    Args:\n",
+    "        s3_uri (str): The S3 URI string.\n",
+    "        \n",
+    "    Returns:\n",
+    "        tuple: A tuple containing (bucket_name, folder_path).\n",
+    "    \"\"\"\n",
+    "    parsed_uri = urlparse(s3_uri)\n",
+    "    if parsed_uri.scheme != 's3':\n",
+    "        raise ValueError(\"Invalid S3 URI. Must start with 's3://'\")\n",
+    "    return parsed_uri.netloc, parsed_uri.path.lstrip('/')\n",
+    "\n",
+    "def download_and_upload_to_s3(zip_url, s3_uri):\n",
+    "    \"\"\"\n",
+    "    Downloads a zip file from a URL using requests, extracts its contents,\n",
+    "    and uploads each file to an S3 bucket specified by an S3 URI.\n",
+    "\n",
+    "    Args:\n",
+    "        zip_url (str): The URL of the zip file to download.\n",
+    "        s3_uri (str): The S3 URI (e.g., 's3://bucket-name/folder/path')\n",
+    "                      where extracted files will be uploaded.\n",
+    "    \"\"\"\n",
+    "    try:\n",
+    "        # Ignore the InsecureRequestWarning when verify=False\n",
+    "        requests.packages.urllib3.disable_warnings(requests.packages.urllib3.exceptions.InsecureRequestWarning)\n",
+    "\n",
+    "        # 1. Parse the S3 URI\n",
+    "        s3_bucket, s3_path_prefix = parse_s3_uri(s3_uri)\n",
+    "\n",
+    "        # 2. Download the zip file into memory, ignoring SSL certificate errors\n",
+    "        print(\"Downloading zip file...\")\n",
+    "        response = requests.get(zip_url, verify=False)\n",
+    "        response.raise_for_status()\n",
+    "        \n",
+    "        # 3. Extract and upload each file to S3\n",
+    "        zip_buffer = io.BytesIO(response.content)\n",
+    "        s3_client = boto3.client('s3')\n",
+    "        with zipfile.ZipFile(zip_buffer, 'r') as zip_file:\n",
+    "            file_list = zip_file.namelist()\n",
+    "            print(f\"Found {len(file_list)} files in the zip.\")\n",
+    "            for filename in file_list:\n",
+    "                if not filename.endswith('/'):\n",
+    "                    with zip_file.open(filename, 'r') as file_in_zip:\n",
+    "                        file_buffer = io.BytesIO(file_in_zip.read())\n",
+    "\n",
+    "                        s3_key = f\"{s3_path_prefix}/{filename}\".lstrip('/')\n",
+    "\n",
+    "                        # Upload the file from memory to S3\n",
+    "                        print(f\"Uploading {s3_key} to {s3_bucket}...\")\n",
+    "                        s3_client.upload_fileobj(file_buffer, s3_bucket, s3_key)\n",
+    "            \n",
+    "            print(\"All files extracted and uploaded to S3 successfully!\")\n",
+    "                        \n",
+    "    except requests.exceptions.RequestException as e:\n",
+    "        print(f\"HTTP Request failed: {e}\")\n",
+    "    except zipfile.BadZipFile:\n",
+    "        print(\"The downloaded file is not a valid zip file.\")\n",
+    "    except ValueError as e:\n",
+    "        print(f\"Input error: {e}\")\n",
+    "    except Exception as e:\n",
+    "        print(f\"An error occurred: {e}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "90660339-7d6e-4467-854a-625ddccd32b9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "zip_url = 'https://www2.census.gov/geo/tiger/TIGER2024/RAILS/tl_2024_us_rails.zip'\n",
+    "base_s3_uri = f'{os.getenv(\"USER_S3_PATH\")}PMTiles-example'\n",
+    "s3_destination_uri = f'{base_s3_uri}/data'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bf7c9ebb-f5f8-460f-90c4-b24f07c007de",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "download_and_upload_to_s3(zip_url, s3_destination_uri)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1cbc07c0-7e69-4685-b8f3-edd2df9d3857",
+   "metadata": {},
+   "source": [
+    "## Getting WherobotsDB started\n",
+    "\n",
+    "Creating a Sedona context gives you access to WherobotsDB and the PMTiles generator."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bd6f9a02-0ec9-45d4-86f7-407f484feda3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "config = SedonaContext.builder().getOrCreate()\n",
+    "\n",
+    "sedona = SedonaContext.create(config)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "45f42618-2156-485c-a35e-61afd3a65f29",
+   "metadata": {},
+   "source": [
+    "## Read in the files that we downloaded"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "83ceac98-c680-4af8-be13-f1ca286ec6cd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_rail = sedona.read.format(\"shapeFile\").load(s3_destination_uri)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "38152e6e-c4b8-4945-8187-6dd9b555f604",
+   "metadata": {},
+   "source": [
+    "## Filter by Texas boundary\n",
+    "\n",
+    "Feel free to change the filter to another U.S. state, or remove it entirely to reproduce the nationwide result shown in the blog post.\n",
+    "\n",
+    "To generate PMTiles for the entire dataset, skip the filter and only add the `layer` column:\n",
+    "\n",
+    "```python\n",
+    "df_rail = df_rail.withColumn(\"layer\", lit(\"railroads\"))\n",
+    "```\n",
+    "\n",
+    "[Click here to learn how to select another state using the `wkls` library.](https://github.com/wherobots/wkls?tab=readme-ov-file#quick-start)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "66fd0b5d-0c4d-4ee6-82d3-2c3dfa592e80",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "texas_wkt = wkls.us.tx.wkt()\n",
+    "\n",
+    "df_rail = df_rail \\\n",
+    "                .where(f\"ST_Intersects(geometry, ST_GeomFromWKT('{texas_wkt}'))\")\\\n",
+    "                .withColumn(\"layer\", lit(\"railroads\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2057fa35-90ee-488f-b35a-c058891674fc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_rail.printSchema()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ee060b06-bb09-475c-a129-6b287f4163f9",
+   "metadata": {},
+   "source": [
+    "## About the data\n",
+    "\n",
+    "MTFCC stands for MAF/TIGER Feature Class Code, a code that the U.S. Census Bureau assigns to classify and describe geographic features such as roads, rivers, and railroad tracks. The MTFCC code `R1011` means a Railroad Feature (Main, Spur, or Yard).\n",
+    "\n",
+    "LINEARID is the Linear Feature Identifier, a unique ID used in U.S. Census Bureau TIGER (Topologically Integrated Geographic Encoding and Referencing) data to associate a street or feature name with its location in the spatial data, such as an edge or address range."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a2a61c3d-e299-4d68-b5d9-7dee892c9898",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_rail.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0517ea33-d386-411b-9bc6-9328ec6e22d5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_rail.select(\"LINEARID\").distinct().count() == df_rail.count()  # True when every LINEARID value is unique"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6018d5ab-59a9-45d1-ac1f-baf21c2a5fff",
+   "metadata": {},
+   "source": [
+    "## Generating the PMTiles\n",
+    "\n",
+    "A single call to `vtiles.generate_pmtiles` builds the PMTiles file from the filtered DataFrame and writes it directly to your S3 bucket."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cb42252d-c977-4224-a2a8-308a66037c3a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_rail.count()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ffbdaa2a-d15f-481a-a779-ef0dc56e6736",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "s3_full_path = f\"{base_s3_uri}/pmtiles/railroads.pmtiles\"\n",
+    "\n",
+    "vtiles.generate_pmtiles(df_rail, s3_full_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "11f22174-e707-4777-ac76-15aba426a28b",
+   "metadata": {},
+   "source": [
+    "The next cell renders the tiles inside the notebook. Alternatively, you can load the PMTiles file into the [Wherobots-hosted PMTiles viewer](https://tile-viewer.wherobots.com/) to visualize it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "68bbb9c4-29cf-45b0-8fe0-01fa28fdad38",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vtiles.show_pmtiles(s3_full_path)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/README.md b/README.md
@@ -29,6 +29,7 @@ will pass as the pre-commit hooks will fix the issues it finds.
 |   |-- K_Nearest_Neighbor_Join.ipynb
 |   |-- Local_Outlier_Factor.ipynb
 |   |-- Object_Detection.ipynb
+|   |-- PMTiles-railroad.ipynb
 |   |-- Raster_Classification.ipynb
 |   |-- Raster_Segmentation.ipynb
 |   |-- Raster_Text_To_Segments_Airplanes.ipynb