
test notebooks in 'docs/', make cuspatial_api_examples self-contained, skip long-running notebook, fix some docs #1407

Merged: 6 commits merged into rapidsai:branch-24.08 from fix/example-notebook on Jul 19, 2024

Conversation

@jameslamb (Member) commented Jul 17, 2024

Description

Contributes to #1406.

Unblocks CI.

CI in https://github.com/rapidsai/docker has been failing because of errors in the cuspatial_api_examples.ipynb notebook. Specifically, that notebook refers unconditionally to local files that are not guaranteed to exist.

To fix this, this PR proposes the following:

  • in the notebook, conditionally recreate those files if they don't yet exist (a short sketch of this pattern follows the list)
  • always test the notebooks in docs/ in CI here
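For illustration, here is a minimal sketch of that "recreate if missing" pattern. The file name `points.csv` and the inline sample data are made up for this example; this is not the exact cell added to cuspatial_api_examples.ipynb, just the general shape of the change:

```python
# Hypothetical sketch: only (re)create the input file if it is missing, so the
# notebook stays self-contained on machines that don't already have the data.
# "points.csv" and the sample coordinates are made up for this example.
import os

import cudf

if not os.path.exists("points.csv"):
    cudf.DataFrame(
        {"lon": [-73.98, -73.96], "lat": [40.75, 40.77]}
    ).to_csv("points.csv", index=False)

points = cudf.read_csv("points.csv")
```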

CI here is also failing, because the nyc_taxi_years_correlation notebook recently started taking a prohibitively long time to run. This PR therefore proposes:

  • skipping nyc_taxi_years_correlation notebook in CI

And some other small things noticed while doing all this:

  • fixing typo (rint -> ring) and argument ordering in GeoSeries.from_polygons_xy() docs

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Notes for Reviewers

I ran these notebooks on a system with CUDA 12.2, using the latest 24.8.* nightlies of RAPIDS conda packages.

@jameslamb added the bug (Something isn't working) and non-breaking (Non-breaking change) labels on Jul 17, 2024

@jameslamb (Member, Author) commented Jul 17, 2024

(edited Jul 21): moved these details to a separate issue, #1413


I think that the nyc_taxi_years_correlation.ipynb notebook has just become prohibitively expensive to run in CI.

  • this PR: 2.5+ hours before I killed it (build link)
  • most recent successful run (July 11): 8m38s (build link)

Possible reasons I can think of:

  • cuspatial or cudf is now falling back to the CPU for some operation it was previously able to do entirely on the GPU
  • the input data hosted at those remote storage locations has grown significantly
  • some other performance regression in cudf, cuspatial, or their dependencies
  • some change in the CI environment

On a machine with CUDA 12.2, latest nightlies of RAPIDS libraries (conda), and 8 V100s, I reduced that notebook down to this minimal example:

Imports and data setup:
import cuspatial
import geopandas as gpd
import cudf
import numpy as np

# (shell commands run in notebook cells)
# !if [ ! -f "tzones_lonlat.json" ]; then curl "https://data.cityofnewyork.us/api/geospatial/d3c5-ddgc?method=export&format=GeoJSON" -o tzones_lonlat.json; else echo "tzones_lonlat.json found"; fi
# !if [ ! -f "taxi2016.csv" ]; then curl https://storage.googleapis.com/anaconda-public-data/nyc-taxi/csv/2016/yellow_tripdata_2016-01.csv -o taxi2016.csv; else echo "taxi2016.csv found"; fi   

taxi2016 = cudf.read_csv("taxi2016.csv")
tzones = gpd.GeoDataFrame.from_file('tzones_lonlat.json')
taxi_zones = cuspatial.from_geopandas(tzones).geometry
taxi_zone_rings = cuspatial.GeoSeries.from_polygons_xy(
    taxi_zones.polygons.xy,
    taxi_zones.polygons.ring_offset,
    taxi_zones.polygons.part_offset,
    cudf.Series(range(len(taxi_zones.polygons.part_offset)))
)

def make_geoseries_from_lonlat(lon, lat):
    lonlat = cudf.DataFrame({"lon": lon, "lat": lat}).interleave_columns()
    return cuspatial.GeoSeries.from_points_xy(lonlat)

pickup2016 = make_geoseries_from_lonlat(taxi2016['pickup_longitude'], taxi2016['pickup_latitude'])
dropoff2016 = make_geoseries_from_lonlat(taxi2016['dropoff_longitude'], taxi2016['dropoff_latitude'])

pip_iterations = list(np.arange(0, 263, 31))
pip_iterations.append(263)
print(pip_iterations)
# --- begin point-in-polygon calculation ---#
# %%time
taxi2016['PULocationID'] = 264
taxi2016['DOLocationID'] = 264

start = pip_iterations[0]
end = pip_iterations[1]

zone = taxi_zone_rings[start:end]

# find all pickups in that zone
pickups = cuspatial.point_in_polygon(pickup2016, zone)
print(pickups)
print("---")
dropoffs = cuspatial.point_in_polygon(dropoff2016, zone)
print(dropoffs)

Just that bit of code for one combination of polygons took 21 minutes to run. And in the notebook, we're running that for 10 combinations.

"pip_iterations = list(np.arange(0, 263, 31))\n",
"pip_iterations.append(263)\n",

[0, 31, 62, 93, 124, 155, 186, 217, 248, 263]

"for i in range(len(pip_iterations)-1):\n",
" start = pip_iterations[i]\n",
" end = pip_iterations[i+1]\n",

So, conservatively, at 21 minutes per combination × 10 combinations ≈ 210 minutes, it might take around 3.5 hours for the notebook to finish in my setup. And that's making a LOT of assumptions.

@github-actions bot added the Python (Related to Python code) label on Jul 17, 2024
@@ -62,7 +62,7 @@
 "source": [
 "!if [ ! -f \"tzones_lonlat.json\" ]; then curl \"https://data.cityofnewyork.us/api/geospatial/d3c5-ddgc?method=export&format=GeoJSON\" -o tzones_lonlat.json; else echo \"tzones_lonlat.json found\"; fi\n",
 "!if [ ! -f \"taxi2016.csv\" ]; then curl https://storage.googleapis.com/anaconda-public-data/nyc-taxi/csv/2016/yellow_tripdata_2016-01.csv -o taxi2016.csv; else echo \"taxi2016.csv found\"; fi \n",
-"!if [ ! -f \"taxi2017.csv\" ]; then curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2017-01.parquet -o taxi2017.parquet; else echo \"taxi2017.csv found\"; fi"
+"!if [ ! -f \"taxi2017.parquet\" ]; then curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2017-01.parquet -o taxi2017.parquet; else echo \"taxi2017.csv found\"; fi"
@jameslamb (Member, Author) commented on this change:

This -f test was checking a different filename than is actually downloaded, so it re-downloaded the data every time this cell was run.

@jameslamb jameslamb changed the title WIP: test notebooks in 'docs/', make cuspatial_api_examples self-contained test notebooks in 'docs/', make cuspatial_api_examples self-contained Jul 18, 2024
@jameslamb jameslamb marked this pull request as ready for review July 18, 2024 14:22
@jameslamb jameslamb requested review from a team as code owners July 18, 2024 14:22
rapids-bot bot pushed a commit to rapidsai/docker that referenced this pull request Jul 18, 2024
…mit, drop cuspatial notebooks (#690)

CI has been failing here for a few weeks, as a result of some failing `cuspatial` notebooks.

As described in rapidsai/cuspatial#1406, I observed those notebooks failing like this:

```text
cannot access local variable 'execution_time' where it is not associated with a value
```

It turns out that error was coming from the `test_notebooks.py` script in this repo: it hits `return execution_time` unconditionally, even though it's possible for that variable to never have been assigned.
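For illustration, a minimal sketch of that failure mode (a hypothetical simplification, not the actual `test_notebooks.py` code; `execute_notebook()` is a made-up stand-in):

```python
import time

def execute_notebook(path):
    # made-up stand-in for running a notebook; raises to simulate a notebook error
    raise RuntimeError("cell failed")

def run_notebook(path):
    try:
        start = time.time()
        execute_notebook(path)
        execution_time = time.time() - start
    except Exception as exc:
        print(f"error in {path}: {exc}")
    # If execute_notebook() raised, `execution_time` was never assigned, so the
    # return below fails with:
    # "cannot access local variable 'execution_time' where it is not associated with a value"
    return execution_time

run_notebook("example.ipynb")
```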

This fixes that, and proposes adding a `pre-commit` configuration and corresponding CI job, to help catch such things earlier in the future.

This also proposes skipping some `cuspatial` notebooks, to get CI working. I'd hoped to address that in rapidsai/cuspatial#1407, but most `cuspatial` maintainers are busy or unavailable this week. I think we should get CI working here again and deal with the `cuspatial` notebooks as a separate concern.

## Notes for Reviewers

Any other changes you see in files here are the result of running `pre-commit run --all-files` (with some manual fixes applied to fix `ruff` warnings).

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Ray Douglass (https://github.com/raydouglass)
  - https://github.com/jakirkham
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #690
@jakirkham jakirkham requested a review from harrism July 18, 2024 20:35
@jameslamb jameslamb changed the title test notebooks in 'docs/', make cuspatial_api_examples self-contained test notebooks in 'docs/', make cuspatial_api_examples self-contained, skip long-running notebook, fix some docs Jul 18, 2024
@raydouglass raydouglass merged commit 8636b03 into rapidsai:branch-24.08 Jul 19, 2024
69 checks passed
@jameslamb jameslamb deleted the fix/example-notebook branch July 19, 2024 17:57
@jakirkham (Member)

Thanks all! 🙏

@harrism harrism mentioned this pull request Jul 31, 2024
@harrism harrism mentioned this pull request Jul 31, 2024
rapids-bot bot pushed a commit that referenced this pull request Aug 1, 2024
#1407 skipped the taxi dropoffs notebook due to a performance regression that was fixed in #1418, so this PR re-enables the notebook in CI by removing it from SKIPNBS in ci/test_notebooks.sh.

Authors:
  - Mark Harris (https://github.com/harrism)

Approvers:
  - https://github.com/jakirkham
  - Michael Wang (https://github.com/isVoid)
  - James Lamb (https://github.com/jameslamb)

URL: #1422
rapids-bot bot pushed a commit that referenced this pull request Aug 13, 2024
The notebook-testing CI job in this repo does not actually cause a loud CI failure if any errors are detected in notebooks. That's because of a `bash` mistake I made in #1407... that PR moved notebook-checking into a function, but didn't add a `set -E` to be sure errors from inside that function were appropriately trapped.

This PR fixes that:

* ensures that notebook failures actually cause CI failures
* fixes 2 typos in `nyc_taxi_years_correlation.ipynb` code
  - *(not caught in #1422 because of this CI script bug)*

Context: #1422 (review)

## Notes for Reviewers

### How I tested this

Originally did not skip any notebooks. Saw the failures in one of the notebooks cause an actual merge-blocking CI failure.

Build link: https://github.com/rapidsai/cuspatial/actions/runs/10199784404/job/28219162698?pr=1424


Authors:
  - James Lamb (https://github.com/jameslamb)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Ray Douglass (https://github.com/raydouglass)
  - Mark Harris (https://github.com/harrism)
  - https://github.com/jakirkham

URL: #1424
Labels
bug (Something isn't working), ci, conda (Related to conda and conda configuration), non-breaking (Non-breaking change), Python (Related to Python code)
Projects
Status: Todo
3 participants