Migrate azure/notebooks/Azure-MNMG-XGBoost.ipynb to deployment docs #253

Merged
merged 27 commits into main from migrate-azure-mnmg
Nov 23, 2023

Contributor

@skirui-source skirui-source commented Jul 21, 2023

Fixes #211 - migration of azure_mnmg_daskcloudprovider notebook

See #203 (comment) for detailed migration instructions.

Tasks

  • Decide if notebook should be migrated and add "migrate: X" label (if not, also close this issue)
  • Test if notebook works
  • Fix up anything that needs changing
  • Ensure notebook has good title, description and metadata tags in the first cell
  • Replace deployment instructions with links to docs pages
  • Copy notebook into a new folder into deployment docs examples
  • Copy any supporting files to the folder
  • Add notebook to examples toctree
  • Make PR to deployment docs repo

@skirui-source skirui-source marked this pull request as ready for review August 23, 2023 08:34
Member

@jacobtomlinson jacobtomlinson left a comment

Thanks @skirui-source. I made a few comments in ReviewNB. It also looks like CI is failing due to some broken links. Could you fix those up?

@skirui-source
Contributor Author

skirui-source commented Aug 30, 2023

Seeing the error below:

RuntimeError: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments. For more information, see https://docs.dask.org/en/stable/deployment-considerations.html#consistent-software-environments


The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[20], line 2
      1 tic = timer()
----> 2 X_train, y_train, X_infer, y_infer = taxi_data_loader(
      3     client,
      4     adlsaccount="azureopendatastorage",
      5     adlspath="az://nyctlc/yellow/puYear=2014/puMonth=1*/*.parquet",
      6     infer_frac=0.1,
      7     random_state=42,
      8 )
      9 toc = timer()
     10 print(f"Wall clock time taken for ETL and persisting : {toc-tic} s")

Cell In[19], line 95, in taxi_data_loader(client, adlsaccount, adlspath, response_dtype, infer_frac, random_state)
     93 response_id = "fareAmount"
     94 storage_options = {"account_name": adlsaccount}
---> 95 taxi_data = dask_cudf.read_parquet(
     96     adlspath,
     97     storage_options=storage_options,
     98     chunksize=25e6,
     99     npartitions=len(workers),
    100 )
    101 taxi_data = clean(taxi_data, must_haves)
    102 taxi_data = taxi_data.map_partitions(add_features)

File ~/anaconda3/envs/rapids-23.08/lib/python3.10/site-packages/dask_cudf/io/parquet.py:539, in read_parquet(path, columns, **kwargs)
    536         kwargs["read"] = {}
    537     kwargs["read"]["check_file_size"] = check_file_size
--> 539 return dd.read_parquet(path, columns=columns, engine=CudfEngine, **kwargs)

File ~/anaconda3/envs/rapids-23.08/lib/python3.10/site-packages/dask/backends.py:138, in CreationDispatch.register_inplace.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    136     return func(*args, **kwargs)
    137 except Exception as e:
--> 138     raise type(e)(
    139         f"An error occurred while calling the {funcname(func)} "
    140         f"method registered to the {self.backend} backend.\n"
    141         f"Original Message: {e}"
    142     ) from e

RuntimeError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments. For more information, see https://docs.dask.org/en/stable/deployment-considerations.html#consistent-software-environments
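
The error message points at mismatched software environments between the client and the scheduler. Below is a minimal sketch of how such a mismatch could be spotted, assuming a version report shaped like the dictionary returned by `distributed.Client.get_versions()` (top-level `client`, `scheduler`, and `workers` entries, each holding a `packages` mapping). The helper name `find_version_mismatches`, the sample report, and its version numbers are hypothetical and are not taken from this run.

```python
def find_version_mismatches(report):
    """Return {package: {role: version}} for packages whose versions differ."""
    # Collect the package versions reported by each role in the cluster.
    seen = {
        "client": report["client"]["packages"],
        "scheduler": report["scheduler"]["packages"],
    }
    for address, info in report.get("workers", {}).items():
        seen[f"worker {address}"] = info["packages"]

    mismatches = {}
    all_packages = set().union(*(pkgs.keys() for pkgs in seen.values()))
    for package in sorted(all_packages):
        versions = {role: pkgs.get(package) for role, pkgs in seen.items()}
        # More than one distinct version means the environments disagree.
        if len(set(versions.values())) > 1:
            mismatches[package] = versions
    return mismatches


# Hypothetical report; with a live cluster it would come from client.get_versions().
sample_report = {
    "client": {"packages": {"dask": "2023.7.1", "distributed": "2023.7.1"}},
    "scheduler": {"packages": {"dask": "2023.3.2", "distributed": "2023.3.2"}},
    "workers": {
        "tcp://10.0.0.4:40001": {
            "packages": {"dask": "2023.3.2", "distributed": "2023.3.2"}
        },
    },
}

for package, versions in find_version_mismatches(sample_report).items():
    print(package, versions)
```

With a live cluster, calling `client.get_versions(check=True)` should surface a comparable report directly; the hand-rolled comparison above only illustrates what to look for.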

@skirui-source
Contributor Author

Resolved the ForestInference load issue; all cells now work correctly! I will still need to clear all outputs, though,
as they include some of my personal info, such as my email.
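
Clearing outputs can be done from the Jupyter UI, but a scripted version is sometimes handy before committing. The sketch below assumes the standard notebook JSON layout (a top-level `cells` list in which code cells carry `outputs` and `execution_count`); the helper name `clear_outputs` and the two-cell example notebook are hypothetical.

```python
import json


def clear_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Hypothetical two-cell notebook used only for illustration.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [
                {
                    "output_type": "stream",
                    "name": "stdout",
                    "text": ["user@example.com\n"],
                }
            ],
            "source": ["print(email)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = clear_outputs(example)
print(cleaned["cells"][1]["outputs"])
```

To apply this to a file, the notebook would be read with `json.load`, passed through `clear_outputs`, and written back with `json.dump`. Note that this only removes outputs; personal details typed into cell sources would still need a manual check.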

@skirui-source
Contributor Author

@jacobtomlinson this PR is ready for another review/merge. Please take a look when you can.

Member

@jacobtomlinson jacobtomlinson left a comment

Given the age of this PR I've gone ahead and pushed the fixes I would've generally suggested in a review. These changes include:

  • Fixing the broken bash code-block that was causing the build to fail.
  • Removing the --- transitions that were causing the build to fail.
  • Adding notebook metadata tags to the first cell.
  • Removing the inline styling from headings.
  • Removing heading blocks from text that wasn't a heading.
  • Converting bolded "Note:" sections to MyST ```{note} ... ``` admonitions.
  • Fixing a URL that wasn't a link.
  • Replacing the hard-coded container image with the {{rapids_container}} template.
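
As an illustration of the note conversion in the list above, here is a rough sketch that rewrites a single-line bolded note as a MyST note admonition. The exact wording and markup of the notes in the notebook are an assumption (a paragraph starting with `**Note:**`), and `convert_notes` is a hypothetical helper, not something used in this PR.

```python
import re

FENCE = "`" * 3  # built this way to avoid a literal fence inside this example

# Matches a single-line paragraph such as "**Note:** text" or "**Note**: text".
NOTE_PATTERN = re.compile(r"^\*\*Note:?\*\*:?\s*(?P<body>.+)$", re.MULTILINE)


def convert_notes(markdown_text):
    """Rewrite bolded single-line notes as MyST note admonitions."""

    def replace(match):
        body = match.group("body").strip()
        return f"{FENCE}{{note}}\n{body}\n{FENCE}"

    return NOTE_PATTERN.sub(replace, markdown_text)


print(convert_notes("**Note:** Use a GPU-enabled VM size."))
```

Multi-line notes, or notes written with different emphasis markup, would need a more careful pattern or a manual edit.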

I reviewed this notebook in three different ways in order to make these fixes:

  • Read and googled errors from the failing build log; this showed what needed to change to get the build passing.
  • Opened the notebook in VSCode and reviewed the source; this identified things like the inline header styling and note sections.
  • Viewed the rendered page and workflows gallery page from the ReadTheDocs build preview; this surfaced things like the URL that wasn't a link and the missing metadata tags.

@jacobtomlinson jacobtomlinson merged commit 0b709aa into main Nov 23, 2023
4 checks passed
@jacobtomlinson jacobtomlinson deleted the migrate-azure-mnmg branch November 23, 2023 10:14