-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Categorify #1692
Refactor Categorify #1692
Conversation
Click to view CI ResultsGitHub pull request #1692 of commit a79a37e2979f11f10a0ad02af9711b466974a00f, no merge conflicts. Running as SYSTEM Setting status of a79a37e2979f11f10a0ad02af9711b466974a00f to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4748/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1692/*:refs/remotes/origin/pr/1692/* # timeout=10 > git rev-parse a79a37e2979f11f10a0ad02af9711b466974a00f^{commit} # timeout=10 Checking out Revision a79a37e2979f11f10a0ad02af9711b466974a00f (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f a79a37e2979f11f10a0ad02af9711b466974a00f # timeout=10 Commit message: "uuse np.arange(... like)" > git rev-list --no-walk e91cf945a5ce3f36729309b3ac333181c6604bcb # timeout=10 [nvtabular_tests] $ /bin/bash /tmp/jenkins6871348817336990919.sh GLOB sdist-make: /var/jenkins_home/workspace/nvtabular_tests/nvtabular/setup.py test-gpu create: /var/jenkins_home/workspace/nvtabular_tests/nvtabular/.tox/test-gpu test-gpu installdeps: pytest, pytest-cov WARNING: Discarding $PYTHONPATH from environment, to override specify PYTHONPATH in 'passenv' in your configuration. test-gpu inst: /var/jenkins_home/workspace/nvtabular_tests/nvtabular/.tox/.tmp/package/1/nvtabular-1.5.0+21.ga79a37e29.zip WARNING: Discarding $PYTHONPATH from environment, to override specify PYTHONPATH in 'passenv' in your configuration. test-gpu installed: absl-py==1.2.0,aiohttp==3.8.1,aiosignal==1.2.0,alabaster==0.7.12,anyio==3.6.1,argon2-cffi==21.3.0,argon2-cffi-bindings==21.2.0,astroid==2.5.6,asttokens==2.0.8,astunparse==1.6.3,asv==0.5.1,asvdb==0.4.2,async-timeout==4.0.2,attrs==22.1.0,awscli==1.25.90,Babel==2.10.3,backcall==0.2.0,beautifulsoup4==4.11.1,betterproto==1.2.5,black==22.6.0,bleach==5.0.1,boto3==1.24.75,botocore==1.27.89,Brotli==1.0.9,cachetools==5.2.0,certifi==2019.11.28,cffi==1.15.1,chardet==3.0.4,charset-normalizer==2.1.1,clang==5.0,click==8.1.3,cloudpickle==2.2.0,cmake==3.24.1.1,colorama==0.4.4,contourpy==1.0.5,coverage==6.5.0,cuda-python==11.7.1,cupy-cuda117==10.6.0,cycler==0.11.0,Cython==0.29.32,dask==2022.1.1,dbus-python==1.2.16,debugpy==1.6.3,decorator==5.1.1,defusedxml==0.7.1,dill==0.3.5.1,distlib==0.3.6,distributed==2022.5.1,distro==1.7.0,dm-tree==0.1.6,docker-pycreds==0.4.0,docutils==0.16,emoji==1.7.0,entrypoints==0.4,execnet==1.9.0,executing==1.0.0,faiss==1.7.2,faiss-gpu==1.7.2,fastai==2.7.9,fastapi==0.85.0,fastavro==1.6.1,fastcore==1.5.27,fastdownload==0.0.7,fastjsonschema==2.16.1,fastprogress==1.0.3,fastrlock==0.8,feast==0.19.4,fiddle==0.2.2,filelock==3.8.0,flatbuffers==1.12,fonttools==4.37.3,frozenlist==1.3.1,fsspec==2022.5.0,gast==0.4.0,gevent==21.12.0,geventhttpclient==2.0.2,gitdb==4.0.9,GitPython==3.1.27,google==3.0.0,google-api-core==2.10.1,google-auth==2.11.1,google-auth-oauthlib==0.4.6,google-pasta==0.2.0,googleapis-common-protos==1.52.0,graphviz==0.20.1,greenlet==1.1.3,grpcio==1.41.0,grpcio-channelz==1.49.0,grpcio-reflection==1.48.1,grpclib==0.4.3,h11==0.13.0,h2==4.1.0,h5py==3.7.0,HeapDict==1.0.1,hpack==4.0.0,httptools==0.5.0,hugectr2onnx==0.0.0,huggingface-hub==0.9.1,hyperframe==6.0.1,idna==2.8,imagesize==1.4.1,implicit==0.6.1,importlib-metadata==4.12.0,importlib-resources==5.9.0,iniconfig==1.1.1,ipykernel==6.15.3,ipython==8.5.0,ipython-genutils==0.2.0,ipywidgets==7.7.0,jedi==0.18.1,Jinja2==3.1.2,jmespath==1.0.1,joblib==1.2.0,json5==0.9.10,jsonschema==4.16.0,jupyter-cache==0.4.3,jupyter-core==4.11.1,jupyter-server==1.18.1,jupyter-server-mathjax==0.2.5,jupyter-sphinx==0.3.2,jupyter_client==7.3.5,jupyterlab==3.4.7,jupyterlab-pygments==0.2.2,jupyterlab-widgets==1.1.0,jupyterlab_server==2.15.1,keras==2.9.0,Keras-Preprocessing==1.1.2,kiwisolver==1.4.4,lazy-object-proxy==1.7.1,libclang==14.0.6,libcst==0.4.7,lightfm==1.16,lightgbm==3.3.2,linkify-it-py==1.0.3,llvmlite==0.39.1,locket==1.0.0,lxml==4.9.1,Markdown==3.4.1,markdown-it-py==1.1.0,MarkupSafe==2.1.1,matplotlib==3.6.0,matplotlib-inline==0.1.6,mdit-py-plugins==0.2.8,merlin-core==0.6.0+1.g5926fcf,merlin-models==0.7.0+11.g280956aa4,merlin-systems==0.5.0+4.g15074ad,mistune==2.0.4,mmh3==3.0.0,mpi4py==3.1.3,msgpack==1.0.4,multidict==6.0.2,mypy-extensions==0.4.3,myst-nb==0.13.2,myst-parser==0.15.2,natsort==8.1.0,nbclassic==0.4.3,nbclient==0.6.8,nbconvert==7.0.0,nbdime==3.1.1,nbformat==5.5.0,nest-asyncio==1.5.5,ninja==1.10.2.3,notebook==6.4.12,notebook-shim==0.1.0,numba==0.56.2,numpy==1.22.4,nvidia-pyindex==1.0.9,-e git+https://github.com/NVIDIA-Merlin/NVTabular.git@a79a37e2979f11f10a0ad02af9711b466974a00f#egg=nvtabular,nvtx==0.2.5,oauthlib==3.2.1,oldest-supported-numpy==2022.8.16,onnx==1.12.0,onnxruntime==1.11.1,opt-einsum==3.3.0,packaging==21.3,pandas==1.3.5,pandavro==1.5.2,pandocfilters==1.5.0,parso==0.8.3,partd==1.3.0,pathtools==0.1.2,pexpect==4.8.0,pickleshare==0.7.5,Pillow==9.2.0,pkgutil_resolve_name==1.3.10,platformdirs==2.5.2,pluggy==1.0.0,prometheus-client==0.14.1,promise==2.3,prompt-toolkit==3.0.31,proto-plus==1.19.6,protobuf==3.19.5,psutil==5.9.2,ptyprocess==0.7.0,pure-eval==0.2.2,py==1.11.0,pyarrow==7.0.0,pyasn1==0.4.8,pyasn1-modules==0.2.8,pybind11==2.10.0,pycparser==2.21,pydantic==1.10.2,pydot==1.4.2,Pygments==2.13.0,PyGObject==3.36.0,pynvml==11.4.1,pyparsing==3.0.9,pyrsistent==0.18.1,pytest==7.1.3,pytest-cov==4.0.0,pytest-forked==1.4.0,pytest-xdist==2.5.0,python-apt==2.0.0+ubuntu0.20.4.8,python-dateutil==2.8.2,python-dotenv==0.21.0,python-rapidjson==1.8,pytz==2022.2.1,PyYAML==5.4.1,pyzmq==24.0.0,regex==2022.9.13,requests==2.22.0,requests-oauthlib==1.3.1,requests-unixsocket==0.2.0,rsa==4.7.2,s3fs==2022.2.0,s3transfer==0.6.0,sacremoses==0.0.53,scikit-build==0.15.0,scikit-learn==1.1.2,scipy==1.9.1,seedir==0.3.0,Send2Trash==1.8.0,sentry-sdk==1.9.8,setproctitle==1.3.2,setuptools-scm==7.0.5,shortuuid==1.0.9,six==1.15.0,sklearn==0.0,smmap==5.0.0,sniffio==1.3.0,snowballstemmer==2.2.0,sortedcontainers==2.4.0,soupsieve==2.3.2.post1,Sphinx==5.2.3,sphinx-multiversion==0.2.4,sphinx-togglebutton==0.3.1,sphinx_external_toc==0.3.0,sphinxcontrib-applehelp==1.0.2,sphinxcontrib-copydirs @ git+https://github.com/mikemckiernan/sphinxcontrib-copydirs.git@bd8c5d79b3f91cf5f1bb0d6995aeca3fe84b670e,sphinxcontrib-devhelp==1.0.2,sphinxcontrib-htmlhelp==2.0.0,sphinxcontrib-jsmath==1.0.1,sphinxcontrib-qthelp==1.0.3,sphinxcontrib-serializinghtml==1.1.5,SQLAlchemy==1.4.36,stack-data==0.5.0,starlette==0.20.4,stringcase==1.2.0,supervisor==4.1.0,tabulate==0.8.10,tblib==1.7.0,tdqm==0.0.1,tenacity==8.0.1,tensorboard==2.9.1,tensorboard-data-server==0.6.1,tensorboard-plugin-wit==1.8.1,tensorflow==2.6.2,tensorflow-estimator==2.9.0,tensorflow-gpu==2.9.2,tensorflow-io-gcs-filesystem==0.27.0,tensorflow-metadata==1.10.0,termcolor==2.0.1,terminado==0.15.0,testbook==0.4.2,threadpoolctl==3.1.0,tinycss2==1.1.1,tokenizers==0.10.3,toml==0.10.2,tomli==2.0.1,toolz==0.12.0,torch==1.12.1+cu113,torchmetrics==0.3.2,tornado==6.2,tox==3.26.0,tqdm==4.64.1,traitlets==5.4.0,transformers==4.12.0,transformers4rec==0.1.12+2.gbcc939255,treelite==2.3.0,treelite-runtime==2.3.0,tritonclient==2.25.0,typing-inspect==0.8.0,typing_extensions==4.3.0,uc-micro-py==1.0.1,urllib3==1.26.12,uvicorn==0.18.3,uvloop==0.17.0,versioneer==0.20,virtualenv==20.16.5,wandb==0.13.3,watchfiles==0.17.0,wcwidth==0.2.5,webencodings==0.5.1,websocket-client==1.4.1,websockets==10.3,Werkzeug==2.2.2,widgetsnbextension==3.6.0,wrapt==1.12.1,xgboost==1.6.2,yarl==1.8.1,zict==2.2.0,zipp==3.8.1,zope.event==4.5.0,zope.interface==5.4.0 test-gpu run-test-pre: PYTHONHASHSEED='156367538' test-gpu run-test: commands[0] | python -m pip install --upgrade git+https://github.com/NVIDIA-Merlin/core.git Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com Collecting git+https://github.com/NVIDIA-Merlin/core.git Cloning https://github.com/NVIDIA-Merlin/core.git to /tmp/pip-req-build-2gphxscx Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA-Merlin/core.git /tmp/pip-req-build-2gphxscx Resolved https://github.com/NVIDIA-Merlin/core.git to commit 14a18dc0de5d5fd7737ecbadf9f6d7fa5d801b67 Installing build dependencies: started Installing build dependencies: finished with status 'done' Getting requirements to build wheel: started Getting requirements to build wheel: finished with status 'done' Preparing metadata (pyproject.toml): started Preparing metadata (pyproject.toml): finished with status 'done' Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.7.0+9.g14a18dc) (21.3) Requirement already satisfied: pandas<1.4.0dev0,>=1.2.0 in /var/jenkins_home/.local/lib/python3.8/site-packages (from merlin-core==0.7.0+9.g14a18dc) (1.3.5) Requirement already satisfied: protobuf>=3.0.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.7.0+9.g14a18dc) (3.19.5) Requirement already satisfied: tqdm>=4.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.7.0+9.g14a18dc) (4.64.1) Requirement already satisfied: betterproto<2.0.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.7.0+9.g14a18dc) (1.2.5) Requirement already satisfied: fsspec==2022.5.0 in /var/jenkins_home/.local/lib/python3.8/site-packages (from merlin-core==0.7.0+9.g14a18dc) (2022.5.0) Requirement already satisfied: pyarrow>=5.0.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.7.0+9.g14a18dc) (7.0.0) Requirement already satisfied: dask>=2022.3.0 in /var/jenkins_home/.local/lib/python3.8/site-packages (from merlin-core==0.7.0+9.g14a18dc) (2022.3.0) Requirement already satisfied: tensorflow-metadata>=1.2.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.7.0+9.g14a18dc) (1.10.0) Requirement already satisfied: distributed>=2022.3.0 in /var/jenkins_home/.local/lib/python3.8/site-packages (from merlin-core==0.7.0+9.g14a18dc) (2022.3.0) Requirement already satisfied: numba>=0.54 in /var/jenkins_home/.local/lib/python3.8/site-packages (from merlin-core==0.7.0+9.g14a18dc) (0.55.1) Requirement already satisfied: stringcase in /usr/local/lib/python3.8/dist-packages (from betterproto<2.0.0->merlin-core==0.7.0+9.g14a18dc) (1.2.0) Requirement already satisfied: grpclib in /usr/local/lib/python3.8/dist-packages (from betterproto<2.0.0->merlin-core==0.7.0+9.g14a18dc) (0.4.3) Requirement already satisfied: toolz>=0.8.2 in /usr/local/lib/python3.8/dist-packages (from dask>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (0.12.0) Requirement already satisfied: partd>=0.3.10 in /var/jenkins_home/.local/lib/python3.8/site-packages/partd-1.2.0-py3.8.egg (from dask>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (1.2.0) Requirement already satisfied: cloudpickle>=1.1.1 in /usr/local/lib/python3.8/dist-packages (from dask>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (2.2.0) Requirement already satisfied: pyyaml>=5.3.1 in /var/jenkins_home/.local/lib/python3.8/site-packages/PyYAML-5.4.1-py3.8-linux-x86_64.egg (from dask>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (5.4.1) Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /var/jenkins_home/.local/lib/python3.8/site-packages/sortedcontainers-2.4.0-py3.8.egg (from distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (2.4.0) Requirement already satisfied: click>=6.6 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (8.1.3) Requirement already satisfied: psutil>=5.0 in /var/jenkins_home/.local/lib/python3.8/site-packages/psutil-5.8.0-py3.8-linux-x86_64.egg (from distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (5.8.0) Requirement already satisfied: msgpack>=0.6.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (1.0.4) Requirement already satisfied: jinja2 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (3.1.2) Requirement already satisfied: tblib>=1.6.0 in /var/jenkins_home/.local/lib/python3.8/site-packages/tblib-1.7.0-py3.8.egg (from distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (1.7.0) Requirement already satisfied: tornado>=6.0.3 in /var/jenkins_home/.local/lib/python3.8/site-packages/tornado-6.1-py3.8-linux-x86_64.egg (from distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (6.1) Requirement already satisfied: zict>=0.1.3 in /var/jenkins_home/.local/lib/python3.8/site-packages/zict-2.0.0-py3.8.egg (from distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (2.0.0) Requirement already satisfied: numpy<1.22,>=1.18 in /var/jenkins_home/.local/lib/python3.8/site-packages (from numba>=0.54->merlin-core==0.7.0+9.g14a18dc) (1.20.3) Requirement already satisfied: llvmlite<0.39,>=0.38.0rc1 in ./.tox/test-gpu/lib/python3.8/site-packages (from numba>=0.54->merlin-core==0.7.0+9.g14a18dc) (0.38.1) Requirement already satisfied: setuptools in ./.tox/test-gpu/lib/python3.8/site-packages (from numba>=0.54->merlin-core==0.7.0+9.g14a18dc) (65.3.0) Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging->merlin-core==0.7.0+9.g14a18dc) (3.0.9) Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->merlin-core==0.7.0+9.g14a18dc) (2022.2.1) Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->merlin-core==0.7.0+9.g14a18dc) (2.8.2) Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in /usr/local/lib/python3.8/dist-packages (from tensorflow-metadata>=1.2.0->merlin-core==0.7.0+9.g14a18dc) (1.52.0) Requirement already satisfied: absl-py<2.0.0,>=0.9 in /usr/local/lib/python3.8/dist-packages (from tensorflow-metadata>=1.2.0->merlin-core==0.7.0+9.g14a18dc) (1.2.0) Requirement already satisfied: locket in /var/jenkins_home/.local/lib/python3.8/site-packages/locket-0.2.1-py3.8.egg (from partd>=0.3.10->dask>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (0.2.1) Requirement already satisfied: six>=1.5 in /var/jenkins_home/.local/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas<1.4.0dev0,>=1.2.0->merlin-core==0.7.0+9.g14a18dc) (1.15.0) Requirement already satisfied: heapdict in /var/jenkins_home/.local/lib/python3.8/site-packages/HeapDict-1.0.1-py3.8.egg (from zict>=0.1.3->distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (1.0.1) Requirement already satisfied: h2<5,>=3.1.0 in /usr/local/lib/python3.8/dist-packages (from grpclib->betterproto<2.0.0->merlin-core==0.7.0+9.g14a18dc) (4.1.0) Requirement already satisfied: multidict in /usr/local/lib/python3.8/dist-packages (from grpclib->betterproto<2.0.0->merlin-core==0.7.0+9.g14a18dc) (6.0.2) Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.8/dist-packages (from jinja2->distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (2.1.1) Requirement already satisfied: hpack<5,>=4.0 in /usr/local/lib/python3.8/dist-packages (from h2<5,>=3.1.0->grpclib->betterproto<2.0.0->merlin-core==0.7.0+9.g14a18dc) (4.0.0) Requirement already satisfied: hyperframe<7,>=6.0 in /usr/local/lib/python3.8/dist-packages (from h2<5,>=3.1.0->grpclib->betterproto<2.0.0->merlin-core==0.7.0+9.g14a18dc) (6.0.1) Building wheels for collected packages: merlin-core Building wheel for merlin-core (pyproject.toml): started Building wheel for merlin-core (pyproject.toml): finished with status 'done' Created wheel for merlin-core: filename=merlin_core-0.7.0+9.g14a18dc-py3-none-any.whl size=118253 sha256=50deeab28814e390243922ff18c7e10f9245f8fc8253231cf8b31d6a35b46721 Stored in directory: /tmp/pip-ephem-wheel-cache-i6179l5j/wheels/c8/38/16/a6968787eafcec5fa772148af8408b089562f71af0752e8e84 Successfully built merlin-core Installing collected packages: merlin-core Attempting uninstall: merlin-core Found existing installation: merlin-core 0.3.0+12.g78ecddd Not uninstalling merlin-core at /var/jenkins_home/.local/lib/python3.8/site-packages, outside environment /var/jenkins_home/workspace/nvtabular_tests/nvtabular/.tox/test-gpu Can't uninstall 'merlin-core'. No files were found to uninstall. Successfully installed merlin-core-0.7.0+9.g14a18dc test-gpu run-test: commands[1] | python -m pytest --cov-report term --cov merlin -rxs tests/unit ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0 cachedir: .tox/test-gpu/.pytest_cache rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-4.0.0 collected 1441 items / 1 skipped |
Click to view CI ResultsGitHub pull request #1692 of commit 79886ef0e32859edcc0e9d840b45c14c268074f5, no merge conflicts. Running as SYSTEM Setting status of 79886ef0e32859edcc0e9d840b45c14c268074f5 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4749/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1692/*:refs/remotes/origin/pr/1692/* # timeout=10 > git rev-parse 79886ef0e32859edcc0e9d840b45c14c268074f5^{commit} # timeout=10 Checking out Revision 79886ef0e32859edcc0e9d840b45c14c268074f5 (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 79886ef0e32859edcc0e9d840b45c14c268074f5 # timeout=10 Commit message: "Merge branch 'main' into refactor-categorify-2" > git rev-list --no-walk a79a37e2979f11f10a0ad02af9711b466974a00f # timeout=10 [nvtabular_tests] $ /bin/bash /tmp/jenkins15613217120649781849.sh GLOB sdist-make: /var/jenkins_home/workspace/nvtabular_tests/nvtabular/setup.py test-gpu create: /var/jenkins_home/workspace/nvtabular_tests/nvtabular/.tox/test-gpu test-gpu installdeps: pytest, pytest-cov WARNING: Discarding $PYTHONPATH from environment, to override specify PYTHONPATH in 'passenv' in your configuration. test-gpu inst: /var/jenkins_home/workspace/nvtabular_tests/nvtabular/.tox/.tmp/package/1/nvtabular-1.5.0+23.g79886ef0e.zip WARNING: Discarding $PYTHONPATH from environment, to override specify PYTHONPATH in 'passenv' in your configuration. test-gpu installed: absl-py==1.2.0,aiohttp==3.8.1,aiosignal==1.2.0,alabaster==0.7.12,anyio==3.6.1,argon2-cffi==21.3.0,argon2-cffi-bindings==21.2.0,astroid==2.5.6,asttokens==2.0.8,astunparse==1.6.3,asv==0.5.1,asvdb==0.4.2,async-timeout==4.0.2,attrs==22.1.0,awscli==1.25.90,Babel==2.10.3,backcall==0.2.0,beautifulsoup4==4.11.1,betterproto==1.2.5,black==22.6.0,bleach==5.0.1,boto3==1.24.75,botocore==1.27.89,Brotli==1.0.9,cachetools==5.2.0,certifi==2019.11.28,cffi==1.15.1,chardet==3.0.4,charset-normalizer==2.1.1,clang==5.0,click==8.1.3,cloudpickle==2.2.0,cmake==3.24.1.1,colorama==0.4.4,contourpy==1.0.5,coverage==6.5.0,cuda-python==11.7.1,cupy-cuda117==10.6.0,cycler==0.11.0,Cython==0.29.32,dask==2022.1.1,dbus-python==1.2.16,debugpy==1.6.3,decorator==5.1.1,defusedxml==0.7.1,dill==0.3.5.1,distlib==0.3.6,distributed==2022.5.1,distro==1.7.0,dm-tree==0.1.6,docker-pycreds==0.4.0,docutils==0.16,emoji==1.7.0,entrypoints==0.4,execnet==1.9.0,executing==1.0.0,faiss==1.7.2,faiss-gpu==1.7.2,fastai==2.7.9,fastapi==0.85.0,fastavro==1.6.1,fastcore==1.5.27,fastdownload==0.0.7,fastjsonschema==2.16.1,fastprogress==1.0.3,fastrlock==0.8,feast==0.19.4,fiddle==0.2.2,filelock==3.8.0,flatbuffers==1.12,fonttools==4.37.3,frozenlist==1.3.1,fsspec==2022.5.0,gast==0.4.0,gevent==21.12.0,geventhttpclient==2.0.2,gitdb==4.0.9,GitPython==3.1.27,google==3.0.0,google-api-core==2.10.1,google-auth==2.11.1,google-auth-oauthlib==0.4.6,google-pasta==0.2.0,googleapis-common-protos==1.52.0,graphviz==0.20.1,greenlet==1.1.3,grpcio==1.41.0,grpcio-channelz==1.49.0,grpcio-reflection==1.48.1,grpclib==0.4.3,h11==0.13.0,h2==4.1.0,h5py==3.7.0,HeapDict==1.0.1,hpack==4.0.0,httptools==0.5.0,hugectr2onnx==0.0.0,huggingface-hub==0.9.1,hyperframe==6.0.1,idna==2.8,imagesize==1.4.1,implicit==0.6.1,importlib-metadata==4.12.0,importlib-resources==5.9.0,iniconfig==1.1.1,ipykernel==6.15.3,ipython==8.5.0,ipython-genutils==0.2.0,ipywidgets==7.7.0,jedi==0.18.1,Jinja2==3.1.2,jmespath==1.0.1,joblib==1.2.0,json5==0.9.10,jsonschema==4.16.0,jupyter-cache==0.4.3,jupyter-core==4.11.1,jupyter-server==1.18.1,jupyter-server-mathjax==0.2.5,jupyter-sphinx==0.3.2,jupyter_client==7.3.5,jupyterlab==3.4.7,jupyterlab-pygments==0.2.2,jupyterlab-widgets==1.1.0,jupyterlab_server==2.15.1,keras==2.9.0,Keras-Preprocessing==1.1.2,kiwisolver==1.4.4,lazy-object-proxy==1.7.1,libclang==14.0.6,libcst==0.4.7,lightfm==1.16,lightgbm==3.3.2,linkify-it-py==1.0.3,llvmlite==0.39.1,locket==1.0.0,lxml==4.9.1,Markdown==3.4.1,markdown-it-py==1.1.0,MarkupSafe==2.1.1,matplotlib==3.6.0,matplotlib-inline==0.1.6,mdit-py-plugins==0.2.8,merlin-core==0.6.0+1.g5926fcf,merlin-models==0.7.0+11.g280956aa4,merlin-systems==0.5.0+4.g15074ad,mistune==2.0.4,mmh3==3.0.0,mpi4py==3.1.3,msgpack==1.0.4,multidict==6.0.2,mypy-extensions==0.4.3,myst-nb==0.13.2,myst-parser==0.15.2,natsort==8.1.0,nbclassic==0.4.3,nbclient==0.6.8,nbconvert==7.0.0,nbdime==3.1.1,nbformat==5.5.0,nest-asyncio==1.5.5,ninja==1.10.2.3,notebook==6.4.12,notebook-shim==0.1.0,numba==0.56.2,numpy==1.22.4,nvidia-pyindex==1.0.9,-e git+https://github.com/NVIDIA-Merlin/NVTabular.git@79886ef0e32859edcc0e9d840b45c14c268074f5#egg=nvtabular,nvtx==0.2.5,oauthlib==3.2.1,oldest-supported-numpy==2022.8.16,onnx==1.12.0,onnxruntime==1.11.1,opt-einsum==3.3.0,packaging==21.3,pandas==1.3.5,pandavro==1.5.2,pandocfilters==1.5.0,parso==0.8.3,partd==1.3.0,pathtools==0.1.2,pexpect==4.8.0,pickleshare==0.7.5,Pillow==9.2.0,pkgutil_resolve_name==1.3.10,platformdirs==2.5.2,pluggy==1.0.0,prometheus-client==0.14.1,promise==2.3,prompt-toolkit==3.0.31,proto-plus==1.19.6,protobuf==3.19.5,psutil==5.9.2,ptyprocess==0.7.0,pure-eval==0.2.2,py==1.11.0,pyarrow==7.0.0,pyasn1==0.4.8,pyasn1-modules==0.2.8,pybind11==2.10.0,pycparser==2.21,pydantic==1.10.2,pydot==1.4.2,Pygments==2.13.0,PyGObject==3.36.0,pynvml==11.4.1,pyparsing==3.0.9,pyrsistent==0.18.1,pytest==7.1.3,pytest-cov==4.0.0,pytest-forked==1.4.0,pytest-xdist==2.5.0,python-apt==2.0.0+ubuntu0.20.4.8,python-dateutil==2.8.2,python-dotenv==0.21.0,python-rapidjson==1.8,pytz==2022.2.1,PyYAML==5.4.1,pyzmq==24.0.0,regex==2022.9.13,requests==2.22.0,requests-oauthlib==1.3.1,requests-unixsocket==0.2.0,rsa==4.7.2,s3fs==2022.2.0,s3transfer==0.6.0,sacremoses==0.0.53,scikit-build==0.15.0,scikit-learn==1.1.2,scipy==1.9.1,seedir==0.3.0,Send2Trash==1.8.0,sentry-sdk==1.9.8,setproctitle==1.3.2,setuptools-scm==7.0.5,shortuuid==1.0.9,six==1.15.0,sklearn==0.0,smmap==5.0.0,sniffio==1.3.0,snowballstemmer==2.2.0,sortedcontainers==2.4.0,soupsieve==2.3.2.post1,Sphinx==5.2.3,sphinx-multiversion==0.2.4,sphinx-togglebutton==0.3.1,sphinx_external_toc==0.3.0,sphinxcontrib-applehelp==1.0.2,sphinxcontrib-copydirs @ git+https://github.com/mikemckiernan/sphinxcontrib-copydirs.git@bd8c5d79b3f91cf5f1bb0d6995aeca3fe84b670e,sphinxcontrib-devhelp==1.0.2,sphinxcontrib-htmlhelp==2.0.0,sphinxcontrib-jsmath==1.0.1,sphinxcontrib-qthelp==1.0.3,sphinxcontrib-serializinghtml==1.1.5,SQLAlchemy==1.4.36,stack-data==0.5.0,starlette==0.20.4,stringcase==1.2.0,supervisor==4.1.0,tabulate==0.8.10,tblib==1.7.0,tdqm==0.0.1,tenacity==8.0.1,tensorboard==2.9.1,tensorboard-data-server==0.6.1,tensorboard-plugin-wit==1.8.1,tensorflow==2.6.2,tensorflow-estimator==2.9.0,tensorflow-gpu==2.9.2,tensorflow-io-gcs-filesystem==0.27.0,tensorflow-metadata==1.10.0,termcolor==2.0.1,terminado==0.15.0,testbook==0.4.2,threadpoolctl==3.1.0,tinycss2==1.1.1,tokenizers==0.10.3,toml==0.10.2,tomli==2.0.1,toolz==0.12.0,torch==1.12.1+cu113,torchmetrics==0.3.2,tornado==6.2,tox==3.26.0,tqdm==4.64.1,traitlets==5.4.0,transformers==4.12.0,transformers4rec==0.1.12+2.gbcc939255,treelite==2.3.0,treelite-runtime==2.3.0,tritonclient==2.25.0,typing-inspect==0.8.0,typing_extensions==4.3.0,uc-micro-py==1.0.1,urllib3==1.26.12,uvicorn==0.18.3,uvloop==0.17.0,versioneer==0.20,virtualenv==20.16.5,wandb==0.13.3,watchfiles==0.17.0,wcwidth==0.2.5,webencodings==0.5.1,websocket-client==1.4.1,websockets==10.3,Werkzeug==2.2.2,widgetsnbextension==3.6.0,wrapt==1.12.1,xgboost==1.6.2,yarl==1.8.1,zict==2.2.0,zipp==3.8.1,zope.event==4.5.0,zope.interface==5.4.0 test-gpu run-test-pre: PYTHONHASHSEED='620898182' test-gpu run-test: commands[0] | python -m pip install --upgrade git+https://github.com/NVIDIA-Merlin/core.git Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com Collecting git+https://github.com/NVIDIA-Merlin/core.git Cloning https://github.com/NVIDIA-Merlin/core.git to /tmp/pip-req-build-zpu16iu7 Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA-Merlin/core.git /tmp/pip-req-build-zpu16iu7 Resolved https://github.com/NVIDIA-Merlin/core.git to commit 14a18dc0de5d5fd7737ecbadf9f6d7fa5d801b67 Installing build dependencies: started Installing build dependencies: finished with status 'done' Getting requirements to build wheel: started Getting requirements to build wheel: finished with status 'done' Preparing metadata (pyproject.toml): started Preparing metadata (pyproject.toml): finished with status 'done' Requirement already satisfied: dask>=2022.3.0 in /var/jenkins_home/.local/lib/python3.8/site-packages (from merlin-core==0.7.0+9.g14a18dc) (2022.3.0) Requirement already satisfied: pandas<1.4.0dev0,>=1.2.0 in /var/jenkins_home/.local/lib/python3.8/site-packages (from merlin-core==0.7.0+9.g14a18dc) (1.3.5) Requirement already satisfied: tensorflow-metadata>=1.2.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.7.0+9.g14a18dc) (1.10.0) Requirement already satisfied: pyarrow>=5.0.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.7.0+9.g14a18dc) (7.0.0) Requirement already satisfied: fsspec==2022.5.0 in /var/jenkins_home/.local/lib/python3.8/site-packages (from merlin-core==0.7.0+9.g14a18dc) (2022.5.0) Requirement already satisfied: numba>=0.54 in /var/jenkins_home/.local/lib/python3.8/site-packages (from merlin-core==0.7.0+9.g14a18dc) (0.55.1) Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.7.0+9.g14a18dc) (21.3) Requirement already satisfied: protobuf>=3.0.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.7.0+9.g14a18dc) (3.19.5) Requirement already satisfied: distributed>=2022.3.0 in /var/jenkins_home/.local/lib/python3.8/site-packages (from merlin-core==0.7.0+9.g14a18dc) (2022.3.0) Requirement already satisfied: betterproto<2.0.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.7.0+9.g14a18dc) (1.2.5) Requirement already satisfied: tqdm>=4.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.7.0+9.g14a18dc) (4.64.1) Requirement already satisfied: grpclib in /usr/local/lib/python3.8/dist-packages (from betterproto<2.0.0->merlin-core==0.7.0+9.g14a18dc) (0.4.3) Requirement already satisfied: stringcase in /usr/local/lib/python3.8/dist-packages (from betterproto<2.0.0->merlin-core==0.7.0+9.g14a18dc) (1.2.0) Requirement already satisfied: cloudpickle>=1.1.1 in /usr/local/lib/python3.8/dist-packages (from dask>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (2.2.0) Requirement already satisfied: toolz>=0.8.2 in /usr/local/lib/python3.8/dist-packages (from dask>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (0.12.0) Requirement already satisfied: partd>=0.3.10 in /var/jenkins_home/.local/lib/python3.8/site-packages/partd-1.2.0-py3.8.egg (from dask>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (1.2.0) Requirement already satisfied: pyyaml>=5.3.1 in /var/jenkins_home/.local/lib/python3.8/site-packages/PyYAML-5.4.1-py3.8-linux-x86_64.egg (from dask>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (5.4.1) Requirement already satisfied: msgpack>=0.6.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (1.0.4) Requirement already satisfied: zict>=0.1.3 in /var/jenkins_home/.local/lib/python3.8/site-packages/zict-2.0.0-py3.8.egg (from distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (2.0.0) Requirement already satisfied: click>=6.6 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (8.1.3) Requirement already satisfied: psutil>=5.0 in /var/jenkins_home/.local/lib/python3.8/site-packages/psutil-5.8.0-py3.8-linux-x86_64.egg (from distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (5.8.0) Requirement already satisfied: tornado>=6.0.3 in /var/jenkins_home/.local/lib/python3.8/site-packages/tornado-6.1-py3.8-linux-x86_64.egg (from distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (6.1) Requirement already satisfied: jinja2 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (3.1.2) Requirement already satisfied: tblib>=1.6.0 in /var/jenkins_home/.local/lib/python3.8/site-packages/tblib-1.7.0-py3.8.egg (from distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (1.7.0) Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /var/jenkins_home/.local/lib/python3.8/site-packages/sortedcontainers-2.4.0-py3.8.egg (from distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (2.4.0) Requirement already satisfied: numpy<1.22,>=1.18 in /var/jenkins_home/.local/lib/python3.8/site-packages (from numba>=0.54->merlin-core==0.7.0+9.g14a18dc) (1.20.3) Requirement already satisfied: llvmlite<0.39,>=0.38.0rc1 in ./.tox/test-gpu/lib/python3.8/site-packages (from numba>=0.54->merlin-core==0.7.0+9.g14a18dc) (0.38.1) Requirement already satisfied: setuptools in ./.tox/test-gpu/lib/python3.8/site-packages (from numba>=0.54->merlin-core==0.7.0+9.g14a18dc) (65.3.0) Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging->merlin-core==0.7.0+9.g14a18dc) (3.0.9) Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->merlin-core==0.7.0+9.g14a18dc) (2.8.2) Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->merlin-core==0.7.0+9.g14a18dc) (2022.2.1) Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in /usr/local/lib/python3.8/dist-packages (from tensorflow-metadata>=1.2.0->merlin-core==0.7.0+9.g14a18dc) (1.52.0) Requirement already satisfied: absl-py<2.0.0,>=0.9 in /usr/local/lib/python3.8/dist-packages (from tensorflow-metadata>=1.2.0->merlin-core==0.7.0+9.g14a18dc) (1.2.0) Requirement already satisfied: locket in /var/jenkins_home/.local/lib/python3.8/site-packages/locket-0.2.1-py3.8.egg (from partd>=0.3.10->dask>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (0.2.1) Requirement already satisfied: six>=1.5 in /var/jenkins_home/.local/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas<1.4.0dev0,>=1.2.0->merlin-core==0.7.0+9.g14a18dc) (1.15.0) Requirement already satisfied: heapdict in /var/jenkins_home/.local/lib/python3.8/site-packages/HeapDict-1.0.1-py3.8.egg (from zict>=0.1.3->distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (1.0.1) Requirement already satisfied: h2<5,>=3.1.0 in /usr/local/lib/python3.8/dist-packages (from grpclib->betterproto<2.0.0->merlin-core==0.7.0+9.g14a18dc) (4.1.0) Requirement already satisfied: multidict in /usr/local/lib/python3.8/dist-packages (from grpclib->betterproto<2.0.0->merlin-core==0.7.0+9.g14a18dc) (6.0.2) Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.8/dist-packages (from jinja2->distributed>=2022.3.0->merlin-core==0.7.0+9.g14a18dc) (2.1.1) Requirement already satisfied: hpack<5,>=4.0 in /usr/local/lib/python3.8/dist-packages (from h2<5,>=3.1.0->grpclib->betterproto<2.0.0->merlin-core==0.7.0+9.g14a18dc) (4.0.0) Requirement already satisfied: hyperframe<7,>=6.0 in /usr/local/lib/python3.8/dist-packages (from h2<5,>=3.1.0->grpclib->betterproto<2.0.0->merlin-core==0.7.0+9.g14a18dc) (6.0.1) Building wheels for collected packages: merlin-core Building wheel for merlin-core (pyproject.toml): started Building wheel for merlin-core (pyproject.toml): finished with status 'done' Created wheel for merlin-core: filename=merlin_core-0.7.0+9.g14a18dc-py3-none-any.whl size=118253 sha256=b1346a6831e7bbadeb97888405cc54228b50abe4775e9a31f43a9c0cb619298e Stored in directory: /tmp/pip-ephem-wheel-cache-ob6nej2l/wheels/c8/38/16/a6968787eafcec5fa772148af8408b089562f71af0752e8e84 Successfully built merlin-core Installing collected packages: merlin-core Attempting uninstall: merlin-core Found existing installation: merlin-core 0.3.0+12.g78ecddd Not uninstalling merlin-core at /var/jenkins_home/.local/lib/python3.8/site-packages, outside environment /var/jenkins_home/workspace/nvtabular_tests/nvtabular/.tox/test-gpu Can't uninstall 'merlin-core'. No files were found to uninstall. Successfully installed merlin-core-0.7.0+9.g14a18dc test-gpu run-test: commands[1] | python -m pytest --cov-report term --cov merlin -rxs tests/unit ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0 cachedir: .tox/test-gpu/.pytest_cache rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-4.0.0 collected 1441 items / 1 skipped |
Click to view CI ResultsGitHub pull request #1692 of commit d8455893cf8198953324236c70df277521b6108a, no merge conflicts. Running as SYSTEM Setting status of d8455893cf8198953324236c70df277521b6108a to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4762/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1692/*:refs/remotes/origin/pr/1692/* # timeout=10 > git rev-parse d8455893cf8198953324236c70df277521b6108a^{commit} # timeout=10 Checking out Revision d8455893cf8198953324236c70df277521b6108a (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f d8455893cf8198953324236c70df277521b6108a # timeout=10 Commit message: "improve test coverage" > git rev-list --no-walk 43381fbced6bcc08be660276c8f72d5b1d9c6029 # timeout=10 [nvtabular_tests] $ /bin/bash /tmp/jenkins7404145749028268135.sh GLOB sdist-make: /var/jenkins_home/workspace/nvtabular_tests/nvtabular/setup.py test-gpu create: /var/jenkins_home/workspace/nvtabular_tests/nvtabular/.tox/test-gpu test-gpu installdeps: pytest, pytest-cov WARNING: Discarding $PYTHONPATH from environment, to override specify PYTHONPATH in 'passenv' in your configuration. test-gpu inst: /var/jenkins_home/workspace/nvtabular_tests/nvtabular/.tox/.tmp/package/1/nvtabular-1.6.0+22.gd8455893c.zip WARNING: Discarding $PYTHONPATH from environment, to override specify PYTHONPATH in 'passenv' in your configuration. test-gpu installed: absl-py==1.2.0,aiohttp==3.8.1,aiosignal==1.2.0,alabaster==0.7.12,anyio==3.6.1,argon2-cffi==21.3.0,argon2-cffi-bindings==21.2.0,astroid==2.5.6,asttokens==2.0.8,astunparse==1.6.3,asv==0.5.1,asvdb==0.4.2,async-timeout==4.0.2,attrs==22.1.0,awscli==1.26.2,Babel==2.10.3,backcall==0.2.0,beautifulsoup4==4.11.1,betterproto==1.2.5,black==22.6.0,bleach==5.0.1,boto3==1.24.75,botocore==1.28.2,Brotli==1.0.9,cachetools==5.2.0,certifi==2019.11.28,cffi==1.15.1,chardet==3.0.4,charset-normalizer==2.1.1,clang==5.0,click==8.1.3,cloudpickle==2.2.0,cmake==3.24.1.1,colorama==0.4.4,contourpy==1.0.5,coverage==6.5.0,cuda-python==11.7.1,cupy-cuda117==10.6.0,cycler==0.11.0,Cython==0.29.32,dask==2022.1.1,dbus-python==1.2.16,debugpy==1.6.3,decorator==5.1.1,defusedxml==0.7.1,dill==0.3.5.1,distlib==0.3.6,distributed==2022.5.1,distro==1.7.0,dm-tree==0.1.6,docker-pycreds==0.4.0,docutils==0.16,emoji==1.7.0,entrypoints==0.4,execnet==1.9.0,executing==1.0.0,faiss==1.7.2,faiss-gpu==1.7.2,fastai==2.7.9,fastapi==0.85.0,fastavro==1.6.1,fastcore==1.5.27,fastdownload==0.0.7,fastjsonschema==2.16.1,fastprogress==1.0.3,fastrlock==0.8,feast==0.19.4,fiddle==0.2.2,filelock==3.8.0,flatbuffers==1.12,fonttools==4.37.3,frozenlist==1.3.1,fsspec==2022.5.0,gast==0.4.0,gevent==21.12.0,geventhttpclient==2.0.2,gitdb==4.0.9,GitPython==3.1.27,google==3.0.0,google-api-core==2.10.1,google-auth==2.11.1,google-auth-oauthlib==0.4.6,google-pasta==0.2.0,googleapis-common-protos==1.52.0,graphviz==0.20.1,greenlet==1.1.3,grpcio==1.41.0,grpcio-channelz==1.49.0,grpcio-reflection==1.48.1,grpclib==0.4.3,h11==0.13.0,h2==4.1.0,h5py==3.7.0,HeapDict==1.0.1,horovod==0.26.1,hpack==4.0.0,httptools==0.5.0,hugectr2onnx==0.0.0,huggingface-hub==0.9.1,hyperframe==6.0.1,idna==2.8,imagesize==1.4.1,implicit==0.6.1,importlib-metadata==4.12.0,importlib-resources==5.9.0,iniconfig==1.1.1,ipykernel==6.15.3,ipython==8.5.0,ipython-genutils==0.2.0,ipywidgets==7.7.0,jedi==0.18.1,Jinja2==3.1.2,jmespath==1.0.1,joblib==1.2.0,json5==0.9.10,jsonschema==4.16.0,jupyter-cache==0.4.3,jupyter-core==4.11.1,jupyter-server==1.18.1,jupyter-server-mathjax==0.2.5,jupyter-sphinx==0.3.2,jupyter_client==7.3.5,jupyterlab==3.4.7,jupyterlab-pygments==0.2.2,jupyterlab-widgets==1.1.0,jupyterlab_server==2.15.1,keras==2.9.0,Keras-Preprocessing==1.1.2,kiwisolver==1.4.4,lazy-object-proxy==1.8.0,libclang==14.0.6,libcst==0.4.7,lightfm==1.16,lightgbm==3.3.2,linkify-it-py==1.0.3,llvmlite==0.39.1,locket==1.0.0,lxml==4.9.1,Markdown==3.4.1,markdown-it-py==1.1.0,MarkupSafe==2.1.1,matplotlib==3.6.0,matplotlib-inline==0.1.6,mdit-py-plugins==0.2.8,merlin-core==0.6.0+1.g5926fcf,merlin-models==0.7.0+11.g280956aa4,merlin-systems==0.5.0+4.g15074ad,mistune==2.0.4,mmh3==3.0.0,mpi4py==3.1.3,msgpack==1.0.4,multidict==6.0.2,mypy-extensions==0.4.3,myst-nb==0.13.2,myst-parser==0.15.2,natsort==8.1.0,nbclassic==0.4.3,nbclient==0.6.8,nbconvert==7.0.0,nbdime==3.1.1,nbformat==5.5.0,nest-asyncio==1.5.5,ninja==1.10.2.3,notebook==6.4.12,notebook-shim==0.1.0,numba==0.56.2,numpy==1.22.4,nvidia-pyindex==1.0.9,-e git+https://github.com/NVIDIA-Merlin/NVTabular.git@d8455893cf8198953324236c70df277521b6108a#egg=nvtabular,nvtx==0.2.5,oauthlib==3.2.1,oldest-supported-numpy==2022.8.16,onnx==1.12.0,onnxruntime==1.11.1,opt-einsum==3.3.0,packaging==21.3,pandas==1.3.5,pandavro==1.5.2,pandocfilters==1.5.0,parso==0.8.3,partd==1.3.0,pathtools==0.1.2,pexpect==4.8.0,pickleshare==0.7.5,Pillow==9.2.0,pkgutil_resolve_name==1.3.10,platformdirs==2.5.2,pluggy==1.0.0,prometheus-client==0.14.1,promise==2.3,prompt-toolkit==3.0.31,proto-plus==1.19.6,protobuf==3.19.5,psutil==5.9.2,ptyprocess==0.7.0,pure-eval==0.2.2,py==1.11.0,pyarrow==7.0.0,pyasn1==0.4.8,pyasn1-modules==0.2.8,pybind11==2.10.0,pycparser==2.21,pydantic==1.10.2,pydot==1.4.2,Pygments==2.13.0,PyGObject==3.36.0,pynvml==11.4.1,pyparsing==3.0.9,pyrsistent==0.18.1,pytest==7.1.3,pytest-cov==4.0.0,pytest-xdist==3.0.2,python-apt==2.0.0+ubuntu0.20.4.8,python-dateutil==2.8.2,python-dotenv==0.21.0,python-rapidjson==1.8,pytz==2022.2.1,PyYAML==5.4.1,pyzmq==24.0.0,regex==2022.9.13,requests==2.22.0,requests-oauthlib==1.3.1,requests-unixsocket==0.2.0,rsa==4.7.2,s3fs==2022.2.0,s3transfer==0.6.0,sacremoses==0.0.53,scikit-build==0.15.0,scikit-learn==1.1.2,scipy==1.9.1,seedir==0.3.0,Send2Trash==1.8.0,sentry-sdk==1.9.8,setproctitle==1.3.2,setuptools-scm==7.0.5,shortuuid==1.0.9,six==1.15.0,sklearn==0.0,smmap==5.0.0,sniffio==1.3.0,snowballstemmer==2.2.0,sortedcontainers==2.4.0,soupsieve==2.3.2.post1,Sphinx==5.3.0,sphinx-multiversion==0.2.4,sphinx-togglebutton==0.3.1,sphinx_external_toc==0.3.0,sphinxcontrib-applehelp==1.0.2,sphinxcontrib-copydirs @ git+https://github.com/mikemckiernan/sphinxcontrib-copydirs.git@bd8c5d79b3f91cf5f1bb0d6995aeca3fe84b670e,sphinxcontrib-devhelp==1.0.2,sphinxcontrib-htmlhelp==2.0.0,sphinxcontrib-jsmath==1.0.1,sphinxcontrib-qthelp==1.0.3,sphinxcontrib-serializinghtml==1.1.5,SQLAlchemy==1.4.36,stack-data==0.5.0,starlette==0.20.4,stringcase==1.2.0,supervisor==4.1.0,tabulate==0.8.10,tblib==1.7.0,tdqm==0.0.1,tenacity==8.0.1,tensorboard==2.9.1,tensorboard-data-server==0.6.1,tensorboard-plugin-wit==1.8.1,tensorflow==2.9.2,tensorflow-estimator==2.9.0,tensorflow-gpu==2.9.2,tensorflow-io-gcs-filesystem==0.27.0,tensorflow-metadata==1.10.0,termcolor==2.0.1,terminado==0.15.0,testbook==0.4.2,threadpoolctl==3.1.0,tinycss2==1.1.1,tokenizers==0.10.3,toml==0.10.2,tomli==2.0.1,toolz==0.12.0,torch==1.12.1+cu113,torchmetrics==0.3.2,tornado==6.2,tox==3.26.0,tqdm==4.64.1,traitlets==5.4.0,transformers==4.12.0,transformers4rec==0.1.12+2.gbcc939255,treelite==2.3.0,treelite-runtime==2.3.0,tritonclient==2.25.0,typing-inspect==0.8.0,typing_extensions==4.3.0,uc-micro-py==1.0.1,urllib3==1.26.12,uvicorn==0.18.3,uvloop==0.17.0,versioneer==0.20,virtualenv==20.16.5,wandb==0.13.3,watchfiles==0.17.0,wcwidth==0.2.5,webencodings==0.5.1,websocket-client==1.4.1,websockets==10.3,Werkzeug==2.2.2,widgetsnbextension==3.6.0,wrapt==1.12.1,xgboost==1.6.2,yarl==1.8.1,zict==2.2.0,zipp==3.8.1,zope.event==4.5.0,zope.interface==5.4.0 test-gpu run-test-pre: PYTHONHASHSEED='3771611534' test-gpu run-test: commands[0] | python -m pip install --upgrade git+https://github.com/NVIDIA-Merlin/core.git Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com Collecting git+https://github.com/NVIDIA-Merlin/core.git Cloning https://github.com/NVIDIA-Merlin/core.git to /tmp/pip-req-build-minx4e2t Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA-Merlin/core.git /tmp/pip-req-build-minx4e2t Resolved https://github.com/NVIDIA-Merlin/core.git to commit 2c621a26a2b1b7ed786c99bd7c2790b9fc675098 Installing build dependencies: started Installing build dependencies: finished with status 'done' Getting requirements to build wheel: started Getting requirements to build wheel: finished with status 'done' Preparing metadata (pyproject.toml): started Preparing metadata (pyproject.toml): finished with status 'done' Requirement already satisfied: distributed>=2022.3.0 in /var/jenkins_home/.local/lib/python3.8/site-packages (from merlin-core==0.8.0+3.g2c621a2) (2022.3.0) Requirement already satisfied: protobuf>=3.0.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.8.0+3.g2c621a2) (3.19.5) Requirement already satisfied: tensorflow-metadata>=1.2.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.8.0+3.g2c621a2) (1.10.0) Requirement already satisfied: pyarrow>=5.0.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.8.0+3.g2c621a2) (7.0.0) Requirement already satisfied: betterproto<2.0.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.8.0+3.g2c621a2) (1.2.5) Requirement already satisfied: pandas<1.4.0dev0,>=1.2.0 in /var/jenkins_home/.local/lib/python3.8/site-packages (from merlin-core==0.8.0+3.g2c621a2) (1.3.5) Requirement already satisfied: tqdm>=4.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.8.0+3.g2c621a2) (4.64.1) Requirement already satisfied: numba>=0.54 in /var/jenkins_home/.local/lib/python3.8/site-packages (from merlin-core==0.8.0+3.g2c621a2) (0.55.1) Requirement already satisfied: fsspec==2022.5.0 in /var/jenkins_home/.local/lib/python3.8/site-packages (from merlin-core==0.8.0+3.g2c621a2) (2022.5.0) Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from merlin-core==0.8.0+3.g2c621a2) (21.3) Requirement already satisfied: dask>=2022.3.0 in /var/jenkins_home/.local/lib/python3.8/site-packages (from merlin-core==0.8.0+3.g2c621a2) (2022.3.0) Requirement already satisfied: stringcase in /usr/local/lib/python3.8/dist-packages (from betterproto<2.0.0->merlin-core==0.8.0+3.g2c621a2) (1.2.0) Requirement already satisfied: grpclib in /usr/local/lib/python3.8/dist-packages (from betterproto<2.0.0->merlin-core==0.8.0+3.g2c621a2) (0.4.3) Requirement already satisfied: cloudpickle>=1.1.1 in /usr/local/lib/python3.8/dist-packages (from dask>=2022.3.0->merlin-core==0.8.0+3.g2c621a2) (2.2.0) Requirement already satisfied: partd>=0.3.10 in /var/jenkins_home/.local/lib/python3.8/site-packages/partd-1.2.0-py3.8.egg (from dask>=2022.3.0->merlin-core==0.8.0+3.g2c621a2) (1.2.0) Requirement already satisfied: toolz>=0.8.2 in /usr/local/lib/python3.8/dist-packages (from dask>=2022.3.0->merlin-core==0.8.0+3.g2c621a2) (0.12.0) Requirement already satisfied: pyyaml>=5.3.1 in /var/jenkins_home/.local/lib/python3.8/site-packages/PyYAML-5.4.1-py3.8-linux-x86_64.egg (from dask>=2022.3.0->merlin-core==0.8.0+3.g2c621a2) (5.4.1) Requirement already satisfied: psutil>=5.0 in /var/jenkins_home/.local/lib/python3.8/site-packages/psutil-5.8.0-py3.8-linux-x86_64.egg (from distributed>=2022.3.0->merlin-core==0.8.0+3.g2c621a2) (5.8.0) Requirement already satisfied: click>=6.6 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core==0.8.0+3.g2c621a2) (8.1.3) Requirement already satisfied: jinja2 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core==0.8.0+3.g2c621a2) (3.1.2) Requirement already satisfied: zict>=0.1.3 in /var/jenkins_home/.local/lib/python3.8/site-packages/zict-2.0.0-py3.8.egg (from distributed>=2022.3.0->merlin-core==0.8.0+3.g2c621a2) (2.0.0) Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /var/jenkins_home/.local/lib/python3.8/site-packages/sortedcontainers-2.4.0-py3.8.egg (from distributed>=2022.3.0->merlin-core==0.8.0+3.g2c621a2) (2.4.0) Requirement already satisfied: tblib>=1.6.0 in /var/jenkins_home/.local/lib/python3.8/site-packages/tblib-1.7.0-py3.8.egg (from distributed>=2022.3.0->merlin-core==0.8.0+3.g2c621a2) (1.7.0) Requirement already satisfied: msgpack>=0.6.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core==0.8.0+3.g2c621a2) (1.0.4) Requirement already satisfied: tornado>=6.0.3 in /var/jenkins_home/.local/lib/python3.8/site-packages/tornado-6.1-py3.8-linux-x86_64.egg (from distributed>=2022.3.0->merlin-core==0.8.0+3.g2c621a2) (6.1) Requirement already satisfied: numpy<1.22,>=1.18 in /var/jenkins_home/.local/lib/python3.8/site-packages (from numba>=0.54->merlin-core==0.8.0+3.g2c621a2) (1.20.3) Requirement already satisfied: llvmlite<0.39,>=0.38.0rc1 in ./.tox/test-gpu/lib/python3.8/site-packages (from numba>=0.54->merlin-core==0.8.0+3.g2c621a2) (0.38.1) Requirement already satisfied: setuptools in ./.tox/test-gpu/lib/python3.8/site-packages (from numba>=0.54->merlin-core==0.8.0+3.g2c621a2) (65.4.1) Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging->merlin-core==0.8.0+3.g2c621a2) (3.0.9) Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->merlin-core==0.8.0+3.g2c621a2) (2022.2.1) Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->merlin-core==0.8.0+3.g2c621a2) (2.8.2) Requirement already satisfied: absl-py<2.0.0,>=0.9 in /usr/local/lib/python3.8/dist-packages (from tensorflow-metadata>=1.2.0->merlin-core==0.8.0+3.g2c621a2) (1.2.0) Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in /usr/local/lib/python3.8/dist-packages (from tensorflow-metadata>=1.2.0->merlin-core==0.8.0+3.g2c621a2) (1.52.0) Requirement already satisfied: locket in /var/jenkins_home/.local/lib/python3.8/site-packages/locket-0.2.1-py3.8.egg (from partd>=0.3.10->dask>=2022.3.0->merlin-core==0.8.0+3.g2c621a2) (0.2.1) Requirement already satisfied: six>=1.5 in /var/jenkins_home/.local/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas<1.4.0dev0,>=1.2.0->merlin-core==0.8.0+3.g2c621a2) (1.15.0) Requirement already satisfied: heapdict in /var/jenkins_home/.local/lib/python3.8/site-packages/HeapDict-1.0.1-py3.8.egg (from zict>=0.1.3->distributed>=2022.3.0->merlin-core==0.8.0+3.g2c621a2) (1.0.1) Requirement already satisfied: multidict in /usr/local/lib/python3.8/dist-packages (from grpclib->betterproto<2.0.0->merlin-core==0.8.0+3.g2c621a2) (6.0.2) Requirement already satisfied: h2<5,>=3.1.0 in /usr/local/lib/python3.8/dist-packages (from grpclib->betterproto<2.0.0->merlin-core==0.8.0+3.g2c621a2) (4.1.0) Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.8/dist-packages (from jinja2->distributed>=2022.3.0->merlin-core==0.8.0+3.g2c621a2) (2.1.1) Requirement already satisfied: hyperframe<7,>=6.0 in /usr/local/lib/python3.8/dist-packages (from h2<5,>=3.1.0->grpclib->betterproto<2.0.0->merlin-core==0.8.0+3.g2c621a2) (6.0.1) Requirement already satisfied: hpack<5,>=4.0 in /usr/local/lib/python3.8/dist-packages (from h2<5,>=3.1.0->grpclib->betterproto<2.0.0->merlin-core==0.8.0+3.g2c621a2) (4.0.0) Building wheels for collected packages: merlin-core Building wheel for merlin-core (pyproject.toml): started Building wheel for merlin-core (pyproject.toml): finished with status 'done' Created wheel for merlin-core: filename=merlin_core-0.8.0+3.g2c621a2-py3-none-any.whl size=118254 sha256=67b6084de7dd068d8ac6efb760b652a9eeb85a64f9521e7f019212533d6a656c Stored in directory: /tmp/pip-ephem-wheel-cache-9xevb5sf/wheels/c8/38/16/a6968787eafcec5fa772148af8408b089562f71af0752e8e84 Successfully built merlin-core Installing collected packages: merlin-core Attempting uninstall: merlin-core Found existing installation: merlin-core 0.3.0+12.g78ecddd Not uninstalling merlin-core at /var/jenkins_home/.local/lib/python3.8/site-packages, outside environment /var/jenkins_home/workspace/nvtabular_tests/nvtabular/.tox/test-gpu Can't uninstall 'merlin-core'. No files were found to uninstall. Successfully installed merlin-core-0.8.0+3.g2c621a2 |
# Warn user if they set num_buckets without setting max_size or | ||
# freq_threshold - This setting used to hash everything, but will | ||
# now just use multiple indices for OOV encodings at transform time | ||
if num_buckets and not (max_size or freq_threshold): | ||
warnings.warn( | ||
"You are setting num_buckets without using max_size or " | ||
"freq_threshold to restrict the number of distinct " | ||
"categories. Are you sure this is what you want?" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rnyak - This is one detail I am still unsure about. I really don't want to make it easy for users to "hash everything" (what currently happens when the user only sets num_buckets
). However, it is still strange for the user to *only set num_buckets
(now for a slightly-different reason). My issue with the original behavior is that there is really no reason for Catogorify
to be used at all for the case of "hash everything," because we are doing a lot of unnecessary work that can be accomplished by a simple lambda op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rjzamora I am ok to take num_buckets
out of Categorify. So far, we dont know a Merlin user who only wants to use num_buckets
(IIRC). Not sure anyone wants to really hash entire item catalog only with num_buckets.. then how they will reverse back to original ids is a question mark. but I leave the decision to you and @karlhigley. And as you said if this can be easily done with a lambda op, we can just do it that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any strong opinions here; removing it sounds fine to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok to take num_buckets out of Categorify
Just to clarify, I'm not really proposing that we remove the num_buckets
option, but that we slightly change it's meaning to only do what it used to do when max_size
or freq_threshold
was set:
- OLD meaning: We hash everything, UNLESS
max_size
orfreq_threshold
is also set (in which case we usenum_buckets
to be the number of hash-buckets to use for infrequent OOV values at transformation time). - NEW meaning: We always take
num_buckets
to be the number of hash-buckets to use for infrequent OOV values at transformation time. Even if neithermax_size
norfreq_threshold
is set, we will still use a hash+modulo for OOV values at transformation time ifnum_buckets>1
.
Note: When ready, this PR should supersede #1687
Current Algorithm
The current
Categorify
implementation was not designed for extremely large datasets. It was also not designed to handle categorical columns containing too many unique values to fit in a singlecudf.DataFrame
. The problem with large-scale datasets is the fact that thefit
code does not use a proper tree reduction. The problem with high-cardinality columns is the fact that all unique groups must be written to (and read from) disk using a singlecudf.DataFrame
.In both cases, the
fit
stage ofCategorify
is what limits scalability the most. The existingfit
algorithm is as follows:cudf/pandas.DataFrame
results. Each dictionary value corresponds to the local groupby result for a distinct categorical-column + column-hash combination. The number of hash-based "splits" to use for each categorical column depends on thetree_width
argument. In the figure above, there is only one categorical column (for simplicity), andtree_width=3
.tree_width
. (Potential for OOM: Large-scale datasets may require that a significant amount of data be aggregated in each of these tasks - Even with a largetree_width
setting)New Algorithm
This PR modifies the existing
Categorify
algorithm in two important ways: First, we perform a "proper" tree reduction to avoid OOM errors for large-scale datasets. Second, we persist unique values as a directory of parquet files (rather than a single parquet file) for high-cardinality columns. The general algorithm for the newCategorify.fit
is as follows:ddf.to_parquet(..., compute=False)
, and theDelayed
result is used in place of the "Level 2" keys as the "Level 3" task dependency.Important limitations: The new algorithm will currently "fall back" to writing a single Parquet file for "complex"
Categorify
options (e.g. multi-column joint/combo encoding). The number of supported cases can certaibly be improved. However, the primary target for this PR is single-column encoding.