
Issue with SliRec and DKN OOM #2076

Closed
miguelgfierro wants to merge 12 commits into staging from miguel/nightly_oom

Conversation

miguelgfierro (Collaborator) commented on Mar 25, 2024

Description

Fix #2063

I ran both tests on a local V100 and they pass. In the nightly tests I get: https://github.com/recommenders-team/recommenders/actions/workflows/azureml-gpu-nightly.yml?query=branch%3Amiguel%2Fnightly_oom

The DKN test already runs with 5 epochs and a batch size of 200. I don't know whether the two tests conflict because TF takes the whole GPU memory by default.

This shouldn't happen, because in the BaseModel class we set gpu_options = tf.compat.v1.GPUOptions(allow_growth=True). However, when I run the code locally, I still see TF taking all the memory.
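
For context, a minimal sketch (not the repository's exact code) of the two usual ways to enable TensorFlow memory growth; the BaseModel line quoted above corresponds to the TF1-compat path. Note that neither setting returns memory the process has already allocated:

    import tensorflow as tf

    USE_TF1_COMPAT_SESSION = True  # pick one of the two approaches below

    if USE_TF1_COMPAT_SESSION:
        # TF1-compat path, matching the GPUOptions line quoted above: let TF
        # grow its GPU allocation on demand instead of reserving the device.
        gpu_options = tf.compat.v1.GPUOptions(allow_growth=True)
        sess = tf.compat.v1.Session(
            config=tf.compat.v1.ConfigProto(gpu_options=gpu_options)
        )
    else:
        # TF2 equivalent; must run before any GPU has been initialized.
        for gpu in tf.config.list_physical_devices("GPU"):
            tf.config.experimental.set_memory_growth(gpu, True)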

Related Issues

References

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • I have signed the commits, e.g. git commit -s -m "your commit message".
  • This PR is being made to staging branch AND NOT TO main branch.

Signed-off-by: miguelgfierro <[email protected]>
miguelgfierro (Collaborator, Author) commented on Mar 25, 2024

If I comment out all tests except SLiRec, it works and it is quick:

INFO:submit_groupwise_azureml_pytest.py:Executing tests now...
============================= test session starts ==============================
platform linux -- Python 3.10.13, pytest-8.1.1, pluggy-1.4.0
rootdir: /mnt/azureml/cr/j/a2150f1522ab4d47941d5730dcfc5eb8/exe/wd
configfile: pyproject.toml
plugins: mock-3.14.0, hypothesis-6.99.13, typeguard-4.2.1, cov-5.0.0, anyio-4.3.0
collected 2 items

tests/unit/examples/test_notebooks_gpu.py .                              [ 50%]
tests/functional/examples/test_notebooks_gpu.py 

============================== slowest durations ===============================
343.83s call     tests/functional/examples/test_notebooks_gpu.py::test_slirec_quickstart_functional[recommenders/models/deeprec/config/sli_rec.yaml-tests/resources/deeprec/slirec-5-300-expected_values0-42]
1.36s call     tests/unit/examples/test_notebooks_gpu.py::test_gpu_vm

(4 durations < 0.005s hidden.  Use -vv to show these durations.)
================== 2 passed, 12 warnings in 348.30s (0:05:48) ==================
INFO:submit_groupwise_azureml_pytest.py:Test execution completed!
Cleaning up all outstanding Run operations, waiting 300.0 seconds
1 items cleaning up...
Cleanup took 5.421173095703125 seconds

See https://github.com/recommenders-team/recommenders/actions/runs/8422556385/job/23062115526

It works with:

        "tests/functional/examples/test_notebooks_gpu.py::test_dkn_quickstart_functional", 
        "tests/functional/examples/test_notebooks_gpu.py::test_slirec_quickstart_functional",

https://github.com/recommenders-team/recommenders/actions/runs/8424096579

If I run lightgcn, SLiRec, and DKN together (4a484e8), I get an OOM: https://github.com/recommenders-team/recommenders/actions/runs/8427004517/job/23076600230#step:3:8023

miguelgfierro (Collaborator, Author) commented:

@SimonYansenZhao feel free to continue in this branch #2076

The problem I'm seeing is that, for some reason, each test keeps its GPU memory, so the next test gets an OOM. I tried tf.keras.backend.clear_session() but it didn't work. Also, in the BaseModel class we have gpu_options = tf.compat.v1.GPUOptions(allow_growth=True); however, when I run the code locally, I see TF taking all the memory.
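
One possible workaround, sketched here only as an assumption and not what this PR implements, is to isolate each GPU test in its own process so the CUDA context (and the memory TF holds) is released when that process exits. The test IDs are the ones listed above:

    import subprocess
    import sys

    # Run each GPU notebook test in a separate pytest process so TensorFlow's
    # CUDA context, and the GPU memory it holds, is freed when the process
    # exits instead of accumulating across tests in a single pytest run.
    GPU_TESTS = [
        "tests/functional/examples/test_notebooks_gpu.py::test_dkn_quickstart_functional",
        "tests/functional/examples/test_notebooks_gpu.py::test_slirec_quickstart_functional",
    ]

    exit_code = 0
    for test_id in GPU_TESTS:
        result = subprocess.run([sys.executable, "-m", "pytest", test_id])
        exit_code = exit_code or result.returncode

    sys.exit(exit_code)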

SimonYansenZhao (Collaborator) commented:

> @SimonYansenZhao feel free to continue in this branch #2076
>
> The problem I'm seeing is that, for some reason, each test keeps its GPU memory, so the next test gets an OOM. I tried tf.keras.backend.clear_session() but it didn't work. Also, in the BaseModel class we have gpu_options = tf.compat.v1.GPUOptions(allow_growth=True); however, when I run the code locally, I see TF taking all the memory.

Yeah, I ran each test individually today and all passed. I'll try running the three tests together.

SimonYansenZhao (Collaborator) commented:

@miguelgfierro all tests, including the nightly builds, passed after #2077 was merged. I'm not sure what's going on with the OOM issue, but I think we can merge staging into main now.

miguelgfierro (Collaborator, Author) commented:

@SimonYansenZhao I'm freaking out.

miguelgfierro (Collaborator, Author) commented:

Let me run the nightly builds again.

miguelgfierro deleted the miguel/nightly_oom branch on April 5, 2024, at 08:02.