Merge pull request #2129 from recommenders-team/staging

Staging to main: Fix to NewsRec, LightFM to extras, issue with scipy
recommenders-team · Jul 10, 2024 · d333a0d
2 parents 2f1d8ea + 3672c2e
Showing 15 changed files with 59 additions and 22 deletions.
diff --git a/.github/ISSUE_TEMPLATE.md b/.github/ISSUE_TEMPLATE.md
@@ -23,4 +23,11 @@
 <!--- * The tests for SAR PySpark should pass successfully. -->
 
 
+### Willingness to contribute
+<!--- Go over all the following points, and put an `x` in the box that apply. -->
+- [ ] Yes, I can contribute for this issue independently.
+- [ ] Yes, I can contribute for this issue with guidance from Recommenders community.
+- [ ] No, I cannot contribute at this time.
+
+
 ### Other Comments
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -28,4 +28,10 @@ assignees: ''
 <!--- For example:  -->
 <!--- * The tests for SAR PySpark should pass successfully. -->
 
+### Willingness to contribute
+<!--- Go over all the following points, and put an `x` in the box that apply. -->
+- [ ] Yes, I can contribute for this issue independently.
+- [ ] Yes, I can contribute for this issue with guidance from Recommenders community.
+- [ ] No, I cannot contribute at this time.
+
 ### Other Comments
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
@@ -14,4 +14,10 @@ assignees: ''
 <!--- For example:  -->
 <!--- *Adding algorithm xxx will help people understand more about xxx use case scenarios. -->
 
+### Willingness to contribute
+<!--- Go over all the following points, and put an `x` in the box that apply. -->
+- [ ] Yes, I can contribute for this issue independently.
+- [ ] Yes, I can contribute for this issue with guidance from Recommenders community.
+- [ ] No, I cannot contribute at this time.
+
 ### Other Comments
diff --git a/.github/ISSUE_TEMPLATE/general-ask.md b/.github/ISSUE_TEMPLATE/general-ask.md
@@ -10,4 +10,10 @@ assignees: ''
 ### Description
 <!--- Describe your general ask in detail -->
 
+### Willingness to contribute
+<!--- Go over all the following points, and put an `x` in the box that apply. -->
+- [ ] Yes, I can contribute for this issue independently.
+- [ ] Yes, I can contribute for this issue with guidance from Recommenders community.
+- [ ] No, I cannot contribute at this time.
+
 ### Other Comments
diff --git a/README.md b/README.md
@@ -94,7 +94,7 @@ The table below lists the recommendation algorithms currently available in the r
 | LightFM/Factorization Machine | Collaborative Filtering | Factorization Machine algorithm for both implicit and explicit feedbacks. It works in the CPU environment. | [Quick start](examples/02_model_collaborative_filtering/lightfm_deep_dive.ipynb) |
 | LightGBM/Gradient Boosting Tree<sup>*</sup> | Content-Based Filtering | Gradient Boosting Tree algorithm for fast training and low memory usage in content-based problems. It works in the CPU/GPU/PySpark environments. | [Quick start in CPU](examples/00_quick_start/lightgbm_tinycriteo.ipynb) / [Deep dive in PySpark](examples/02_model_content_based_filtering/mmlspark_lightgbm_criteo.ipynb) |
 | LightGCN | Collaborative Filtering | Deep learning algorithm which simplifies the design of GCN for predicting implicit feedback. It works in the CPU/GPU environment. | [Deep dive](examples/02_model_collaborative_filtering/lightgcn_deep_dive.ipynb) |
-| GeoIMC<sup>*</sup> | Collaborative Filtering | Matrix completion algorithm that has into account user and item features using Riemannian conjugate gradients optimization and following a geometric approach. It works in the CPU environment. | [Quick start](examples/00_quick_start/geoimc_movielens.ipynb) |
+| GeoIMC<sup>*</sup> | Collaborative Filtering | Matrix completion algorithm that takes into account user and item features using Riemannian conjugate gradient optimization and follows a geometric approach. It works in the CPU environment. | [Quick start](examples/00_quick_start/geoimc_movielens.ipynb) |
 | GRU | Collaborative Filtering | Sequential-based algorithm that aims to capture both long and short-term user preferences using recurrent neural networks. It works in the CPU/GPU environment. | [Quick start](examples/00_quick_start/sequential_recsys_amazondataset.ipynb) |
 | Multinomial VAE | Collaborative Filtering | Generative model for predicting user/item interactions. It works in the CPU/GPU environment. | [Deep dive](examples/02_model_collaborative_filtering/multi_vae_deep_dive.ipynb) |
 | Neural Recommendation with Long- and Short-term User Representations (LSTUR)<sup>*</sup> | Content-Based Filtering | Neural recommendation algorithm for recommending news articles with long- and short-term user interest modeling. It works in the CPU/GPU environment. | [Quick start](examples/00_quick_start/lstur_MIND.ipynb) |

diff --git a/examples/02_model_collaborative_filtering/lightfm_deep_dive.ipynb b/examples/02_model_collaborative_filtering/lightfm_deep_dive.ipynb
@@ -22,6 +22,8 @@
             "source": [
                 "This notebook explains the concept of a Factorization Machine based model for recommendation, it also outlines the steps to construct a pure matrix factorization and a Factorization Machine using the [LightFM](https://github.com/lyst/lightfm) package. It also demonstrates how to extract both user and item affinity from a fitted model.\n",
                 "\n",
+                "*NOTE: LightFM is not available in the core package of Recommenders, to run this notebook, install the experimental package with `pip install recommenders[experimental]`.*\n",
+                "\n",
                 "## 1. Factorization Machine model\n",
                 "\n",
                 "### 1.1 Background\n",

diff --git a/pyproject.toml b/pyproject.toml
@@ -2,12 +2,12 @@
 requires = [
     "setuptools>=52",
     "wheel>=0.36",
-    "numpy>=1.15",
+    "numpy>=1.15,<2",
 ]
 dependencies = [
     "setuptools>=52",
     "wheel>=0.36",
-    "numpy>=1.15",
+    "numpy>=1.15,<2",
 ]
 build-backend = "setuptools.build_meta"
 

diff --git a/recommenders/datasets/movielens.py b/recommenders/datasets/movielens.py
@@ -582,7 +582,7 @@ def unique_columns(df, *, columns):
     return not df[columns].duplicated().any()
 
 
-class MockMovielensSchema(pa.SchemaModel):
+class MockMovielensSchema(pa.DataFrameModel):
     """
     Mock dataset schema to generate fake data for testing purpose.
     This schema is configured to mimic the Movielens dataset

diff --git a/recommenders/datasets/pandas_df_utils.py b/recommenders/datasets/pandas_df_utils.py
@@ -163,7 +163,7 @@ def fit(self, df, col_rating=DEFAULT_RATING_COL):
         types = df.dtypes
         if not all(
             [
-                x == object or np.issubdtype(x, np.integer) or x == np.float
+                x == object or np.issubdtype(x, np.integer) or x == float
                 for x in types
             ]
         ):
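
Note on the change above: the `np.float` alias was deprecated in NumPy 1.20 and removed in 1.24, so the built-in `float` is the drop-in replacement, and a NumPy `float64` dtype compares equal to it. A minimal sketch of the same dtype check as a standalone function follows; the name `has_supported_dtypes` and the sample frames are hypothetical, not part of the library.

```python
import numpy as np
import pandas as pd


def has_supported_dtypes(df):
    """Return True if every column is object, integer, or float64 typed."""
    return all(
        # `x == float` is True for a float64 dtype; it replaces `x == np.float`.
        x == object or np.issubdtype(x, np.integer) or x == float
        for x in df.dtypes
    )


# Hypothetical sample data: one object, one integer, one float64 column.
good = pd.DataFrame(
    {
        "user": pd.Series(["a", "b"], dtype=object),
        "item": [1, 2],
        "rating": [4.0, 3.5],
    }
)
# A boolean column is none of the accepted types.
bad = pd.DataFrame({"flag": [True, False]})

print(has_supported_dtypes(good), has_supported_dtypes(bad))
```

Only `float64` matches the built-in `float` here; a `float32` column would not pass this check, just as it would not have matched `np.float`.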

diff --git a/recommenders/evaluation/python_evaluation.py b/recommenders/evaluation/python_evaluation.py
@@ -435,9 +435,9 @@ def merge_ranking_true_pred(
 
     # count the number of hits vs actual relevant items per user
     df_hit_count = pd.merge(
-        df_hit.groupby(col_user, as_index=False)[col_user].agg({"hit": "count"}),
+        df_hit.groupby(col_user, as_index=False)[col_user].agg(hit="count"),
         rating_true_common.groupby(col_user, as_index=False)[col_user].agg(
-            {"actual": "count"}
+            actual="count",
         ),
         on=col_user,
     )
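
Note on the change above: `agg(hit="count")` uses pandas named aggregation, where the keyword becomes the output column name; the dict-based renaming it replaces is no longer supported by recent pandas releases. A minimal sketch with made-up data follows. It uses the more explicit DataFrame-level spelling `name=(column, func)` rather than the exact Series-level call in the diff, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical hits table: one row per (user, item) hit.
hits = pd.DataFrame({"userID": [1, 1, 2], "itemID": [10, 11, 10]})

# Named aggregation: the keyword `hit` names the output column,
# the tuple says which column to aggregate and how.
hit_count = hits.groupby("userID", as_index=False).agg(hit=("itemID", "count"))

print(hit_count)
```

When the counted column has no missing values, this gives the same per-user counts as counting the user column itself.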

diff --git a/recommenders/models/deeprec/DataModel/ImplicitCF.py b/recommenders/models/deeprec/DataModel/ImplicitCF.py
@@ -80,6 +80,7 @@ def _data_processing(self, train, test):
             user_idx = df[[self.col_user]].drop_duplicates().reindex()
             user_idx[self.col_user + "_idx"] = np.arange(len(user_idx))
             self.n_users = len(user_idx)
+            self.n_users_in_train = train[self.col_user].nunique()
             self.user_idx = user_idx
 
             self.user2id = dict(
@@ -210,7 +211,7 @@ def sample_neg(x):
                 if neg_id not in x:
                     return neg_id
 
-        indices = range(self.n_users)
+        indices = range(self.n_users_in_train)
         if self.n_users < batch_size:
             users = [random.choice(indices) for _ in range(batch_size)]
         else:
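
Note on the change above: the fix draws users for negative sampling only from those that appear in the training set, so a user seen only in the test split is never picked. A minimal standalone sketch of that sampling step follows. It assumes training users occupy indices `0..n_users_in_train-1`, and it compares the batch size against `n_users_in_train` rather than `n_users` as the diff does; the function name `sample_users` is hypothetical.

```python
import random


def sample_users(n_users_in_train, batch_size):
    """Pick `batch_size` user indices from the training users only."""
    indices = range(n_users_in_train)
    if n_users_in_train < batch_size:
        # Not enough distinct users: sample with replacement.
        return [random.choice(indices) for _ in range(batch_size)]
    # Enough users: sample without replacement.
    return random.sample(indices, batch_size)


small = sample_users(n_users_in_train=3, batch_size=10)
large = sample_users(n_users_in_train=100, batch_size=10)
```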

diff --git a/recommenders/models/newsrec/models/base_model.py b/recommenders/models/newsrec/models/base_model.py
@@ -186,6 +186,8 @@ def fit(
         valid_behaviors_file,
         test_news_file=None,
         test_behaviors_file=None,
+        step_limit=None,
+
     ):
         """Fit the model with train_file. Evaluate the model on valid_file per epoch to observe the training status.
         If test_news_file is not None, evaluate it too.
@@ -212,6 +214,8 @@ def fit(
             )
 
             for batch_data_input in tqdm_util:
+                if step_limit is not None and step>=step_limit:
+                    break
 
                 step_result = self.train(batch_data_input)
                 step_data_loss = step_result
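
Note on the change above: `step_limit` caps the number of training steps, which keeps smoke tests short. A minimal sketch of the pattern outside the model class follows; `run_epoch` and `train_step` are hypothetical names, and it assumes the step counter is incremented once per processed batch.

```python
def run_epoch(batches, train_step, step_limit=None):
    """Train on `batches`, stopping early once `step_limit` steps have run."""
    step = 0
    total_loss = 0.0
    for batch in batches:
        # Check the limit before doing any work on this batch.
        if step_limit is not None and step >= step_limit:
            break
        total_loss += train_step(batch)
        step += 1
    return step, total_loss


# With a limit, only the first 10 of 100 batches are processed.
limited = run_epoch(range(100), lambda batch: 1.0, step_limit=10)
# Without a limit, every batch is processed.
unlimited = run_epoch(range(100), lambda batch: 1.0)
```

Leaving `step_limit` as `None` preserves the original behaviour, so existing callers are unaffected.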

diff --git a/setup.py b/setup.py
@@ -28,22 +28,21 @@
 
 install_requires = [
     "category-encoders>=2.6.0,<3",  # requires packaging
-    "cornac>=1.15.2,<2",  # requires packaging, tqdm
+    "cornac>=1.15.2,<3",  # requires packaging, tqdm
     "hyperopt>=0.2.7,<1",
-    "lightfm>=1.17,<2",  # requires requests
     "lightgbm>=4.0.0,<5",
     "locust>=2.12.2,<3",  # requires jinja2
     "memory-profiler>=0.61.0,<1",
     "nltk>=3.8.1,<4",  # requires tqdm
-    "notebook>=7.0.0,<8",  # requires ipykernel, jinja2, jupyter, nbconvert, nbformat, packaging, requests
+    "notebook>=6.5.5,<8",  # requires ipykernel, jinja2, jupyter, nbconvert, nbformat, packaging, requests
     "numba>=0.57.0,<1",
     "pandas>2.0.0,<3.0.0",  # requires numpy
     "pandera[strategies]>=0.6.5,<0.18;python_version<='3.8'",  # For generating fake datasets
     "pandera[strategies]>=0.15.0;python_version>='3.9'",
     "retrying>=1.3.4,<2",
     "scikit-learn>=1.2.0,<2",  # requires scipy, and introduce breaking change affects feature_extraction.text.TfidfVectorizer.min_df
     "scikit-surprise>=1.1.3",
-    "scipy>=1.10.1",
+    "scipy>=1.10.1,<=1.13.1",  # FIXME: Remove scipy<=1.13.1 once cornac release a version newer than 2.2.1.  See #2128
     "seaborn>=0.13.0,<1",  # requires matplotlib, packaging
     "transformers>=4.27.0,<5",  # requires packaging, pyyaml, requests, tqdm
 ]
@@ -80,6 +79,7 @@
     # nni needs to be upgraded
     "nni==1.5",
     "pymanopt>=0.2.5",
+    "lightfm>=1.17,<2",
 ]
 
 # The following dependency can be installed as below, however PyPI does not allow direct URLs.
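
Note on the change above: with `lightfm` moved to the experimental extras, code that relies on it may want to check for the package before importing. A minimal sketch using only the standard library follows; the helper name `has_optional_dependency` is hypothetical and is not something the repository provides.

```python
import importlib.util


def has_optional_dependency(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not has_optional_dependency("lightfm"):
    # The install command is the one given in the notebook note in this commit.
    print("lightfm not found; install with: pip install recommenders[experimental]")
```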

diff --git a/tests/ci/azureml_tests/test_groups.py b/tests/ci/azureml_tests/test_groups.py
@@ -47,8 +47,6 @@
         "tests/functional/examples/test_notebooks_python.py::test_geoimc_functional",  # 1006.19s
         #
         "tests/functional/examples/test_notebooks_python.py::test_benchmark_movielens_cpu",  # 58s
-        #
-        "tests/functional/examples/test_notebooks_python.py::test_lightfm_functional",
     ],
     "group_cpu_003": [  # Total group time: 2253s
         "tests/data_validation/recommenders/datasets/test_criteo.py::test_download_criteo_sample",  # 1.05s
@@ -237,10 +235,6 @@
         "tests/unit/recommenders/models/test_geoimc.py::test_imcproblem",
         "tests/unit/recommenders/models/test_geoimc.py::test_inferer_init",
         "tests/unit/recommenders/models/test_geoimc.py::test_inferer_infer",
-        "tests/unit/recommenders/models/test_lightfm_utils.py::test_interactions",
-        "tests/unit/recommenders/models/test_lightfm_utils.py::test_fitting",
-        "tests/unit/recommenders/models/test_lightfm_utils.py::test_sim_users",
-        "tests/unit/recommenders/models/test_lightfm_utils.py::test_sim_items",
         "tests/unit/recommenders/models/test_sar_singlenode.py::test_init",
         "tests/unit/recommenders/models/test_sar_singlenode.py::test_fit",
         "tests/unit/recommenders/models/test_sar_singlenode.py::test_predict",
@@ -453,3 +447,14 @@
         "tests/unit/examples/test_notebooks_gpu.py::test_gpu_vm",
     ],
 }
+
+# Experimental are additional test groups that require to install extra dependencies: pip install .[experimental]
+experimental_test_groups = {
+    "group_cpu_001": [
+        "tests/unit/recommenders/models/test_lightfm_utils.py::test_interactions",
+        "tests/unit/recommenders/models/test_lightfm_utils.py::test_fitting",
+        "tests/unit/recommenders/models/test_lightfm_utils.py::test_sim_users",
+        "tests/unit/recommenders/models/test_lightfm_utils.py::test_sim_items",
+        "tests/functional/examples/test_notebooks_python.py::test_lightfm_functional",
+    ]
+}
diff --git a/tests/smoke/recommenders/recommender/test_newsrec_model.py b/tests/smoke/recommenders/recommender/test_newsrec_model.py
@@ -62,7 +62,7 @@ def test_model_nrms(mind_resource_path):
     assert model.run_eval(valid_news_file, valid_behaviors_file) is not None
     assert isinstance(
         model.fit(
-            train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file
+            train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file,step_limit=10
         ),
         BaseModel,
     )
@@ -115,7 +115,7 @@ def test_model_naml(mind_resource_path):
     assert model.run_eval(valid_news_file, valid_behaviors_file) is not None
     assert isinstance(
         model.fit(
-            train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file
+            train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file,step_limit=10
         ),
         BaseModel,
     )
@@ -166,7 +166,7 @@ def test_model_lstur(mind_resource_path):
     assert model.run_eval(valid_news_file, valid_behaviors_file) is not None
     assert isinstance(
         model.fit(
-            train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file
+            train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file,step_limit=10
         ),
         BaseModel,
     )
@@ -217,7 +217,7 @@ def test_model_npa(mind_resource_path):
     assert model.run_eval(valid_news_file, valid_behaviors_file) is not None
     assert isinstance(
         model.fit(
-            train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file
+            train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file,step_limit=10
         ),
         BaseModel,
     )