Merge pull request #921 from microsoft/staging
Staging to master
miguelgfierro authored Sep 10, 2019
2 parents 20eb2a4 + c8b5200 commit cadf54f
Showing 18 changed files with 731 additions and 315 deletions.
12 changes: 9 additions & 3 deletions .gitignore
@@ -136,14 +136,20 @@ reco_*.yaml
*.zip
.vscode/
scripts/repo_metrics/config.py
movies.dat
ratings.dat
u.item
u.data
ml-100k/
ml-10M100K/
ml-1m/
ml-20m/
*.jar
*.item
*.pkl
*.txt
*.pdf
.pretrain
*.npy
*.ckpt*
*.png
*.jpg
*.gif
*.jpeg
3 changes: 2 additions & 1 deletion AUTHORS.md
@@ -50,7 +50,8 @@ To contributors: please add your name to the list when you submit a patch to the
* **[Bamdev Mishra](https://github.com/bamdevm)**
* RLRMC algorithm
* **[Beth Zeranski](https://github.com/bethz)**
* AzureML tests
  * DevOps Pipelines used as a control plane to run existing pytest tests on AzureML
* Automation scripts to configure AzureML environment for pipeline use
* **[Dan Ciborowski](https://github.com/dciborow)**
* ALS operationalization notebook
* SAR PySpark improvement
37 changes: 29 additions & 8 deletions README.md
@@ -1,13 +1,14 @@
# Recommenders

This repository provides examples and best practices for building recommendation systems, provided as Jupyter notebooks. The examples detail our learnings on five key tasks:
This repository contains examples and best practices for building recommendation systems, provided as Jupyter notebooks. The examples detail our learnings on five key tasks:
- [Prepare Data](notebooks/01_prepare_data/README.md): Preparing and loading data for each recommender algorithm
- [Model](notebooks/02_model/README.md): Building models using various classical and deep learning recommender algorithms such as Alternating Least Squares ([ALS](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS)) or eXtreme Deep Factorization Machines ([xDeepFM](https://arxiv.org/abs/1803.05170)).
- [Evaluate](notebooks/03_evaluate/README.md): Evaluating algorithms with offline metrics
- [Model Select and Optimize](notebooks/04_model_select_and_optimize): Tuning and optimizing hyperparameters for recommender models
- [Operationalize](notebooks/05_operationalize/README.md): Operationalizing models in a production environment on Azure

Several utilities are provided in [reco_utils](reco_utils) to support common tasks such as loading datasets in the format expected by different algorithms, evaluating model outputs, and splitting training/test data. Implementations of several state-of-the-art algorithms are provided for self-study and customization in your own applications.
Several utilities are provided in [reco_utils](reco_utils) to support common tasks such as loading datasets in the format expected by different algorithms, evaluating model outputs, and splitting training/test data. Implementations of several state-of-the-art algorithms are included for self-study and customization in your own applications. See the [reco_utils documentation](https://readthedocs.org/projects/microsoft-recommenders/).


For a more detailed overview of the repository, please see the documents at the [wiki page](https://github.com/microsoft/recommenders/wiki/Documents-and-Presentations).

@@ -41,10 +42,24 @@ To setup on your local machine:

**NOTE** - The [Alternating Least Squares (ALS)](notebooks/00_quick_start/als_movielens.ipynb) notebooks require a PySpark environment to run. Please follow the steps in the [setup guide](SETUP.md#dependencies-setup) to run these notebooks in a PySpark environment.

## Install this repository via PIP
A [setup.py](reco_utils/setup.py) file is provided in order to simplify the installation of the utilities in this repo from the main directory.
This still requires the conda environment to be installed as described above. Once the necessary dependencies are installed, you can use the following command to install reco_utils as its own Python package.

pip install -e reco_utils

It is also possible to install directly from GitHub, or from a specific branch.

pip install -e git+https://github.com/microsoft/recommenders/#egg=pkg\&subdirectory=reco_utils
pip install -e git+https://github.com/microsoft/recommenders/@staging#egg=pkg\&subdirectory=reco_utils


**NOTE** - The pip installation does not install any of the necessary package dependencies; it is expected that conda will be used, as shown above, to set up the environment for the utilities.
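A quick way to verify the editable install is to import the package from outside the repository root; the sketch below is illustrative and only checks that `reco_utils` resolves and reports where it was loaded from.

```python
# Illustrative check that the editable install of reco_utils succeeded.
# Run from a directory outside the repository so the import resolves via pip, not the cwd.
import importlib

reco_utils = importlib.import_module("reco_utils")
print("reco_utils imported from:", reco_utils.__file__)
```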


## Algorithms

The table below lists recommender algorithms currently available in the repository. Notebooks are linked under the Environment column when different implementations are available.
The table below lists the recommender algorithms currently available in the repository. Notebooks are linked under the Environment column when different implementations are available.

| Algorithm | Environment | Type | Description |
| --- | --- | --- | --- |
@@ -61,12 +76,11 @@ The table below lists recommender algorithms currently available in the repository
| Vowpal Wabbit Family (VW)<sup>*</sup> | [Python CPU (online training)](notebooks/02_model/vowpal_wabbit_deep_dive.ipynb) | Content-Based Filtering | Fast online learning algorithms, great for scenarios where user features / context are constantly changing |
| Wide and Deep | [Python CPU / Python GPU](notebooks/00_quick_start/wide_deep_movielens.ipynb) | Hybrid | Deep learning algorithm that can memorize feature interactions and generalize user features |


**NOTE**: <sup>*</sup> indicates algorithms invented/contributed by Microsoft.

**Preliminary Comparison**

We provide a [benchmark notebook](benchmarks/movielens.ipynb) to illustrate how different algorithms could be evaluated and compared. In this notebook, MovieLens dataset is splitted into training/test sets at a 75/25 ratio using a stratified split. A recommendation model is trained using each of the collaborative filtering algorithms below. We utilize empirical parameter values reported in literature [here](http://mymedialite.net/examples/datasets.html). For ranking metrics we use `k=10` (top 10 recommended items). We run the comparison on a Standard NC6s_v2 [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (6 vCPUs, 112 GB memory and 1 P100 GPU). Spark ALS is run in local standalone mode. In this table we show the results on Movielens 100k, running the algorithms for 15 epochs.
We provide a [benchmark notebook](benchmarks/movielens.ipynb) to illustrate how different algorithms could be evaluated and compared. In this notebook, the MovieLens dataset is split into training/test sets at a 75/25 ratio using a stratified split. A recommendation model is trained using each of the collaborative filtering algorithms below. We utilize empirical parameter values reported in literature [here](http://mymedialite.net/examples/datasets.html). For ranking metrics we use `k=10` (top 10 recommended items). We run the comparison on a Standard NC6s_v2 [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (6 vCPUs, 112 GB memory and 1 P100 GPU). Spark ALS is run in local standalone mode. In this table we show the results on Movielens 100k, running the algorithms for 15 epochs.

| Algo | MAP | nDCG@k | Precision@k | Recall@k | RMSE | MAE | R<sup>2</sup> | Explained Variance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
@@ -77,9 +91,6 @@ We provide a [benchmark notebook](benchmarks/movielens.ipynb) to illustrate how
| [FastAI](notebooks/00_quick_start/fastai_movielens.ipynb) | 0.025503 | 0.147866 | 0.130329 | 0.053824 | 0.943084 | 0.744337 | 0.285308 | 0.287671 |
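
To make the ranking metrics in the table concrete, the following self-contained sketch computes Precision@k and Recall@k for a single user; the item IDs are made up for illustration, and the benchmark notebook itself relies on the evaluation utilities in [reco_utils](reco_utils).

```python
# Illustrative computation of Precision@k and Recall@k for one user.
# Item IDs are hypothetical and only demonstrate the formulas behind the ranking metrics.
k = 10
relevant = {"item_3", "item_7", "item_11", "item_42"}              # items the user interacted with in the test set
recommended = ["item_7", "item_1", "item_3", "item_9", "item_5",
               "item_11", "item_2", "item_8", "item_4", "item_6"]  # top-k ranked recommendations

hits = len(set(recommended[:k]) & relevant)
precision_at_k = hits / k            # share of recommended items that are relevant -> 0.30
recall_at_k = hits / len(relevant)   # share of relevant items that were recommended -> 0.75

print(f"Precision@{k} = {precision_at_k:.2f}, Recall@{k} = {recall_at_k:.2f}")
```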





## Contributing
This project welcomes contributions and suggestions. Before contributing, please see our [contribution guidelines](CONTRIBUTING.md).

@@ -95,5 +106,15 @@ This project welcomes contributions and suggestions. Before contributing, please
| **Windows GPU** | master | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_gpu_win?branchName=master)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=6756) | | staging | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_gpu_staging_win?branchName=staging)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=6761) |
| **Windows Spark** | master | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_spark_win?branchName=master)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=6757) | | staging | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_spark_staging_win?branchName=staging)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=6754) |

## AzureML Build Status

These DevOps pipelines run the existing tests on AzureML.

| Build Type | Branch | Status | | Branch | Status |
| --- | --- | --- | --- | --- | --- |
| **nightly_cpu_tests** | master | [![Build Status](https://dev.azure.com/best-practices/recommenders/_apis/build/status/nightly_cpu_tests?branchName=master)](https://dev.azure.com/best-practices/recommenders/_build/latest?definitionId=25&branchName=master) | | Staging | [![Build Status](https://dev.azure.com/best-practices/recommenders/_apis/build/status/nightly_cpu_tests?branchName=staging)](https://dev.azure.com/best-practices/recommenders/_build/latest?definitionId=25&branchName=staging) |
| **nightly_gpu_tests** | master | [![Build Status](https://dev.azure.com/best-practices/recommenders/_apis/build/status/bp-nightly_gpu_tests?branchName=master)](https://dev.azure.com/best-practices/recommenders/_build/latest?definitionId=5&branchName=master) | | Staging | [![Build Status](https://dev.azure.com/best-practices/recommenders/_apis/build/status/bp-nightly_gpu_tests?branchName=staging)](https://dev.azure.com/best-practices/recommenders/_build/latest?definitionId=5&branchName=staging) |


**NOTE** - These tests are the nightly builds, which run the smoke and integration tests. Master is our main branch and staging is our development branch. We use `pytest` for testing the Python utilities in [reco_utils](reco_utils) and `papermill` for the [notebooks](notebooks). For more information about the testing pipelines, please see the [test documentation](tests/README.md).
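
As a rough illustration of the notebook testing approach, a notebook can be executed programmatically with papermill as sketched below; the notebook path and the injected parameter are placeholders rather than the actual pipeline configuration.

```python
# Hedged sketch: run a notebook end-to-end with papermill, injecting a parameter.
# The notebook path and MOVIELENS_DATA_SIZE value are illustrative only.
import papermill as pm

pm.execute_notebook(
    "notebooks/00_quick_start/sar_movielens.ipynb",  # input notebook (example path)
    "output.ipynb",                                  # executed copy, with cell outputs saved
    parameters={"MOVIELENS_DATA_SIZE": "100k"},      # overrides values in the notebook's parameters cell
)
```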

35 changes: 35 additions & 0 deletions SECURITY.md
@@ -0,0 +1,35 @@
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.1 BLOCK -->

## Security

Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [many more](https://opensource.microsoft.com/).

If you believe you have found a security vulnerability in any Microsoft-owned repository that meets Microsoft's [definition](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)) of a security vulnerability, please report it to us as described below.

## Reporting Security Issues

**Please do not report security vulnerabilities through public GitHub issues.** Instead, please report them to the Microsoft Security Response Center at [[email protected]](mailto:[email protected]). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://technet.microsoft.com/en-us/security/dn606155).

You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).

Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:

* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue

This information will help us triage your report more quickly.

## Preferred Languages

We prefer all communications to be in English.

## Policy

Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).

<!-- END MICROSOFT SECURITY.MD BLOCK -->
2 changes: 1 addition & 1 deletion docker/Dockerfile
@@ -88,7 +88,7 @@ FROM gpu AS full
ARG HOME
WORKDIR ${HOME}

COPY --from=pyspark spark spark
COPY --from=pyspark ${HOME}/spark spark

# Setup Conda environment
RUN python recommenders/scripts/generate_conda_file.py --name base --gpu --pyspark
10 changes: 6 additions & 4 deletions notebooks/01_prepare_data/README.md
@@ -6,14 +6,16 @@ data preparation tasks witnessed in recommendation system development.

| Notebook | Description |
| --- | --- |
| [data_split](data_split.ipynb) | Details on splitting data (randomly, chronologically, etc).
| [data_transform](data_transform.ipynb) | Guidance on how to transform (implicit / explicit) data for building collaborative filtering typed recommender.
| [data_split](data_split.ipynb) | Details on splitting data (randomly, chronologically, etc). |
| [data_transform](data_transform.ipynb) | Guidance on how to transform (implicit / explicit) data for building collaborative filtering typed recommender. |
| [wikidata knowledge graph](wikidata_KG.ipynb) | Details on how to create a knowledge graph using Wikidata |

### Data split

** Data split
Three methods of splitting the data for training and testing are demonstrated in this notebook. Each supports both Spark and pandas DataFrames.
1. Random Split: the simplest way to split the data; it randomly assigns entries to either the train set or the test set based on the desired allocation ratio.
2. Chronological Split: in many cases, accounting for temporal variation when evaluating your model provides a more realistic measure of performance. This approach splits the train and test sets based on timestamps, by user or by item.
3. Stratified Split: it may be preferable to ensure that the same set of users or items appears in both the train and test sets; this method of splitting ensures that is the case.
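
For example, a random 75/25 split of a small pandas DataFrame might look like the sketch below; the import path and function name are assumed from the repository layout and should be checked against [data_split](data_split.ipynb).

```python
# Minimal sketch of a random 75/25 split; import path assumed, see data_split.ipynb.
import pandas as pd
from reco_utils.dataset.python_splitters import python_random_split

data = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3, 3, 4, 4],
    "itemID": [10, 11, 10, 12, 11, 13, 12, 13],
    "rating": [4.0, 3.5, 5.0, 2.0, 3.0, 4.5, 1.0, 4.0],
})

train, test = python_random_split(data, ratio=0.75, seed=42)
print(len(train), "training rows,", len(test), "test rows")
```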

** Data transform
### Data transform
Data transformation techniques which are commonly used in various recommendation scenarios are introduced and reviewed.
