diff --git a/docs/NET-API.md b/docs/NET-API.md
deleted file mode 100644
index ea8c6dfc..00000000
--- a/docs/NET-API.md
+++ /dev/null
@@ -1,89 +0,0 @@
-The .Net API allows you connect to OpenML from .Net applications.
-
-## Download
-
-Stable releases of the .Net API are available via [NuGet](https://www.nuget.org/packages/openMl). Use the NuGet package explorer in the Visual Studia, write “Install-Package openMl” to the NuGet package manager console or download the whole package from the NuGet website and add it into your project. Or, you can check out the developer version from [GitHub](https://github.com/openml/dotnet).
-
-### Quick Start
-
-Create an `OpenmlConnector` instance with your api key. You can find this key in your account settings. This will create a client with OpenML functionalities, The functionalities mirror the OpenMlApi and not all of them are (yet) implemented. If you need some feature, don’t hesitate contact us via our Git page.
-
-
-
-
-`var connector = new OpenMlConnector("YOURAPIKEY");`
-
-
-
-All OpenMlConnector methods are documented via the usual .Net comments.
-
-#### Get dataset description
-
-
diff --git a/docs/REST-API.md b/docs/REST-API.md
deleted file mode 100644
index 038f8ecf..00000000
--- a/docs/REST-API.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# REST APIs
-
-The REST API allows you to talk directly to the OpenML server from any programming environment.
-
-The REST API has two parts (with different endpoints):
-
-### The Main REST API
-Has all main functions to download OpenML data or share your own.
-API Documentation
-
-### The File API
-Serves datasets and other files stored on OpenML servers.
-File API Documentation
diff --git a/docs/automl/AutoML-Benchmark.md b/docs/automl/AutoML-Benchmark.md
new file mode 100644
index 00000000..a8cd7ed7
--- /dev/null
+++ b/docs/automl/AutoML-Benchmark.md
@@ -0,0 +1,86 @@
+---
+title: Getting Started
+description: A short tutorial on installing the software and running a simple benchmark.
+---
+
+# Getting Started
+
+The [AutoML Benchmark](https://openml.github.io/automlbenchmark/index.html) is a tool for benchmarking AutoML frameworks on tabular data.
+It automates the installation of AutoML frameworks, passing them data, and evaluating
+their predictions.
+[Our paper](https://arxiv.org/pdf/2207.12560.pdf) describes the design and showcases
+results from an evaluation using the benchmark.
+This guide goes over the minimum steps needed to evaluate an
+AutoML framework on a toy dataset.
+
+Full instructions can be found in the [API Documentation](https://openml.github.io/automlbenchmark/docs/).
+
+## Installation
+These instructions assume that [Python 3.9 (or higher)](https://www.python.org/downloads/)
+and [git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) are installed
+and available under the aliases `python` and `git`, respectively. We recommend
+[Pyenv](https://github.com/pyenv/pyenv) for managing multiple Python installations,
+if applicable. We support Ubuntu 22.04, but many Linux and macOS versions likely work
+(for macOS, it may be necessary to have [`brew`](https://brew.sh) installed).
+
+First, clone the repository:
+
+```bash
+git clone https://github.com/openml/automlbenchmark.git --branch stable --depth 1
+cd automlbenchmark
+```
+
+Create a virtual environment to install the dependencies in:
+
+### Linux
+
+```bash
+python -m venv venv
+source venv/bin/activate
+```
+
+### MacOS
+
+```bash
+python -m venv venv
+source venv/bin/activate
+```
+
+### Windows
+
+```bash
+python -m venv ./venv
+venv/Scripts/activate
+```
+
+Then install the dependencies:
+
+```bash
+python -m pip install --upgrade pip
+python -m pip install -r requirements.txt
+```
+
+
+??? windows "Note for Windows users"
+
+    The automated installation of AutoML frameworks is done using shell scripts,
+    which don't work on Windows. We recommend you use
+ [Docker](https://docs.docker.com/desktop/install/windows-install/) to run the
+ examples below. First, install and run `docker`.
+ Then, whenever there is a `python runbenchmark.py ...`
+ command in the tutorial, add `-m docker` to it (`python runbenchmark.py ... -m docker`).
+
+??? question "Problem with the installation?"
+
+ On some platforms, we need to ensure that requirements are installed sequentially.
+ Use `xargs -L 1 python -m pip install < requirements.txt` to do so. If problems
+ persist, [open an issue](https://github.com/openml/automlbenchmark/issues/new) with
+ the error and information about your environment (OS, Python version, pip version).
+
+
+## Running the Benchmark
+
+To run a benchmark, call the `runbenchmark.py` script and specify the framework to evaluate.
+
+See the [API Documentation](https://openml.github.io/automlbenchmark/docs/) for more information on the available parameters.
+
diff --git a/docs/automl/basic_example.md b/docs/automl/basic_example.md
new file mode 100644
index 00000000..848b16fe
--- /dev/null
+++ b/docs/automl/basic_example.md
@@ -0,0 +1,127 @@
+# Random Forest Baseline
+
+Let's try evaluating the `RandomForest` baseline, which uses [scikit-learn](https://scikit-learn.org/stable/)'s random forest:
+## Running the Benchmark
+### Linux
+
+```bash
+python runbenchmark.py randomforest
+```
+
+### MacOS
+
+```bash
+python runbenchmark.py randomforest
+```
+
+### Windows
+As noted in the installation instructions, Windows users need to install the AutoML
+frameworks (and baselines) in a container. Add `-m docker` to the command as shown:
+```bash
+python runbenchmark.py randomforest -m docker
+```
+
+!!! warning "Important"
+    Later examples only show invocations without `-m docker`,
+    but Windows users will need to run in a non-local mode such as `docker`.
+
+## Results
+After running the command, there will be a lot of output to the screen reporting
+what is currently happening. After a few minutes, the final results are shown and should
+look similar to this:
+
+```
+Summing up scores for current run:
+ id task fold framework constraint result metric duration seed
+openml.org/t/3913 kc2 0 RandomForest test 0.865801 auc 11.1 851722466
+openml.org/t/3913 kc2 1 RandomForest test 0.857143 auc 9.1 851722467
+ openml.org/t/59 iris 0 RandomForest test -0.120755 neg_logloss 8.7 851722466
+ openml.org/t/59 iris 1 RandomForest test -0.027781 neg_logloss 8.5 851722467
+openml.org/t/2295 cholesterol 0 RandomForest test -44.220800 neg_rmse 8.7 851722466
+openml.org/t/2295 cholesterol 1 RandomForest test -55.216500 neg_rmse 8.7 851722467
+```
+
+The result column denotes the performance of the framework on the test data, as measured by
+the metric listed in the metric column. The result is always reported such that higher is
+better; metrics for which lower is normally better are negated, as indicated by the
+`neg_` prefix.
+
+While running the command, the AutoML benchmark performed the following steps:
+
+ 1. It created a new virtual environment for the Random Forest experiment.
+    This environment can be found in `frameworks/randomforest/venv` and will be re-used
+    when you perform other experiments with `RandomForest`.
+ 2. It downloaded datasets from [OpenML](https://www.openml.org) complete with a
+ "task definition" which specifies [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html) folds.
+ 3. It evaluated `RandomForest` on each (task, fold)-combination in a separate subprocess, where:
+ 1. The framework (`RandomForest`) is initialized.
+ 2. The training data is passed to the framework for training.
+ 3. The test data is passed to the framework to make predictions on.
+      4. The predictions are passed back to the main process.
+ 4. The predictions are evaluated and reported on. They are printed to the console and
+ are stored in the `results` directory. There you will find:
+    1. `results/results.csv`: a file with all results from all benchmarks conducted on your machine (see the sketch below).
+ 2. `results/randomforest.test.test.local.TIMESTAMP`: a directory with more information about the run,
+ such as logs, predictions, and possibly other artifacts.
+
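+If you want to inspect these results programmatically, the aggregated file can be loaded
+with pandas. A minimal sketch (assuming pandas is installed; the column names follow the
+summary shown above):
+
+```python
+import pandas as pd
+
+# Load the aggregated results that the benchmark appends to after every run
+results = pd.read_csv("results/results.csv")
+print(results[["task", "fold", "framework", "constraint", "result", "metric"]])
+```
+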
+!!! info "Docker Mode"
+
+    When using docker mode (with `-m docker`), a docker image is built that contains
+    the virtual environment. Otherwise, it functions in much the same way.
+
+## Important Parameters
+
+As you can see from the results above, the default behavior is to execute a short test
+benchmark. However, we can specify a different benchmark, provide different constraints,
+and even run the experiment in a container or on AWS. There are many parameters
+for the `runbenchmark.py` script, but the most important ones are:
+
+### Framework (required)
+
+- The AutoML framework or baseline to evaluate; the name is not case-sensitive. See
+  [integrated frameworks](WEBSITE/frameworks.html) for a list of supported frameworks.
+  In the example above, this was `randomforest`.
+
+### Benchmark (optional, default='test')
+
+- The benchmark suite is the dataset or set of datasets to evaluate the framework on.
+  These can be defined on [OpenML](https://www.openml.org) as a [study or task](extending/benchmark.md#defining-a-benchmark-on-openml)
+ (formatted as `openml/s/X` or `openml/t/Y` respectively) or in a [local file](extending/benchmark.md#defining-a-benchmark-with-a-file).
+ The default is a short evaluation on two folds of `iris`, `kc2`, and `cholesterol`.
+
+### Constraints (optional, default='test')
+
+- The constraints applied to the benchmark as defined by default in [constraints.yaml](GITHUB/resources/constraints.yaml).
+  These include time constraints, memory constraints, the number of available CPU cores, and more.
+  The default constraint is `test` (2 folds for 10 minutes each).
+
+ !!! warning "Constraints are not enforced!"
+ These constraints are forwarded to the AutoML framework if possible but, except for
+ runtime constraints, are generally not enforced. It is advised when benchmarking
+ to use an environment that mimics the given constraints.
+
+ ??? info "Constraints can be overriden by `benchmark`"
+ A benchmark definition can override constraints on a task level.
+ This is useful if you want to define a benchmark which has different constraints
+ for different tasks. The default "test" benchmark does this to limit runtime to
+ 60 seconds instead of 600 seconds, which is useful to get quick results for its
+ small datasets. For more information, see [defining a benchmark](#ADD-link-to-adding-benchmark).
+
+### Mode (optional, default='local')
+
+- The benchmark can be run in four modes:
+
+ * `local`: install a local virtual environment and run the benchmark on your machine.
+ * `docker`: create a docker image with the virtual environment and run the benchmark in a container on your machine.
+ If a local or remote image already exists, that will be used instead. Requires [Docker](https://docs.docker.com/desktop/).
+ * `singularity`: create a singularity image with the virtual environment and run the benchmark in a container on your machine. Requires [Singularity](https://docs.sylabs.io/guides/3.5/user-guide/introduction.html).
+ * `aws`: run the benchmark on [AWS EC2](https://aws.amazon.com/free/?trk=b3f93e34-c1e0-4aa9-95f8-6d2c36891d8a&sc_channel=ps&ef_id=CjwKCAjw-7OlBhB8EiwAnoOEk0li05IUgU9Ok2uCdejP22Yr7ZuqtMeJZAdxgL5KZFaeOVskCAsknhoCSjUQAvD_BwE:G:s&s_kwcid=AL!4422!3!649687387631!e!!g!!aws%20ec2!19738730094!148084749082&all-free-tier.sort-by=item.additionalFields.SortRank&all-free-tier.sort-order=asc&awsf.Free%20Tier%20Types=*all&awsf.Free%20Tier%20Categories=*all) instances.
+ It is possible to run directly on the instance or have the EC2 instance run in `docker` mode.
+ Requires valid AWS credentials to be configured, for more information see [Running on AWS](#ADD-link-to-aws-guide).
+
+
+For a full list of parameters available, run:
+
+```
+python runbenchmark.py --help
+```
diff --git a/docs/automl/benchmark_on_openml.md b/docs/automl/benchmark_on_openml.md
new file mode 100644
index 00000000..ba45783e
--- /dev/null
+++ b/docs/automl/benchmark_on_openml.md
@@ -0,0 +1,29 @@
+# Example: Benchmarks on OpenML
+
+In the previous examples, we used benchmarks which were defined in a local file
+([test.yaml](GITHUB/resources/benchmarks/test.yaml) and
+[validation.yaml](GITHUB/resources/benchmarks/validation.yaml), respectively).
+However, we can also use tasks and
+benchmarking suites defined on OpenML directly from the command line. When referencing
+an OpenML task or suite, we can use `openml/t/ID` or `openml/s/ID`, respectively, as
+the argument for the benchmark parameter. Running on the [iris task](https://openml.org/t/59):
+
+```
+python runbenchmark.py randomforest openml/t/59
+```
+
+or on the entire [AutoML benchmark classification suite](https://openml.org/s/271) (this will take hours!):
+
+```
+python runbenchmark.py randomforest openml/s/271
+```
+
+!!! info "Large-scale Benchmarking"
+
+    For large-scale benchmarking it is advised to parallelize your experiments,
+    as otherwise they may take months to run.
+    The benchmark currently only supports native parallelization in `aws` mode
+    (via the `--parallel` parameter), but using the `--task` and `--fold` parameters
+    it is easy to generate scripts that invoke individual jobs, e.g. on a SLURM cluster
+    (see the sketch below this note).
+    When you run in any parallelized fashion, it is advised to run each process on
+    separate hardware to ensure experiments cannot interfere with each other.
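+
+As a rough illustration of the `--task`/`--fold` approach mentioned above, the sketch below
+writes one `runbenchmark.py` invocation per (task, fold) pair to a file, which you can then
+submit as individual jobs. It is a hypothetical helper, not part of the benchmark itself;
+replace the task list with the task names from your own benchmark definition.
+
+```python
+# Generate one command per (task, fold) pair so each job can be submitted
+# separately, e.g. to a SLURM cluster.
+tasks = ["eucalyptus"]  # placeholder: use the task names from your benchmark definition
+folds = range(2)        # the 'test' constraint uses 2 folds
+
+with open("jobs.txt", "w") as jobs:
+    for task in tasks:
+        for fold in folds:
+            jobs.write(f"python runbenchmark.py randomforest validation test -t {task} -f {fold}\n")
+```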
\ No newline at end of file
diff --git a/docs/automl/important_params.md b/docs/automl/important_params.md
new file mode 100644
index 00000000..a3155998
--- /dev/null
+++ b/docs/automl/important_params.md
@@ -0,0 +1,56 @@
+# Important Parameters
+
+As you can see from the results above, the default behavior is to execute a short test
+benchmark. However, we can specify a different benchmark, provide different constraints,
+and even run the experiment in a container or on AWS. There are many parameters
+for the `runbenchmark.py` script, but the most important ones are:
+
+`Framework (required)`
+
+- The AutoML framework or baseline to evaluate; the name is not case-sensitive. See
+  [integrated frameworks](WEBSITE/frameworks.html) for a list of supported frameworks.
+  In the example above, this was `randomforest`.
+
+`Benchmark (optional, default='test')`
+
+- The benchmark suite is the dataset or set of datasets to evaluate the framework on.
+  These can be defined on [OpenML](https://www.openml.org) as a [study or task](extending/benchmark.md#defining-a-benchmark-on-openml)
+ (formatted as `openml/s/X` or `openml/t/Y` respectively) or in a [local file](extending/benchmark.md#defining-a-benchmark-with-a-file).
+ The default is a short evaluation on two folds of `iris`, `kc2`, and `cholesterol`.
+
+`Constraints (optional, default='test')`
+
+- The constraints applied to the benchmark as defined by default in [constraints.yaml](GITHUB/resources/constraints.yaml).
+  These include time constraints, memory constraints, the number of available CPU cores, and more.
+  The default constraint is `test` (2 folds for 10 minutes each).
+
+ !!! warning "Constraints are not enforced!"
+ These constraints are forwarded to the AutoML framework if possible but, except for
+ runtime constraints, are generally not enforced. It is advised when benchmarking
+ to use an environment that mimics the given constraints.
+
+ ??? info "Constraints can be overriden by `benchmark`"
+ A benchmark definition can override constraints on a task level.
+ This is useful if you want to define a benchmark which has different constraints
+ for different tasks. The default "test" benchmark does this to limit runtime to
+ 60 seconds instead of 600 seconds, which is useful to get quick results for its
+ small datasets. For more information, see [defining a benchmark](#ADD-link-to-adding-benchmark).
+
+`Mode (optional, default='local')`
+
+- The benchmark can be run in four modes:
+
+ * `local`: install a local virtual environment and run the benchmark on your machine.
+ * `docker`: create a docker image with the virtual environment and run the benchmark in a container on your machine.
+ If a local or remote image already exists, that will be used instead. Requires [Docker](https://docs.docker.com/desktop/).
+ * `singularity`: create a singularity image with the virtual environment and run the benchmark in a container on your machine. Requires [Singularity](https://docs.sylabs.io/guides/3.5/user-guide/introduction.html).
+ * `aws`: run the benchmark on [AWS EC2](https://aws.amazon.com/free/?trk=b3f93e34-c1e0-4aa9-95f8-6d2c36891d8a&sc_channel=ps&ef_id=CjwKCAjw-7OlBhB8EiwAnoOEk0li05IUgU9Ok2uCdejP22Yr7ZuqtMeJZAdxgL5KZFaeOVskCAsknhoCSjUQAvD_BwE:G:s&s_kwcid=AL!4422!3!649687387631!e!!g!!aws%20ec2!19738730094!148084749082&all-free-tier.sort-by=item.additionalFields.SortRank&all-free-tier.sort-order=asc&awsf.Free%20Tier%20Types=*all&awsf.Free%20Tier%20Categories=*all) instances.
+ It is possible to run directly on the instance or have the EC2 instance run in `docker` mode.
+ Requires valid AWS credentials to be configured, for more information see [Running on AWS](#ADD-link-to-aws-guide).
+
+
+For a full list of parameters available, run:
+
+```
+python runbenchmark.py --help
+```
\ No newline at end of file
diff --git a/docs/automl/specific_task_fold_example.md b/docs/automl/specific_task_fold_example.md
new file mode 100644
index 00000000..7b894737
--- /dev/null
+++ b/docs/automl/specific_task_fold_example.md
@@ -0,0 +1,27 @@
+# Example: AutoML on a specific task and fold
+
+The defaults are very useful for performing a quick test, as the datasets are small
+and cover different task types (binary classification, multiclass classification, and
+regression). We also have a ["validation" benchmark](GITHUB/resources/benchmarks/validation.yaml)
+suite for more elaborate testing that also includes missing data, categorical data,
+wide data, and more. The benchmark defines 9 tasks, and evaluating two folds with a
+10-minute time constraint would take roughly 3 hours (=9 tasks * 2 folds * 10 minutes,
+plus overhead). Let's instead use the `--task` and `--fold` parameters to run only a
+specific task and fold in the `benchmark` when evaluating the
+[flaml](https://microsoft.github.io/FLAML/) AutoML framework:
+
+```
+python runbenchmark.py flaml validation test -t eucalyptus -f 0
+```
+
+This should take about 10 minutes plus the time it takes to install `flaml`.
+Results should look roughly like this:
+
+```
+Processing results for flaml.validation.test.local.20230711T122823
+Summing up scores for current run:
+ id task fold framework constraint result metric duration seed
+openml.org/t/2079 eucalyptus 0 flaml test -0.702976 neg_logloss 611.0 1385946458
+```
+
+Similarly to the test run, you will find additional files in the `results` directory.
diff --git a/docs/concepts/.pages b/docs/concepts/.pages
new file mode 100644
index 00000000..e69de29b
diff --git a/docs/concepts/benchmarking.md b/docs/concepts/benchmarking.md
new file mode 100644
index 00000000..8250b980
--- /dev/null
+++ b/docs/concepts/benchmarking.md
@@ -0,0 +1,19 @@
+# Collections and benchmarks
+You can combine tasks and runs into collections, to run experiments across many tasks at once and collect all results. Each collection gets its own page, which can be linked to publications so that others can find all the details online.
+
+## Benchmarking suites
+Collections of tasks can be published as _benchmarking suites_. Seamlessly integrated into the OpenML platform, benchmark suites standardize the setup, execution, analysis, and reporting of benchmarks. Moreover, they make benchmarking a whole lot easier:
+
+- all datasets are uniformly formatted in standardized data formats
+- they can be easily downloaded programmatically through APIs and client libraries
+- they come with machine-readable meta-information, such as the occurrence of missing values, to train algorithms correctly
+- standardized train-test splits are provided to ensure that results can be objectively compared
+- results can be shared in a reproducible way through the APIs
+- results from other users can be easily downloaded and reused
+
+You can search for all existing benchmarking suites or create your own. For all further details, see the [benchmarking guide](benchmark).
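+
+For example, with the Python API you can fetch a suite and iterate over its tasks in a few lines. A minimal sketch, assuming the `openml` Python package and using the OpenML-CC18 suite as an example:
+
+``` python
+    import openml
+
+    # Download a benchmarking suite and loop over the tasks it contains
+    suite = openml.study.get_suite("OpenML-CC18")
+    for task_id in suite.tasks:
+        task = openml.tasks.get_task(task_id)
+```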
+
+
+
+## Benchmark studies
+Collections of runs can be published as _benchmarking studies_. They contain the results of all runs (possibly millions) executed on a specific benchmarking suite. OpenML allows you to easily download all such results at once via the APIs, but also to visualize them online in the Analysis tab (next to the complete list of included tasks and runs). Below is an example of a benchmark study for AutoML algorithms.
+
+
\ No newline at end of file
diff --git a/docs/concepts/data.md b/docs/concepts/data.md
new file mode 100644
index 00000000..b1537c1d
--- /dev/null
+++ b/docs/concepts/data.md
@@ -0,0 +1,54 @@
+# Data
+## Discovery
+OpenML allows fine-grained search over thousands of machine learning datasets. Via the website, you can filter by many dataset properties, such as size, type, format, and many more. Via the [APIs](https://www.openml.org/apis) you have access to many more filters, and you can download a complete table with statistics of all datasets. Via the APIs you can also load datasets directly into your preferred data structures such as numpy ([example in Python](https://openml.github.io/openml-python/main/examples/20_basic/simple_datasets_tutorial.html#sphx-glr-examples-20-basic-simple-datasets-tutorial-py)). We are also working on better organization of all datasets by topic.
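+
+For example, with the Python API you can list and load datasets in a few lines. A minimal sketch, assuming the `openml` Python package (the dataset ID is illustrative):
+
+``` python
+    import openml
+
+    # Overview table with statistics of all datasets
+    datasets = openml.datasets.list_datasets(output_format="dataframe")
+
+    # Load one dataset (here: Iris, ID 61) directly into numpy/pandas structures
+    dataset = openml.datasets.get_dataset(61)
+    X, y, categorical, names = dataset.get_data(target=dataset.default_target_attribute)
+```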
+
+
+
+
+## Sharing
+You can upload and download datasets through the website or through the [APIs](https://www.openml.org/apis) (recommended). You can share data directly from common data science libraries, e.g. from Python or R dataframes, in a few lines of code. The OpenML APIs will automatically extract lots of meta-data and store all datasets in a uniform format.
+
+``` python
+ import pandas as pd
+ import openml as oml
+
+ # Create an OpenML dataset from a pandas dataframe
+ df = pd.DataFrame(data, columns=attribute_names)
+ my_data = oml.datasets.functions.create_dataset(
+ name="covertype", description="Predicting forest cover ...",
+ licence="CC0", data=df
+ )
+
+ # Share the dataset on OpenML
+ my_data.publish()
+```
+
+Every dataset gets a dedicated page on OpenML with all known information, and can be edited further online.
+
+
+
+
+Data hosted elsewhere can be referenced by URL. We are also working on interconnecting OpenML with other machine learning dataset repositories.
+
+## Automated analysis
+OpenML will automatically analyze the data and compute a range of data quality characteristics. These include simple statistics such as the number of examples and features, but also potential quality issues (e.g. missing values) and more advanced statistics (e.g. the mutual information in the features and benchmark performances of simple models). These can be useful to find, filter and compare datasets, or to automate data preprocessing. We are also working on simple metrics and automated dataset quality reports.
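+
+These characteristics (the dataset's _qualities_) can also be retrieved programmatically. A minimal sketch, assuming the `openml` Python package:
+
+``` python
+    import openml
+
+    # Inspect some automatically computed data quality characteristics
+    dataset = openml.datasets.get_dataset(61)
+    print(dataset.qualities["NumberOfInstances"])
+    print(dataset.qualities["NumberOfMissingValues"])
+```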
+
+The Analysis tab (see image below, or try it live) also shows an automated and interactive analysis of all datasets. This runs on open-source Python code via Dash, and we welcome all contributions.
+
+
+
+
+The third tab, 'Tasks', lists all tasks created on the dataset. More on that below.
+
+## Dataset ID and versions
+A dataset can be uniquely identified by its dataset ID, which is shown on the website and returned by the API. It's `1596` in the `covertype` example above. They can also be referenced by name and ID. OpenML assigns incremental version numbers per upload with the same name. You can also add a free-form `version_label` with every upload.
+
+## Dataset status
+When you upload a dataset, it will be marked `in_preparation` until it is (automatically) verified. Once approved, the dataset will become `active` (or `verified`). If a severe issue has been found with a dataset, it can become `deactivated` (or `deprecated`) signaling that it should not be used. By default, dataset search only returns verified datasets, but you can access and download datasets with any status.
+
+## Special attributes
+Machine learning datasets often have special attributes that require special handling in order to build useful models. OpenML marks these as special attributes.
+
+A `target` attribute is the column that is to be predicted, also known as the dependent variable. Datasets can have a default target attribute set by the author, but OpenML tasks can also overrule this. Example: the default target variable for the MNIST dataset is to predict the class from the pixel values, and most supervised tasks will have the class as their target. However, one can also create a task aimed at predicting the value of pixel257 given all the other pixel values and the class column.
+
+`Row id` attributes indicate externally defined row IDs (e.g. `instance` in dataset 164). `Ignore` attributes are other columns that should not be included in training data (e.g. `Player` in dataset 185). OpenML will clearly mark these, and will (by default) drop these columns when constructing training sets.
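+
+These special attributes are also exposed by the APIs. A minimal sketch, assuming the `openml` Python package (the dataset ID is illustrative):
+
+``` python
+    import openml
+
+    # Inspect the special attributes of a dataset
+    dataset = openml.datasets.get_dataset(185)
+    print(dataset.default_target_attribute)  # default target column
+    print(dataset.row_id_attribute)          # externally defined row IDs, if any
+    print(dataset.ignore_attribute)          # columns excluded from training data, if any
+```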
\ No newline at end of file
diff --git a/docs/concepts/flows.md b/docs/concepts/flows.md
new file mode 100644
index 00000000..7e768af3
--- /dev/null
+++ b/docs/concepts/flows.md
@@ -0,0 +1,44 @@
+# Flows
+
+Flows are machine learning pipelines, models, or scripts. They are typically uploaded directly from machine learning libraries (e.g. scikit-learn, pyTorch, TensorFlow, MLR, WEKA,...) via the corresponding [APIs](https://www.openml.org/apis). Associated code (e.g., on GitHub) can be referenced by URL.
+
+## Analysing algorithm performance
+
+Every flow gets a dedicated page with all known information. The Analysis tab shows an automated interactive analysis of all collected results. For instance, below are the results of a scikit-learn pipeline including missing value imputation, feature encoding, and a RandomForest model. It shows the results across multiple tasks, and how the AUC score is affected by certain hyperparameters.
+
+
+
+
+This helps to better understand specific models, as well as their strengths and weaknesses.
+
+## Automated sharing
+
+When you evaluate algorithms and share the results, OpenML will automatically extract all the details of the algorithm (dependencies, structure, and all hyperparameters), and upload them in the background.
+
+``` python
+ from sklearn import ensemble
+ from openml import tasks, runs
+
+ # Build any model you like.
+ clf = ensemble.RandomForestClassifier()
+
+    # Evaluate the model on an OpenML task (e.g. task 3954)
+    task = tasks.get_task(3954)
+    run = runs.run_model_on_task(clf, task)
+
+ # Share the results, including the flow and all its details.
+ run.publish()
+```
+
+## Reproducing algorithms and experiments
+
+Given an OpenML run, the exact same algorithm or model, with exactly the same hyperparameters, can be reconstructed within the same machine learning library to easily reproduce earlier results.
+
+``` python
+    import openml
+
+    # Rebuild the (scikit-learn) pipeline from run 9864498
+    model = openml.runs.initialize_model_from_run(9864498)
+```
+
+!!! note
+    You may need the exact same library version to reconstruct flows. The API will always state the required version. We aim to add support for VMs so that flows can be easily (re)run in any environment.
\ No newline at end of file
diff --git a/docs/concepts/index.md b/docs/concepts/index.md
new file mode 100644
index 00000000..224e096f
--- /dev/null
+++ b/docs/concepts/index.md
@@ -0,0 +1,22 @@
+# Concepts
+OpenML operates on a number of core concepts which are important to understand:
+
+**:fa-database: Datasets**
+Datasets are pretty straightforward. Tabular datasets are self-contained, consisting of a number of rows (_instances_) and columns (_features_), including their data types. Other
+modalities (e.g. images) are included via paths to files stored within the same folder.
+Datasets are uniformly formatted ([S3](https://min.io/product/s3-compatibility) buckets with [Parquet](https://parquet.apache.org/) tables, [JSON](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Objects/JSON) metadata, and media files), and are auto-converted and auto-loaded in your desired format by the [APIs](https://www.openml.org/apis) (e.g. in [Python](https://openml.github.io/openml-python/main/)) in a single line of code.
+_Example: The Iris dataset or the Plankton dataset_
+
+
+**:fa-trophy: Tasks**
+A task consists of a dataset, together with a machine learning task to perform, such as classification or clustering, and an evaluation method. For
+supervised tasks, this also specifies the target column in the data.
+_Example: Classifying different iris species from other attributes and evaluating the results using 10-fold cross-validation._
+
+**:fa-cogs: Flows**
+A flow identifies a particular machine learning algorithm (a pipeline or untrained model) from a particular library or framework, such as scikit-learn, pyTorch, or MLR. It contains details about the structure of the model/pipeline, dependencies (e.g. the library and its version) and a list of settable hyperparameters. In short, it is a serialized description of the algorithm that in many cases can also be deserialized to reinstantiate the exact same algorithm in a particular library.
+_Example: scikit-learn's RandomForest or a simple TensorFlow model_
+
+**:fa-star: Runs**
+A run is an experiment - it evaluates a particular flow (pipeline/model) with particular hyperparameter settings, on a particular task. Depending on the task it will include certain results, such as model evaluations (e.g. accuracies), model predictions, and other output files (e.g. the trained model).
+_Example: Classifying Gamma rays with scikit-learn's RandomForest_
\ No newline at end of file
diff --git a/docs/concepts/openness.md b/docs/concepts/openness.md
new file mode 100644
index 00000000..4b2c18c7
--- /dev/null
+++ b/docs/concepts/openness.md
@@ -0,0 +1,6 @@
+# Openness and Authentication
+You can download and inspect all datasets, tasks, flows and runs through the
+website or the API without creating an account. However, if you want to upload
+datasets or experiments, you need to create an account, sign in, and find your API key on your profile page.
+
+This key can then be used with any of the [OpenML APIs](APIs).
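+
+For example, in the Python API the key can be set in code or stored in the openml configuration file. A minimal sketch, assuming the `openml` Python package:
+
+``` python
+    import openml
+
+    # Use the API key from your OpenML profile page for authenticated calls
+    openml.config.apikey = "YOURAPIKEY"
+```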
\ No newline at end of file
diff --git a/docs/concepts/runs.md b/docs/concepts/runs.md
new file mode 100644
index 00000000..d5a3fc12
--- /dev/null
+++ b/docs/concepts/runs.md
@@ -0,0 +1,16 @@
+# Runs
+
+## Automated reproducible evaluations
+Runs are experiments (benchmarks) evaluating a specific flow on a specific task. As shown above, they are typically submitted automatically by machine learning
+libraries through the OpenML [APIs](https://www.openml.org/apis), including lots of automatically extracted meta-data, to create reproducible experiments. With a few for-loops you can easily run (and share) millions of experiments.
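+
+A minimal sketch of such a loop, assuming the `openml` Python package, scikit-learn, and a configured API key (the suite is illustrative):
+
+``` python
+    import openml
+    from sklearn import ensemble
+
+    clf = ensemble.RandomForestClassifier()
+
+    # Evaluate the same model on every task of a benchmarking suite and share the runs
+    suite = openml.study.get_suite("OpenML-CC18")
+    for task_id in suite.tasks:
+        task = openml.tasks.get_task(task_id)
+        run = openml.runs.run_model_on_task(clf, task)
+        run.publish()
+```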
+
+## Online organization
+OpenML organizes all runs online, linked to the underlying data, flows, parameter settings, people, and other details. See the many examples above, where every dot in the scatterplots is a single OpenML run.
+
+## Independent (server-side) evaluation
+OpenML runs include all information needed to independently evaluate models. For most tasks, this includes all predictions, for all train-test splits, for all instances in the dataset, including all class confidences. When a run is uploaded, OpenML automatically evaluates every run using a wide array of evaluation metrics. This makes them directly comparable with all other runs shared on OpenML. For completeness, OpenML will also upload locally computed evaluation metrics and runtimes.
+
+New metrics can also be added to OpenML's evaluation engine, and computed for all runs afterwards. Or, you can download OpenML runs and analyse the results any way you like.
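+
+For example, the server-side evaluations can be downloaded in bulk through the Python API. A minimal sketch, assuming the `openml` Python package (the filter values are illustrative):
+
+``` python
+    import openml
+
+    # Download server-side evaluations as a dataframe, filtered by metric and task
+    evals = openml.evaluations.list_evaluations(
+        function="predictive_accuracy", tasks=[3954], output_format="dataframe"
+    )
+```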
+
+!!! note
+ Please note that while OpenML tries to maximise reproducibility, exactly reproducing all results may not always be possible because of changes in numeric libraries, operating systems, and hardware.
\ No newline at end of file
diff --git a/docs/concepts/sharing.md b/docs/concepts/sharing.md
new file mode 100644
index 00000000..d4d3f6a0
--- /dev/null
+++ b/docs/concepts/sharing.md
@@ -0,0 +1,2 @@
+# Sharing (under construction)
+Currently, anything on OpenML can be shared publicly or kept private to a single user. We are working on sharing features that allow you to share your materials with other users without making them entirely public. Watch this space.
diff --git a/docs/concepts/tagging.md b/docs/concepts/tagging.md
new file mode 100644
index 00000000..1f296155
--- /dev/null
+++ b/docs/concepts/tagging.md
@@ -0,0 +1,6 @@
+# Tagging
+Datasets, tasks, runs and flows can be assigned tags, either via the web
+interface or the API. These tags can be used to search and annotate datasets, or simply to better organize your own datasets and experiments.
+
+For example, the tag `OpenML-CC18` refers to all tasks included in the OpenML-CC18 benchmarking suite.
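+
+A minimal sketch of tagging and searching by tag with the Python API, assuming the `openml` package and a configured API key (the tag name is illustrative):
+
+``` python
+    import openml
+
+    # Tag a dataset, then list all tasks carrying a given tag
+    dataset = openml.datasets.get_dataset(61)
+    dataset.push_tag("my-experiment")
+    cc18_tasks = openml.tasks.list_tasks(tag="OpenML-CC18")
+```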
diff --git a/docs/concepts/tasks.md b/docs/concepts/tasks.md
new file mode 100644
index 00000000..3e154432
--- /dev/null
+++ b/docs/concepts/tasks.md
@@ -0,0 +1,39 @@
+# Tasks
+Tasks describe what to do with the data. OpenML covers several task types, such as classification and clustering. Tasks are containers that include the data and other information such as train/test splits, and define what needs to be returned. They are machine-readable so that you can automate machine learning experiments, and easily compare algorithm evaluations (using the exact same train-test splits) against all other benchmarks shared by others on OpenML.
+
+## Collaborative benchmarks
+
+Tasks are real-time, collaborative benchmarks (e.g. see
+MNIST below). In the Analysis tab, you can view timelines and leaderboards, and learn from all prior submissions to design even better algorithms.
+
+
+
+
+## Discover the best algorithms
+All algorithms evaluated on the same task (with the same train-test splits) can be directly compared to each other, so you can easily look up which algorithms perform best overall, and download their exact configurations. Likewise, you can look up the best algorithms for _similar_ tasks to know what to try first.
+
+
+
+
+## Automating benchmarks
+You can search and download existing tasks, evaluate your algorithms, and automatically share the results (which are stored in a _run_). Here's what this looks like in the Python API. You can do the same across hundreds of tasks at once.
+
+``` python
+ from sklearn import ensemble
+ from openml import tasks, runs
+
+ # Build any model you like
+ clf = ensemble.RandomForestClassifier()
+
+ # Download any OpenML task (includes the datasets)
+ task = tasks.get_task(3954)
+
+ # Automatically evaluate your model on the task
+ run = runs.run_model_on_task(clf, task)
+
+ # Share the results on OpenML.
+ run.publish()
+```
+
+You can create new tasks via the website or [via the APIs](https://www.openml.org/apis) as well.
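+
+A minimal sketch of creating a task through the Python API, assuming the `openml` package and a configured API key (the dataset ID, target name, and estimation procedure ID are illustrative):
+
+``` python
+    import openml
+    from openml.tasks import TaskType
+
+    # Define a new supervised classification task on an existing dataset
+    new_task = openml.tasks.create_task(
+        task_type=TaskType.SUPERVISED_CLASSIFICATION,
+        dataset_id=1596,
+        target_name="class",
+        estimation_procedure_id=1,  # e.g. 10-fold cross-validation
+    )
+    new_task.publish()
+```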
\ No newline at end of file
diff --git a/docs/Client-API-Standards.md b/docs/contributing/Client-API-Standards.md
similarity index 100%
rename from docs/Client-API-Standards.md
rename to docs/contributing/Client-API-Standards.md
diff --git a/docs/Communication-Channels.md b/docs/contributing/Communication-Channels.md
similarity index 100%
rename from docs/Communication-Channels.md
rename to docs/contributing/Communication-Channels.md
diff --git a/docs/Contributing.md b/docs/contributing/Contributing.md
similarity index 100%
rename from docs/Contributing.md
rename to docs/contributing/Contributing.md
diff --git a/docs/Core-team.md b/docs/contributing/Core-team.md
similarity index 100%
rename from docs/Core-team.md
rename to docs/contributing/Core-team.md
diff --git a/docs/Datasets.md b/docs/contributing/Datasets.md
similarity index 100%
rename from docs/Datasets.md
rename to docs/contributing/Datasets.md
diff --git a/docs/contributing/OpenML-Docs.md b/docs/contributing/OpenML-Docs.md
new file mode 100644
index 00000000..b1ae945a
--- /dev/null
+++ b/docs/contributing/OpenML-Docs.md
@@ -0,0 +1,52 @@
+## General Documentation
+The general documentation (the one you are reading now) is written in Markdown and can be easily edited by clicking the edit button
+(the pencil icon) at the top of every page. It will open an editing page on [GitHub](https://github.com/) (you do need to be logged in on GitHub). When you are done, add a small message explaining the change and click 'commit changes'. On the next page, just launch the pull request. We will then review it and approve the changes, or discuss them if necessary.
+
+The sources are generated by [MkDocs](http://www.mkdocs.org/), using the [Material theme](https://squidfunk.github.io/mkdocs-material/).
+Check these docs to see what is possible in terms of styling.
+
+!!! Deployment
+ To deploy the documentation manually, you need to have MkDocs and MkDocs-Material installed:
+ ```
+ pip install mkdocs
+ pip install mkdocs-material
+ pip install fontawesome_markdown
+ ```
+ To deploy the documentation locally, run `mkdocs serve` in the top directory (with the `mkdocs.yml` file). Any changes made after that will be hot-loaded.
+
+ The documentation will be auto-deployed with every push or merge with the master branch of `https://www.github.com/openml/docs/`. In the background, a CI job
+ will run `mkdocs gh-deploy`, which will build the HTML files and push them to the gh-pages branch of openml/docs. `https://docs.openml.org` is just a reverse proxy for `https://openml.github.io/docs/`.
+
+
+## REST API
+The REST API is documented using Swagger.io, in YAML. This generates a nice web interface that also allows trying out the API calls using your own API key (when you are logged in).
+
+You can edit the sources on [SwaggerHub](https://app.swaggerhub.com/apis/openml/openml/1.0.0). When you are done, export to JSON and replace the [downloads/swagger.json](https://github.com/openml/OpenML/blob/master/downloads/swagger.json) file in the OpenML main GitHub repository. You need to open a pull request, which we will then review. When we merge the new file, the changes are immediately available.
+
+The [data API](https://app.swaggerhub.com/apis/openml/openml_file/1.0.0) can be edited in the same way.
+
+## Python API
+To edit the tutorial, you have to edit the `reStructuredText` files on [openml-python/doc](https://github.com/openml/openml-python/tree/master/doc). When done, you can do a pull request.
+
+To edit the documentation of the python functions, edit the docstrings in the [Python code](https://github.com/openml/openml-python/openml). When done, you can do a pull request.
+
+!!! note
+ Developers: A CircleCI job will automatically render the documentation on every GitHub commit, using [Sphinx](http://www.sphinx-doc.org/en/stable/).
+
+## R API
+To edit the tutorial, you have to edit the `Rmarkdown` files on [openml-r/vignettes](https://github.com/openml/openml-r/tree/master/vignettes).
+
+To edit the documentation of the R functions, edit the Roxygen documentation next to the functions in the [R code](https://github.com/openml/openml-r/R).
+
+!!! note
+ Developers: A Travis job will automatically render the documentation on every GitHub commit, using [knitr](https://yihui.name/knitr/). The Roxygen documentation is updated every time a new version is released on CRAN.
+
+## Java API
+The Java Tutorial is written in markdown and can be edited the usual way (see above).
+
+To edit the documentation of the Java functions, edit the documentation next to the functions in the [Java code](https://github.com/openml/java/apiconnector).
+
+- Javadocs: https://www.openml.org/docs/
+
+!!! note
+ Developers: A Travis job will automatically render the documentation on every GitHub commit, using [Javadoc](http://www.oracle.com/technetwork/java/javase/tech/index-137868.html).
diff --git a/docs/OpenML_definition.md b/docs/contributing/OpenML_definition.md
similarity index 100%
rename from docs/OpenML_definition.md
rename to docs/contributing/OpenML_definition.md
diff --git a/docs/contributing/Visual-Guidelines.md b/docs/contributing/Visual-Guidelines.md
new file mode 100644
index 00000000..51771a76
--- /dev/null
+++ b/docs/contributing/Visual-Guidelines.md
@@ -0,0 +1,14 @@
+# Visual Guidelines
+
+This page contains some visual guidelines that might be useful for you, dear contributor. While these guidelines are not mandatory, they would make the OpenML experience more pleasant and consistent for everyone.
+
+## Colors
+
+- Primary color: #1E88E5
+- Primary color (dark): #000482
+- Primary color (light): #b5b7ff
+
+
+## Logos
+- 
+- 
\ No newline at end of file
diff --git a/docs/API-development.md b/docs/contributing/backend/API-development.md
similarity index 100%
rename from docs/API-development.md
rename to docs/contributing/backend/API-development.md
diff --git a/docs/Java-App.md b/docs/contributing/backend/Java-App.md
similarity index 100%
rename from docs/Java-App.md
rename to docs/contributing/backend/Java-App.md
diff --git a/docs/Local-Installation.md b/docs/contributing/backend/Local-Installation.md
similarity index 100%
rename from docs/Local-Installation.md
rename to docs/contributing/backend/Local-Installation.md
diff --git a/docs/resources.md b/docs/contributing/resources.md
similarity index 100%
rename from docs/resources.md
rename to docs/contributing/resources.md
diff --git a/docs/terms.md b/docs/contributing/terms.md
similarity index 100%
rename from docs/terms.md
rename to docs/contributing/terms.md
diff --git a/docs/Dash.md b/docs/contributing/website/Dash.md
similarity index 100%
rename from docs/Dash.md
rename to docs/contributing/website/Dash.md
diff --git a/docs/Flask.md b/docs/contributing/website/Flask.md
similarity index 100%
rename from docs/Flask.md
rename to docs/contributing/website/Flask.md
diff --git a/docs/React.md b/docs/contributing/website/React.md
similarity index 100%
rename from docs/React.md
rename to docs/contributing/website/React.md
diff --git a/docs/Website.md b/docs/contributing/website/Website.md
similarity index 93%
rename from docs/Website.md
rename to docs/contributing/website/Website.md
index eef94a56..9ace2652 100644
--- a/docs/Website.md
+++ b/docs/contributing/website/Website.md
@@ -70,6 +70,6 @@ npm run dev
The website is built on the following components:
-* A [Flask backend](../Flask). Written in Python, the backend takes care of all communication with the OpenML server. It builds on top of the OpenML Python API. It also takes care of user authentication and keeps the search engine (ElasticSearch) up to date with the latest information from the server. Files are located in the `server` folder.
-* A [React frontend](../React). Written in JavaScript, this takes care of rendering the website. It pulls in information from the search engine, and shows plots rendered by Dash. It also contains forms (e.g. for logging in or uploading new datasets), which will be sent off to the backend for processing. Files are located in `server/src/client/app`.
-* [Dash dashboards](../Dash). Written in Python, Dash is used for writing interactive plots. It pulls in data from the Python API, and renders the plots as React components. Files are located in `server/src/dashboard`.
+* A [Flask backend](Flask.md). Written in Python, the backend takes care of all communication with the OpenML server. It builds on top of the OpenML Python API. It also takes care of user authentication and keeps the search engine (ElasticSearch) up to date with the latest information from the server. Files are located in the `server` folder.
+* A [React frontend](React.md). Written in JavaScript, this takes care of rendering the website. It pulls in information from the search engine, and shows plots rendered by Dash. It also contains forms (e.g. for logging in or uploading new datasets), which will be sent off to the backend for processing. Files are located in `server/src/client/app`.
+* [Dash dashboards](Dash.md). Written in Python, Dash is used for writing interactive plots. It pulls in data from the Python API, and renders the plots as React components. Files are located in `server/src/dashboard`.
diff --git a/docs/css/extra.css b/docs/css/extra.css
index a1bbc73b..d1e7db6d 100644
--- a/docs/css/extra.css
+++ b/docs/css/extra.css
@@ -42,3 +42,49 @@ img[alt="icon"] {
margin-left: -45px;
}
}
+table {
+  display: block;
+  max-width: -moz-fit-content;
+  max-width: fit-content;
+  margin: 0 auto;
+  overflow-x: auto;
+  white-space: nowrap;
+}
+
+:root {
+  --md-primary-fg-color: #1E88E5;
+  --md-primary-fg-color--light: #000482;
+  --md-primary-fg-color--dark: #b5b7ff;
+}
+
+.card-container {
+  display: flex;
+  flex-wrap: wrap;
+  gap: 20px;
+  justify-content: center;
+}
+
+.card {
+  border: 1px solid #ccc;
+  border-radius: 5px;
+  padding: 20px;
+  width: 300px;
+  box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);
+}
+
+.card h2 {
+  margin-top: 0;
+}
+
+.card p {
+  margin-bottom: 0;
+}
+
+.github-logo {
+  height: 15px;
+  width: 13px;
+  margin-left: 10px;
+}
+
+iframe[seamless] {
+  border: none;
+}
\ No newline at end of file
diff --git a/docs/img/logo-github.svg b/docs/img/logo-github.svg
new file mode 100644
index 00000000..2a38d6f6
--- /dev/null
+++ b/docs/img/logo-github.svg
@@ -0,0 +1,12 @@
+
+
+
+
diff --git a/docs/index.md b/docs/index.md
index e2b82882..db88998d 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -10,216 +10,65 @@
Make your work more visible and reusable
Built for automation: streamline your experiments and model building
-## Concepts
-OpenML operates on a number of core concepts which are important to understand:
+## Installation
-**:fa-database: Datasets**
-Datasets are pretty straight-forward. Tabular datasets are self-contained, consisting of a number of rows (_instances_) and columns (features), including their data types. Other
-modalities (e.g. images) are included via paths to files stored within the same folder.
-Datasets are uniformly formatted ([S3](https://min.io/product/s3-compatibility) buckets with [Parquet](https://parquet.apache.org/) tables, [JSON](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Objects/JSON) metadata, and media files), and are auto-converted and auto-loaded in your desired format by the [APIs](https://www.openml.org/apis) (e.g. in [Python](https://openml.github.io/openml-python/main/)) in a single line of code.
-_Example: The Iris dataset or the Plankton dataset_
+OpenML client packages are available for many languages and machine learning libraries. For more information about them, see the [Integrations](./integrations/index.md) page.
+=== "Python/sklearn"
-**:fa-trophy: Tasks**
-A task consists of a dataset, together with a machine learning task to perform, such as classification or clustering and an evaluation method. For
-supervised tasks, this also specifies the target column in the data.
-_Example: Classifying different iris species from other attributes and evaluate using 10-fold cross-validation._
+ - [Python/sklearn repository](https://github.com/openml/openml-python)
+ - `pip install openml`
-**:fa-cogs: Flows**
-A flow identifies a particular machine learning algorithm (a pipeline or untrained model) from a particular library or framework, such as scikit-learn, pyTorch, or MLR. It contains details about the structure of the model/pipeline, dependencies (e.g. the library and its version) and a list of settable hyperparameters. In short, it is a serialized description of the algorithm that in many cases can also be deserialized to reinstantiate the exact same algorithm in a particular library.
-_Example: scikit-learn's RandomForest or a simple TensorFlow model_
+=== "Pytorch"
-**:fa-star: Runs**
-A run is an experiment - it evaluates a particular flow (pipeline/model) with particular hyperparameter settings, on a particular task. Depending on the task it will include certain results, such as model evaluations (e.g. accuracies), model predictions, and other output files (e.g. the trained model).
-_Example: Classifying Gamma rays with scikit-learn's RandomForest_
+ - [Pytorch repository](https://github.com/openml/openml-pytorch)
+ - `pip install openml-pytorch`
+=== "Keras"
-## Data
-### Discovery
-OpenML allows fine-grained search over thousands of machine learning datasets. Via the website, you can filter by many dataset properties, such as size, type, format, and many more. Via the [APIs](https://www.openml.org/apis) you have access to many more filters, and you can download a complete table with statistics of all datasest. Via the APIs you can also load datasets directly into your preferred data structures such as numpy ([example in Python](https://openml.github.io/openml-python/main/examples/20_basic/simple_datasets_tutorial.html#sphx-glr-examples-20-basic-simple-datasets-tutorial-py)). We are also working on better organization of all datasets by topic
+ - [Keras repository](https://github.com/openml/openml-keras)
+ - `pip install openml-keras`
-
+=== "TensorFlow"
+
+ - [TensorFlow repository](https://github.com/openml/openml-tensorflow)
+ - `pip install openml-tensorflow`
+
+=== "R"
+
+ - [R repository](https://github.com/openml/openml-R)
+ - `install.packages("mlr3oml")`
+=== "Julia"
+
+ - [Julia repository](https://github.com/JuliaAI/OpenML.jl/tree/master)
+ - `using Pkg;Pkg.add("OpenML")`
-### Sharing
-You can upload and download datasets through the website or though the [APIs](https://www.openml.org/apis) (recommended). You can share data directly from common data science libraries, e.g. from Python or R dataframes, in a few lines of code. The OpenML APIs will automatically extract lots of meta-data and store all datasets in a uniform format.
+=== "RUST"
+
+ - [RUST repository](https://github.com/mbillingr/openml-rust)
+ - Install from source
-``` python
- import pandas as pd
- import openml as oml
+=== ".Net"
+
+ - [.Net repository](https://github.com/openml/openml-dotnet)
+ - `Install-Package openMl`
- # Create an OpenML dataset from a pandas dataframe
- df = pd.DataFrame(data, columns=attribute_names)
- my_data = oml.datasets.functions.create_dataset(
- name="covertype", description="Predicting forest cover ...",
- licence="CC0", data=df
- )
- # Share the dataset on OpenML
- my_data.publish()
-```
+You might also need to set up your API key. For more information, see the [API key page](./apikey.md).
-Every dataset gets a dedicated page on OpenML with all known information, and can be edited further online.
+## Learning OpenML
-
+Aside from the documentation of the individual packages, you can learn more about OpenML through the following resources:
+The core concepts of OpenML are explained on the [Concepts](./concepts/index.md) page. These include the principles behind datasets, tasks, flows, runs, benchmarking, and much more. Going through them will help you leverage OpenML even better in your work.
-Data hosted elsewhere can be referenced by URL. We are also working on interconnecting OpenML with other machine learning data set repositories
+## Contributing to OpenML
-### Automated analysis
-OpenML will automatically analyze the data and compute a range of data quality characteristics. These include simple statistics such as the number of examples and features, but also potential quality issues (e.g. missing values) and more advanced statistics (e.g. the mutual information in the features and benchmark performances of simple models). These can be useful to find, filter and compare datasets, or to automate data preprocessing. We are also working on simple metrics and automated dataset quality reports
+OpenML is an open source project, hosted on GitHub. We welcome everybody to help improve OpenML, and make it more useful for everyone. For more information on how to contribute, see the [Contributing](./contributing/Contributing.md) page.
-The Analysis tab (see image below, or try it live) also shows an automated and interactive analysis of all datasets. This runs on open-source Python code via Dash and we welcome all contributions
+We want to make machine learning and data analysis **simple**, **accessible**, **collaborative** and **open** with an optimal **division of labour** between computers and humans.
-
+## Want to get involved?
-The third tab, 'Tasks', lists all tasks created on the dataset. More on that below.
+Awesome, we're happy to have you! :tada:
-### Dataset ID and versions
-A dataset can be uniquely identified by its dataset ID, which is shown on the website and returned by the API. It's `1596` in the `covertype` example above. They can also be referenced by name and ID. OpenML assigns incremental version numbers per upload with the same name. You can also add a free-form `version_label` with every upload.
-
-### Dataset status
-When you upload a dataset, it will be marked `in_preparation` until it is (automatically) verified. Once approved, the dataset will become `active` (or `verified`). If a severe issue has been found with a dataset, it can become `deactivated` (or `deprecated`) signaling that it should not be used. By default, dataset search only returns verified datasets, but you can access and download datasets with any status.
-
-### Special attributes
-Machine learning datasets often have special attributes that require special handling in order to build useful models. OpenML marks these as special attributes.
-
-A `target` attribute is the column that is to be predicted, also known as dependent variable. Datasets can have a default target attribute set by the author, but OpenML tasks can also overrule this. Example: The default target variable for the MNIST dataset is to predict the class from pixel values, and most supervised tasks will have the class as their target. However, one can also create a task aimed at predicting the value of pixel257 given all the other pixel values and the class column.
-
-`Row id` attributes indicate externally defined row IDs (e.g. `instance` in dataset 164). `Ignore` attributes are other columns that should not be included in training data (e.g. `Player` in dataset 185). OpenML will clearly mark these, and will (by default) drop these columns when constructing training sets.
-
-## Tasks
-Tasks describe what to do with the data. OpenML covers several task types, such as classification and clustering. Tasks are containers including the data and other information such as train/test splits, and define what needs to be returned. They are machine-readable so that you can automate machine learning experiments, and easily compare algorithms evaluations (using the exact same train-test splits) against all other benchmarks shared by others on OpenML.
-
-### Collaborative benchmarks
-
-Tasks are real-time, collaborative benchmarks (e.g. see
-MNIST below). In the Analysis tab, you can view timelines and leaderboards, and learn from all prior submissions to design even better algorithms.
-
-
-
-### Discover the best algorithms
-All algorithms evaluated on the same task (with the same train-test splits) can be directly compared to each other, so you can easily look up which algorithms perform best overall, and download their exact configurations. Likewise, you can look up the best algorithms for _similar_ tasks to know what to try first.
-
-
-
-### Automating benchmarks
-You can search and download existing tasks, evaluate your algorithms, and automatically share the results (which are stored in a _run_). Here's what this looks like in the Python API. You can do the same across hundreds of tasks at once.
-
-``` python
- from sklearn import ensemble
- from openml import tasks, runs
-
- # Build any model you like
- clf = ensemble.RandomForestClassifier()
-
- # Download any OpenML task (includes the datasets)
- task = tasks.get_task(3954)
-
- # Automatically evaluate your model on the task
- run = runs.run_model_on_task(clf, task)
-
- # Share the results on OpenML.
- run.publish()
-```
-
-You can create new tasks via the website or [via the APIs](https://www.openml.org/apis) as well.
-
-## Flows
-
-Flows are machine learning pipelines, models, or scripts. They are typically uploaded directly from machine learning libraries (e.g. scikit-learn, pyTorch, TensorFlow, MLR, WEKA,...) via the corresponding [APIs](https://www.openml.org/apis). Associated code (e.g., on GitHub) can be referenced by URL.
-
-### Analysing algorithm performance
-
-Every flow gets a dedicated page with all known information. The Analysis tab shows an automated interactive analysis of all collected results. For instance, below are the results of a scikit-learn pipeline including missing value imputation, feature encoding, and a RandomForest model. It shows the results across multiple tasks, and how the AUC score is affected by certain hyperparameters.
-
-
-
-This helps to better understand specific models, as well as their strengths and weaknesses.
-
-### Automated sharing
-
-When you evaluate algorithms and share the results, OpenML will automatically extract all the details of the algorithm (dependencies, structure, and all hyperparameters), and upload them in the background.
-
-``` python
- from sklearn import ensemble
- from openml import tasks, runs
-
- # Build any model you like.
- clf = ensemble.RandomForestClassifier()
-
-    # Download a task and evaluate the model on it
-    task = tasks.get_task(3954)
-    run = runs.run_model_on_task(clf, task)
-
- # Share the results, including the flow and all its details.
- run.publish()
-```
-
-### Reproducing algorithms and experiments
-
-Given an OpenML run, the exact same algorithm or model, with exactly the same hyperparameters, can be reconstructed within the same machine learning library to easily reproduce earlier results.
-
-``` python
- from openml import runs
-
- # Rebuild the (scikit-learn) pipeline from run 9864498
-    model = runs.initialize_model_from_run(9864498)
-```
-
-!!! note
-    You may need the exact same library version to reconstruct flows. The API will always state the required version. We aim to add support for VMs so that flows can be easily (re)run in any environment.
-
-## Runs
-
-### Automated reproducible evaluations
-Runs are experiments (benchmarks) evaluating a specific flow on a specific task. As shown above, they are typically submitted automatically by machine learning
-libraries through the OpenML [APIs](https://www.openml.org/apis), including lots of automatically extracted meta-data, to create reproducible experiments. With a few for-loops you can easily run (and share) millions of experiments.
-
-### Online organization
-OpenML organizes all runs online, linked to the underlying data, flows, parameter settings, people, and other details. See the many examples above, where every dot in the scatterplots is a single OpenML run.
-
-### Independent (server-side) evaluation
-OpenML runs include all information needed to independently evaluate models. For most tasks, this includes all predictions, for all train-test splits, for all instances in the dataset, including all class confidences. When a run is uploaded, OpenML automatically evaluates every run using a wide array of evaluation metrics. This makes them directly comparable with all other runs shared on OpenML. For completeness, OpenML will also upload locally computed evaluation metrics and runtimes.
-
-New metrics can also be added to OpenML's evaluation engine, and computed for all runs afterwards. Or, you can download OpenML runs and analyse the results any way you like.
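-
-As a sketch of what downloading results looks like in the Python API (the task id and metric below are illustrative):
-
-``` python
-    import openml
-
-    # Fetch the predictive accuracy of all runs on task 3954 as a dataframe
-    evals = openml.evaluations.list_evaluations(
-        function="predictive_accuracy",
-        tasks=[3954],
-        output_format="dataframe",
-    )
-    print(evals.head())
-```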
-
-!!! note
- Please note that while OpenML tries to maximise reproducibility, exactly reproducing all results may not always be possible because of changes in numeric libraries, operating systems, and hardware.
-
-
-## Collections and benchmarks
-You can combine tasks and runs into collections, to run experiments across many tasks at once and collect all results. Each collection gets its own page, which can be linked to publications so that others can find all the details online.
-
-### Benchmarking suites
-Collections of tasks can be published as _benchmarking suites_. Seamlessly integrated into the OpenML platform, benchmark suites standardize the setup, execution, analysis, and reporting of benchmarks. Moreover, they make benchmarking a whole lot easier:
-- all datasets are uniformly formatted in standardized data formats
-- they can be easily downloaded programmatically through APIs and client libraries
-- they come with machine-readable meta-information, such as the occurrence of missing values, to train algorithms correctly
-- standardized train-test splits are provided to ensure that results can be objectively compared
-- results can be shared in a reproducible way through the APIs
-- results from other users can be easily downloaded and reused
-
-You can search for all existing benchmarking suites or create your own. For all further details, see the [benchmarking guide](benchmark).
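-
-For example, a short sketch (Python API) of fetching a suite and listing its tasks; the OpenML-CC18 alias is used purely as an illustration:
-
-``` python
-    import openml
-
-    # Fetch the OpenML-CC18 benchmarking suite and list the tasks it contains
-    suite = openml.study.get_suite("OpenML-CC18")
-    task_list = openml.tasks.list_tasks(task_id=suite.tasks, output_format="dataframe")
-    print(task_list.head())
-```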
-
-
-
-### Benchmark studies
-Collections of runs can be published as _benchmarking studies_. They contain the results of all runs (possibly millions) executed on a specific benchmarking suite. OpenML allows you to easily download all such results at once via the APIs, and also visualizes them online in the Analysis tab (next to the complete list of included tasks and runs). Below is an example of a benchmark study for AutoML algorithms.
-
-
-
-## Tagging
-Datasets, tasks, runs and flows can be assigned tags, either via the web
-interface or the API. These tags can be used to search and annotate datasets, or simply to better organize your own datasets and experiments.
-
-For example, the tag OpenML-CC18 refers to all tasks included in the OpenML-CC18 benchmarking suite.
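-
-A small sketch of filtering by tag in the Python API:
-
-``` python
-    import openml
-
-    # List all tasks tagged with OpenML-CC18
-    cc18_tasks = openml.tasks.list_tasks(tag="OpenML-CC18", output_format="dataframe")
-    print(len(cc18_tasks))
-```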
-
-## Openness and Authentication
-You can download and inspect all datasets, tasks, flows and runs through the
-website or the API without creating an account. However, if you want to upload
-datasets or experiments, you need to create an account, sign in, and find your API key on your profile page.
-
-This key can then be used with any of the [OpenML APIs](APIs).
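-
-In the Python API, for example, this is a one-liner (replace the placeholder with your own key):
-
-``` python
-    import openml
-
-    openml.config.apikey = "YOUR_API_KEY"
-```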
-
-
-## Sharing (under construction)
-Currently, anything on OpenML can be shared publicly or kept private to a single user. We are working on sharing features that allow you to share your materials with other users without making them entirely public. Watch this space.
+OpenML depends on the community. If you want to help, please email us (openmlHQ@googlegroups.com). If you already feel comfortable, you can help by opening issues or making a pull request on GitHub. We also have regular workshops you can join (announced on openml.org).
diff --git a/docs/Java-guide.md b/docs/integrations/Java.md
similarity index 99%
rename from docs/Java-guide.md
rename to docs/integrations/Java.md
index b3e8d039..8dbbd13d 100644
--- a/docs/Java-guide.md
+++ b/docs/integrations/Java.md
@@ -241,4 +241,4 @@ Uploads a run to OpenML, including a description and a set of output files depen
outputs.add("predictions",new File("predictions.arff"));
UploadRun response = client.runUpload( run, outputs);
int run_id = response.getRun_id();
-```
+```
\ No newline at end of file
diff --git a/docs/integrations/Julia.md b/docs/integrations/Julia.md
new file mode 100644
index 00000000..3f012f83
--- /dev/null
+++ b/docs/integrations/Julia.md
@@ -0,0 +1,82 @@
+# OpenML.jl (Julia) Documentation
+
+This is the reference documentation of
+[`OpenML.jl`](https://github.com/JuliaAI/OpenML.jl).
+
+The [OpenML platform](https://www.openml.org) provides an integration
+platform for carrying out and comparing machine learning solutions
+across a broad collection of public datasets and software platforms.
+
+Summary of OpenML.jl functionality:
+
+- [`OpenML.list_tags`](@ref)`()`: for listing all dataset tags
+
+- [`OpenML.list_datasets`](@ref)`(; tag=nothing, filter=nothing, output_format=...)`: for listing available datasets
+
+- [`OpenML.describe_dataset`](@ref)`(id)`: to describe a particular dataset
+
+- [`OpenML.load`](@ref)`(id; parser=:arff)`: to download a dataset
+
+
+## Installation
+
+```julia
+using Pkg
+Pkg.add("OpenML")
+```
+
+If running the demonstration below:
+
+```julia
+Pkg.add("DataFrames")
+Pkg.add("ScientificTypes")
+```
+
+## Sample usage
+
+```julia
+using OpenML # or using MLJ
+using DataFrames
+
+OpenML.list_tags()
+```
+
+Listing all datasets with the "OpenML100" tag which also have `n`
+instances and `p` features, where `100 < n < 1000` and `1 < p < 10`:
+
+```julia
+ds = OpenML.list_datasets(
+ tag = "OpenML100",
+ filter = "number_instances/100..1000/number_features/1..10",
+ output_format = DataFrame)
+```
+
+Describing and loading one of these datasets:
+
+```julia
+OpenML.describe_dataset(15)
+table = OpenML.load(15)
+```
+
+Converting to a data frame:
+
+```julia
+df = DataFrame(table)
+```
+
+Inspecting its schema:
+
+```julia
+using ScientificTypes
+schema(table)
+```
+
+## Public API
+
+```@docs
+OpenML.list_tags
+OpenML.list_datasets
+OpenML.describe_dataset
+OpenML.load
+```
+
diff --git a/docs/integrations/Keras.md b/docs/integrations/Keras.md
new file mode 100644
index 00000000..e69de29b
diff --git a/docs/MOA.md b/docs/integrations/MOA.md
similarity index 100%
rename from docs/MOA.md
rename to docs/integrations/MOA.md
diff --git a/docs/integrations/Pytorch/basic_tutorial.ipynb b/docs/integrations/Pytorch/basic_tutorial.ipynb
new file mode 100644
index 00000000..97115fcf
--- /dev/null
+++ b/docs/integrations/Pytorch/basic_tutorial.ipynb
@@ -0,0 +1,346 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " \n",
+ ""
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/markdown": [
+ "[](https://mybinder.org/v2/gh/SubhadityaMukherjee/openml_docs/HEAD?labpath=Scikit-learn%2Fdatasets_tutorial)"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from IPython.display import display, HTML, Markdown\n",
+ "import os\n",
+ "import yaml\n",
+ "with open(\"../../../mkdocs.yml\", \"r\") as f:\n",
+ " load_config = yaml.safe_load(f)\n",
+ "repo_url = load_config[\"repo_url\"].replace(\"https://github.com/\", \"\")\n",
+ "binder_url = load_config[\"binder_url\"]\n",
+ "relative_file_path = \"integrations/Pytorch/basic_tutorial.ipynb\"\n",
+ "display(HTML(f\"\"\"\n",
+ " \n",
+ "\"\"\"))\n",
+ "display(Markdown(\"[](https://mybinder.org/v2/gh/SubhadityaMukherjee/openml_docs/HEAD?labpath=Scikit-learn%2Fdatasets_tutorial)\"))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Requirement already satisfied: openml-pytorch in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages/openml_pytorch-0.0.5-py3.9.egg (0.0.5)\n",
+ "Requirement already satisfied: openml in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml-pytorch) (0.13.1)\n",
+ "Requirement already satisfied: torch<2.2.0,>=1.4.0 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml-pytorch) (2.1.2)\n",
+ "Requirement already satisfied: onnx in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml-pytorch) (1.16.0)\n",
+ "Requirement already satisfied: torchvision in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml-pytorch) (0.16.2)\n",
+ "Requirement already satisfied: typing-extensions in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from torch<2.2.0,>=1.4.0->openml-pytorch) (4.11.0)\n",
+ "Requirement already satisfied: sympy in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from torch<2.2.0,>=1.4.0->openml-pytorch) (1.12)\n",
+ "Requirement already satisfied: filelock in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from torch<2.2.0,>=1.4.0->openml-pytorch) (3.12.0)\n",
+ "Requirement already satisfied: jinja2 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from torch<2.2.0,>=1.4.0->openml-pytorch) (3.1.3)\n",
+ "Requirement already satisfied: networkx in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from torch<2.2.0,>=1.4.0->openml-pytorch) (3.2.1)\n",
+ "Requirement already satisfied: fsspec in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from torch<2.2.0,>=1.4.0->openml-pytorch) (2023.6.0)\n",
+ "Requirement already satisfied: protobuf>=3.20.2 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from onnx->openml-pytorch) (5.26.1)\n",
+ "Requirement already satisfied: numpy>=1.20 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from onnx->openml-pytorch) (1.24.2)\n",
+ "Requirement already satisfied: liac-arff>=2.4.0 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml->openml-pytorch) (2.5.0)\n",
+ "Requirement already satisfied: xmltodict in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml->openml-pytorch) (0.13.0)\n",
+ "Requirement already satisfied: requests in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml->openml-pytorch) (2.28.2)\n",
+ "Requirement already satisfied: scikit-learn>=0.18 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml->openml-pytorch) (1.2.2)\n",
+ "Requirement already satisfied: python-dateutil in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml->openml-pytorch) (2.8.2)\n",
+ "Requirement already satisfied: pandas>=1.0.0 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml->openml-pytorch) (1.5.3)\n",
+ "Requirement already satisfied: scipy>=0.13.3 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml->openml-pytorch) (1.10.1)\n",
+ "Requirement already satisfied: minio in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml->openml-pytorch) (7.1.13)\n",
+ "Requirement already satisfied: pyarrow in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml->openml-pytorch) (11.0.0)\n",
+ "Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from torchvision->openml-pytorch) (10.3.0)\n",
+ "Requirement already satisfied: pytz>=2020.1 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from pandas>=1.0.0->openml->openml-pytorch) (2022.7.1)\n",
+ "Requirement already satisfied: six>=1.5 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from python-dateutil->openml->openml-pytorch) (1.16.0)\n",
+ "Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from scikit-learn>=0.18->openml->openml-pytorch) (3.1.0)\n",
+ "Requirement already satisfied: joblib>=1.1.1 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from scikit-learn>=0.18->openml->openml-pytorch) (1.2.0)\n",
+ "Requirement already satisfied: MarkupSafe>=2.0 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from jinja2->torch<2.2.0,>=1.4.0->openml-pytorch) (2.1.5)\n",
+ "Requirement already satisfied: urllib3 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from minio->openml->openml-pytorch) (1.26.15)\n",
+ "Requirement already satisfied: certifi in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from minio->openml->openml-pytorch) (2022.12.7)\n",
+ "Requirement already satisfied: idna<4,>=2.5 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from requests->openml->openml-pytorch) (3.4)\n",
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from requests->openml->openml-pytorch) (3.1.0)\n",
+ "Requirement already satisfied: mpmath>=0.19 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from sympy->torch<2.2.0,>=1.4.0->openml-pytorch) (1.3.0)\n",
+ "\n",
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n",
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
+ ]
+ }
+ ],
+ "source": [
+ "!pip install openml-pytorch"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# PyTorch sequential classification model example\n",
+ "An example of a sequential network that classifies digit images used as an OpenML flow.\n",
+    "We use sub-networks here in order to show that network hierarchies can be achieved with ease."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torch.optim\n",
+ "\n",
+ "import openml\n",
+ "import openml_pytorch\n",
+ "import openml_pytorch.layers\n",
+ "import openml_pytorch.config\n",
+ "import logging\n",
+ "import torch.nn as nn\n",
+ "import torch.nn.functional as F\n",
+ "\n",
+ "from openml import OpenMLTask\n",
+ "import torchvision.models as models"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Enable logging in order to observe the progress while running the example.\n",
+ "openml.config.logger.setLevel(logging.DEBUG)\n",
+ "openml_pytorch.config.logger.setLevel(logging.DEBUG)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.\n",
+ " warnings.warn(\n",
+ "/Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.\n",
+ " warnings.warn(msg)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Load the pre-trained ResNet model\n",
+ "model = models.resnet18(pretrained=True, progress=True)\n",
+ "\n",
+ "# Modify the last fully connected layer to the required number of classes\n",
+ "num_classes = 20 # For the dataset we are using\n",
+ "in_features = model.fc.in_features\n",
+ "model.fc = nn.Linear(in_features, num_classes)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\n",
+ "# Optional: If you're fine-tuning, you may want to freeze the pre-trained layers\n",
+ "for param in model.parameters():\n",
+ " param.requires_grad = False\n",
+ "\n",
+ "# If you want to train the last layer only (the newly added layer)\n",
+ "for param in model.fc.parameters():\n",
+ " param.requires_grad = True"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Setting an appropriate optimizer \n",
+ "\n",
+ "def custom_optimizer_gen(model: torch.nn.Module, task: OpenMLTask) -> torch.optim.Optimizer:\n",
+ " return torch.optim.Adam(model.fc.parameters())\n",
+ "\n",
+ "openml_pytorch.config.optimizer_gen = custom_optimizer_gen"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/Users/eragon/Documents/CODE/Github/openml-pytorch/openml_pytorch/extension.py:154: SettingWithCopyWarning: \n",
+ "A value is trying to be set on a copy of a slice from a DataFrame.\n",
+ "Try using .loc[row_indexer,col_indexer] = value instead\n",
+ "\n",
+ "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
+      " df.loc[:, 'encoded_labels'] = label_encoder.transform(y)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Download the OpenML task for the Meta_Album_PNU_Micro dataset.\n",
+ "task = openml.tasks.get_task(361152)\n",
+ "\n",
+ "############################################################################\n",
+    "# Run the model on the task (requires an API key).\n",
+ "run = openml.runs.run_model_on_task(model, task, avoid_duplicate_runs=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import openml"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "OpenML Flow\n",
+ "===========\n",
+ "Flow Name.......: torch.nn.ResNet.73f8a33b44a6743\n",
+ "Flow Description: Automatically created pytorch flow.\n",
+ "Dependencies....: torch==2.1.2\n",
+ "numpy>=1.6.1\n",
+ "scipy>=0.9\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(run.flow)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\n",
+ "# Publish the experiment on OpenML (optional, requires an API key).\n",
+ "run.publish()\n",
+ "\n",
+ "print('URL for run: %s/run/%d' % (openml.config.server, run.run_id))\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "openml",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.19"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/docs/integrations/Pytorch/index.md b/docs/integrations/Pytorch/index.md
new file mode 100644
index 00000000..e69de29b
diff --git a/docs/integrations/Rest.md b/docs/integrations/Rest.md
new file mode 100644
index 00000000..d4a63166
--- /dev/null
+++ b/docs/integrations/Rest.md
@@ -0,0 +1,61 @@
+# REST API
+
+OpenML offers a RESTful Web API, with predictive URLs, for uploading and downloading machine learning resources. Try the API Documentation to see examples of all calls, and test them right in your browser.
+
+## Getting started
+
+* REST services can be called using simple HTTP GET or POST actions.
+* The REST Endpoint URL is https://www.openml.org/api/v1/
+* The default endpoint returns data in XML. If you prefer JSON, use the endpoint https://www.openml.org/api/v1/json/. Note that, to upload content, you still need to use XML (at least for now).
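+
+For example, a minimal sketch of calling the JSON endpoint from Python (dataset id 61 and the `data_set_description` key are illustrative assumptions about the response layout):
+
+```python
+import requests
+
+# Fetch the description of dataset 61 from the JSON endpoint
+response = requests.get("https://www.openml.org/api/v1/json/data/61")
+response.raise_for_status()
+print(response.json()["data_set_description"]["name"])
+```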
+
+## Testing
+For continuous integration and testing purposes, we have a test server that offers the same API but does not affect the production server.
+
+* The test server REST Endpoint URL is https://test.openml.org/api/v1/
+
+## Error messages
+Error messages will look like this:
+
+```xml
+<oml:error>
+  <oml:code>100</oml:code>
+  <oml:message>Please invoke legal function</oml:message>
+  <oml:additional_information>Additional information, not always available.</oml:additional_information>
+</oml:error>
+```
+
+All error messages are listed in the API documentation. E.g. try to get a non-existing dataset:
+
+* in XML: https://www.openml.org/api_new/v1/data/99999
+* in JSON: https://www.openml.org/api_new/v1/json/data/99999
+
+## Examples
+You need to be logged in for these examples to work.
+
+### Download a dataset
+
+
+* User asks for a dataset using the /data/{id} service. The dataset id is typically part of a task, or can be found on OpenML.org.
+* OpenML returns a description of the dataset as an XML file (or JSON). Try it now
+* The dataset description contains the URL where the dataset can be downloaded. The user calls that URL to download the dataset.
+* The dataset is returned by the server hosting the dataset. This can be OpenML, but also any other data repository. Try it now
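+
+A small Python sketch of these steps (dataset id 61 is illustrative; the `url` field is assumed to be part of the returned description):
+
+```python
+import requests
+
+# Step 1: ask OpenML for the dataset description (JSON)
+description = requests.get("https://www.openml.org/api/v1/json/data/61").json()["data_set_description"]
+
+# Step 2: download the data file from the URL given in the description
+data = requests.get(description["url"])
+with open("dataset_61.arff", "wb") as f:
+    f.write(data.content)
+```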
+
+### Download a flow
+
+
+* User asks for a flow using the /flow/{id} service and a flow id. The flow id can be found on OpenML.org.
+* OpenML returns a description of the flow as an XML file (or JSON). Try it now
+* The flow description contains the URL where the flow can be downloaded (e.g. GitHub), either as source, binary or both, as well as additional information on history, dependencies and licence. The user calls the right URL to download it.
+* The flow is returned by the server hosting it. This can be OpenML, but also any other code repository. Try it now
+
+### Download a task
+
+
+* User asks for a task using the /task/{id} service and a task id. The task id is typically returned when searching for tasks.
+* OpenML returns a description of the task as an XML file (or JSON). Try it now
+* The task description contains the dataset id(s) of the datasets involved in this task. The user asks for the dataset using the /data/{id} service and the dataset id.
+* OpenML returns a description of the dataset as an XML file (or JSON). Try it now
+* The dataset description contains the URL where the dataset can be downloaded. The user calls that URL to download the dataset.
+* The dataset is returned by the server hosting it. This can be OpenML, but also any other data repository. Try it now
+* The task description may also contain links to other resources, such as the train-test splits to be used in cross-validation. The user calls that URL to download the train-test splits.
+* The train-test splits are returned by OpenML. Try it now
\ No newline at end of file
diff --git a/docs/integrations/Scikit-learn/basic_tutorial.ipynb b/docs/integrations/Scikit-learn/basic_tutorial.ipynb
new file mode 100644
index 00000000..8e1782d2
--- /dev/null
+++ b/docs/integrations/Scikit-learn/basic_tutorial.ipynb
@@ -0,0 +1,180 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " \n",
+ ""
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/markdown": [
+ "[](https://mybinder.org/v2/gh/SubhadityaMukherjee/openml_docs/HEAD?labpath=Scikit-learn%2Fdatasets_tutorial)"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from IPython.display import display, HTML, Markdown\n",
+ "import os\n",
+ "import yaml\n",
+ "with open(\"../../../mkdocs.yml\", \"r\") as f:\n",
+ " load_config = yaml.safe_load(f)\n",
+ "repo_url = load_config[\"repo_url\"].replace(\"https://github.com/\", \"\")\n",
+ "binder_url = load_config[\"binder_url\"]\n",
+ "relative_file_path = \"integrations/Scikit-learn/basic_tutorial.ipynb\"\n",
+ "display(HTML(f\"\"\"\n",
+ " \n",
+ "\"\"\"))\n",
+ "display(Markdown(\"[](https://mybinder.org/v2/gh/SubhadityaMukherjee/openml_docs/HEAD?labpath=Scikit-learn%2Fdatasets_tutorial)\"))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install openml"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import openml\n",
+ "from sklearn import impute, tree, pipeline"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages/openml/config.py:184: UserWarning: Switching to the test server https://test.openml.org/api/v1/xml to not upload results to the live server. Using the test server may result in reduced performance of the API!\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "openml.config.start_using_configuration_for_example()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\n",
+ "# Define a scikit-learn classifier or pipeline\n",
+ "clf = pipeline.Pipeline(\n",
+ " steps=[\n",
+ " ('imputer', impute.SimpleImputer()),\n",
+ " ('estimator', tree.DecisionTreeClassifier())\n",
+ " ]\n",
+ ")\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "OpenML Classification Task\n",
+ "==========================\n",
+ "Task Type Description: https://test.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION\n",
+ "Task ID..............: 32\n",
+ "Task URL.............: https://test.openml.org/t/32\n",
+ "Estimation Procedure.: crossvalidation\n",
+ "Target Feature.......: class\n",
+ "# of Classes.........: 10\n",
+ "Cost Matrix..........: Available"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\n",
+ "# Download the OpenML task for the pendigits dataset with 10-fold\n",
+ "# cross-validation.\n",
+ "task = openml.tasks.get_task(32)\n",
+ "task"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Run the scikit-learn model on the task.\n",
+ "run = openml.runs.run_model_on_task(clf, task)\n",
+ "# Publish the experiment on OpenML (optional, requires an API key.\n",
+ "# You can get your own API key by signing up to OpenML.org)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\n",
+ "run.publish()\n",
+ "print(f'View the run online: {run.openml_url}')"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "openml",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.19"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/docs/integrations/Scikit-learn/datasets_tutorial.ipynb b/docs/integrations/Scikit-learn/datasets_tutorial.ipynb
new file mode 100644
index 00000000..153d9d92
--- /dev/null
+++ b/docs/integrations/Scikit-learn/datasets_tutorial.ipynb
@@ -0,0 +1,1402 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " \n",
+ ""
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/markdown": [
+ "[](https://mybinder.org/v2/gh/SubhadityaMukherjee/openml_docs/HEAD?labpath=Scikit-learn%2Fdatasets_tutorial)"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from IPython.display import display, HTML, Markdown\n",
+ "import os\n",
+ "import yaml\n",
+ "with open(\"../../../mkdocs.yml\", \"r\") as f:\n",
+ " load_config = yaml.safe_load(f)\n",
+ "repo_url = load_config[\"repo_url\"].replace(\"https://github.com/\", \"\")\n",
+ "binder_url = load_config[\"binder_url\"]\n",
+ "relative_file_path = \"integrations/Scikit-learn/datasets_tutorial.ipynb\"\n",
+ "display(HTML(f\"\"\"\n",
+ " \n",
+ "\"\"\"))\n",
+ "display(Markdown(\"[](https://mybinder.org/v2/gh/SubhadityaMukherjee/openml_docs/HEAD?labpath=Scikit-learn%2Fdatasets_tutorial)\"))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Requirement already satisfied: openml in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (0.14.2)\n",
+ "Requirement already satisfied: scikit-learn>=0.18 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml) (1.4.2)\n",
+ "Requirement already satisfied: requests in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml) (2.31.0)\n",
+ "Requirement already satisfied: liac-arff>=2.4.0 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml) (2.5.0)\n",
+ "Requirement already satisfied: numpy>=1.6.2 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml) (1.26.4)\n",
+ "Requirement already satisfied: minio in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml) (7.2.7)\n",
+ "Requirement already satisfied: pandas>=1.0.0 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml) (2.2.2)\n",
+ "Requirement already satisfied: scipy>=0.13.3 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml) (1.13.0)\n",
+ "Requirement already satisfied: pyarrow in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml) (16.0.0)\n",
+ "Requirement already satisfied: xmltodict in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml) (0.13.0)\n",
+ "Requirement already satisfied: python-dateutil in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from openml) (2.9.0.post0)\n",
+ "Requirement already satisfied: tzdata>=2022.7 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from pandas>=1.0.0->openml) (2024.1)\n",
+ "Requirement already satisfied: pytz>=2020.1 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from pandas>=1.0.0->openml) (2024.1)\n",
+ "Requirement already satisfied: six>=1.5 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from python-dateutil->openml) (1.16.0)\n",
+ "Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from scikit-learn>=0.18->openml) (3.5.0)\n",
+ "Requirement already satisfied: joblib>=1.2.0 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from scikit-learn>=0.18->openml) (1.4.0)\n",
+ "Requirement already satisfied: urllib3 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from minio->openml) (2.2.1)\n",
+ "Requirement already satisfied: typing-extensions in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from minio->openml) (4.11.0)\n",
+ "Requirement already satisfied: pycryptodome in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from minio->openml) (3.20.0)\n",
+ "Requirement already satisfied: certifi in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from minio->openml) (2024.2.2)\n",
+ "Requirement already satisfied: argon2-cffi in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from minio->openml) (23.1.0)\n",
+ "Requirement already satisfied: idna<4,>=2.5 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from requests->openml) (3.7)\n",
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from requests->openml) (3.3.2)\n",
+ "Requirement already satisfied: argon2-cffi-bindings in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from argon2-cffi->minio->openml) (21.2.0)\n",
+ "Requirement already satisfied: cffi>=1.0.1 in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from argon2-cffi-bindings->argon2-cffi->minio->openml) (1.16.0)\n",
+ "Requirement already satisfied: pycparser in /Users/eragon/.pyenv/versions/3.9.19/envs/openml/lib/python3.9/site-packages (from cffi>=1.0.1->argon2-cffi-bindings->argon2-cffi->minio->openml) (2.22)\n",
+ "\n",
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n",
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
+ ]
+ }
+ ],
+ "source": [
+ "!pip install openml"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# Datasets\n",
+ "\n",
+ "How to list and download datasets.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+    "# License: BSD 3-Clause\n",
+ "\n",
+ "import openml\n",
+ "import pandas as pd\n",
+ "from openml.datasets import edit_dataset, fork_dataset, get_dataset"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Exercise 0\n",
+ "\n",
+ "* List datasets\n",
+ "\n",
+ " * Use the output_format parameter to select output type\n",
+ " * Default gives 'dict' (other option: 'dataframe', see below)\n",
+ "\n",
+ "Note: list_datasets will return a pandas dataframe by default from 0.15. When using\n",
+ "openml-python 0.14, `list_datasets` will warn you to use output_format='dataframe'.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "First 10 of 5466 datasets...\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+       "[HTML preview of the first 10 datasets omitted]\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "openml.config.start_using_configuration_for_example()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Edit non-critical fields, allowed for all authorized users:\n",
+ "description, creator, contributor, collection_date, language, citation,\n",
+ "original_data_url, paper_url\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "desc = (\n",
+ " \"This data sets consists of 3 different types of irises' \"\n",
+ " \"(Setosa, Versicolour, and Virginica) petal and sepal length,\"\n",
+ " \" stored in a 150x4 numpy.ndarray\"\n",
+ ")\n",
+ "did = 128\n",
+ "data_id = edit_dataset(\n",
+ " did,\n",
+ " description=desc,\n",
+ " creator=\"R.A.Fisher\",\n",
+ " collection_date=\"1937\",\n",
+ " citation=\"The use of multiple measurements in taxonomic problems\",\n",
+ " language=\"English\",\n",
+ ")\n",
+ "edited_dataset = get_dataset(data_id)\n",
+ "print(f\"Edited dataset ID: {data_id}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Editing critical fields (default_target_attribute, row_id_attribute, ignore_attribute) is allowed\n",
+ "only for the dataset owner. Further, critical fields cannot be edited if the dataset has any\n",
+ "tasks associated with it. To edit critical fields of a dataset (without tasks) owned by you,\n",
+ "configure the API key:\n",
+ "openml.config.apikey = 'FILL_IN_OPENML_API_KEY'\n",
+ "This example here only shows a failure when trying to work on a dataset not owned by you:\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "try:\n",
+ " data_id = edit_dataset(1, default_target_attribute=\"shape\")\n",
+ "except openml.exceptions.OpenMLServerException as e:\n",
+ " print(e)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Fork dataset\n",
+ "Used to create a copy of the dataset with you as the owner.\n",
+ "Use this API only if you are unable to edit the critical fields (default_target_attribute,\n",
+ "ignore_attribute, row_id_attribute) of a dataset through the edit_dataset API.\n",
+ "After the dataset is forked, you can edit the new version of the dataset using edit_dataset.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "data_id = fork_dataset(1)\n",
+ "print(data_id)\n",
+ "data_id = edit_dataset(data_id, default_target_attribute=\"shape\")\n",
+ "print(f\"Forked dataset ID: {data_id}\")\n",
+ "\n",
+ "openml.config.stop_using_configuration_for_example()"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.19"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/docs/integrations/Scikit-learn/index.md b/docs/integrations/Scikit-learn/index.md
new file mode 100644
index 00000000..19819202
--- /dev/null
+++ b/docs/integrations/Scikit-learn/index.md
@@ -0,0 +1,68 @@
+# scikit-learn
+
+OpenML is readily integrated with scikit-learn through the [Python API](https://openml.github.io/openml-python/main/api.html).
+This page provides a brief overview of the key features and installation instructions. For more detailed API documentation, please refer to the [official documentation](https://openml.github.io/openml-python/main/api.html).
+
+## Key features:
+
+- Query and download OpenML datasets and use them however you like
+- Build any sklearn estimator or pipeline and convert to OpenML flows
+- Run any flow on any task and save the experiment as run objects
+- Upload your runs for collaboration or publishing
+- Query, download and reuse all shared runs
+
+## Installation
+
+```bash
+pip install openml
+```
+
+## Query and download data
+```python
+import openml
+
+# List all datasets and their properties
+openml.datasets.list_datasets(output_format="dataframe")
+
+# Get dataset by ID
+dataset = openml.datasets.get_dataset(61)
+
+# Get dataset by name
+dataset = openml.datasets.get_dataset('Fashion-MNIST')
+
+# Get the data itself as a dataframe (or otherwise)
+X, y, _, _ = dataset.get_data(dataset_format="dataframe")
+```
+
+## Download tasks, run models locally, publish results (with scikit-learn)
+```python
+from sklearn import ensemble
+from openml import tasks, runs
+
+# Build any model you like
+clf = ensemble.RandomForestClassifier()
+
+# Download any OpenML task
+task = tasks.get_task(3954)
+
+# Run and evaluate your model on the task
+run = runs.run_model_on_task(clf, task)
+
+# Share the results on OpenML. Your API key can be found in your account.
+# openml.config.apikey = 'YOUR_KEY'
+run.publish()
+```
+
+## OpenML Benchmarks
+```python
+import openml
+from openml import tasks
+
+# List all tasks in a benchmark
+benchmark = openml.study.get_suite('OpenML-CC18')
+tasks.list_tasks(output_format="dataframe", task_id=benchmark.tasks)
+
+# Return benchmark results
+openml.evaluations.list_evaluations(
+ function="area_under_roc_curve",
+ tasks=benchmark.tasks,
+ output_format="dataframe"
+)
+```
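+
+Putting these pieces together, a sketch of evaluating a model on every task of a suite (expect long runtimes on real suites, and note that publishing requires an API key):
+
+```python
+import openml
+from sklearn import ensemble
+
+suite = openml.study.get_suite('OpenML-CC18')
+for task_id in suite.tasks:
+    task = openml.tasks.get_task(task_id)
+    clf = ensemble.RandomForestClassifier()
+    run = openml.runs.run_model_on_task(clf, task)
+    run.publish()
+```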
diff --git a/docs/integrations/Tensorflow.md b/docs/integrations/Tensorflow.md
new file mode 100644
index 00000000..e69de29b
diff --git a/docs/Weka.md b/docs/integrations/Weka.md
similarity index 100%
rename from docs/Weka.md
rename to docs/integrations/Weka.md
diff --git a/docs/integrations/apikey.md b/docs/integrations/apikey.md
new file mode 100644
index 00000000..225a673d
--- /dev/null
+++ b/docs/integrations/apikey.md
@@ -0,0 +1,19 @@
+# Authentication
+
+The OpenML server can only be accessed by users who have signed up on the
+OpenML platform. If you don’t have an account yet, sign up now.
+You will receive an API key, which will authenticate you to the server
+and allow you to download and upload datasets, tasks, runs and flows.
+
+* Create an OpenML account (free) on https://www.openml.org.
+* After logging in, open your account page (avatar on the top right)
+* Open 'Account Settings', then 'API authentication' to find your API key.
+
+There are two ways to permanently authenticate:
+
+* Use the ``openml`` CLI tool with ``openml configure apikey MYKEY``,
+ replacing **MYKEY** with your API key.
+* Create a plain text file **~/.openml/config** with the line
+ **'apikey=MYKEY'**, replacing **MYKEY** with your API key. The config
+ file must be in the directory ~/.openml/config and exist prior to
+ importing the openml module.
\ No newline at end of file
diff --git a/docs/integrations/creating_extensions.md b/docs/integrations/creating_extensions.md
new file mode 100644
index 00000000..82fcf822
--- /dev/null
+++ b/docs/integrations/creating_extensions.md
@@ -0,0 +1,169 @@
+# Creating an Extension
+
+OpenML-Python provides an extension interface to connect other machine
+learning libraries than scikit-learn to OpenML. Please check the
+`api_extensions`{.interpreted-text role="ref"} and use the scikit-learn
+extension in
+`openml.extensions.sklearn.SklearnExtension`{.interpreted-text
+role="class"} as a starting point.
+
+## Connecting new machine learning libraries
+
+### Content of the Library
+
+To leverage support from the community and to tap into the potential of
+OpenML, interfacing with popular machine learning libraries is
+essential. The OpenML-Python package is capable of downloading meta-data
+and results (data, flows, runs), regardless of the library that was used
+to upload it. However, in order to simplify the process of uploading
+flows and runs from a specific library, an additional interface can be
+built. The OpenML-Python team does not have the capacity to develop and
+maintain such interfaces on its own. For this reason, we have built an
+extension interface that allows others to contribute back. Building a
+suitable extension therefore requires an understanding of the
+current OpenML-Python support.
+
+The
+`sphx_glr_examples_20_basic_simple_flows_and_runs_tutorial.py`{.interpreted-text
+role="ref"} tutorial shows how scikit-learn currently works with
+OpenML-Python as an extension. The *sklearn* extension packaged with the
+[openml-python](https://github.com/openml/openml-python) repository can
+be used as a template/benchmark to build the new extension.
+
+#### API
+
+- The extension scripts must import the [openml]{.title-ref} package
+ and be able to interface with any function from the OpenML-Python
+ `api`{.interpreted-text role="ref"}.
+- The extension has to be defined as a Python class and must inherit
+ from `openml.extensions.Extension`{.interpreted-text role="class"}.
+- This class needs to have all the functions from [class
+ Extension]{.title-ref} overloaded as required.
+- The redefined functions should have adequate and appropriate
+  docstrings. The `openml.extensions.sklearn.SklearnExtension` API
+  is a good example to follow.
+
+#### Interfacing with OpenML-Python
+
+Once the new extension class has been defined,
+`openml.extensions.register_extension`{.interpreted-text role="meth"}
+must be called to allow OpenML-Python to interface with the new extension,
+as sketched below.
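+
+A minimal, illustrative skeleton (the library name `mylibrary` and the dependency check are assumptions; the methods that still need to be implemented are listed below):
+
+```python
+from openml.extensions import Extension, register_extension
+
+
+class MyLibraryExtension(Extension):
+    """Skeleton of an extension for a hypothetical 'mylibrary' package."""
+
+    @classmethod
+    def can_handle_flow(cls, flow) -> bool:
+        # Check the flow's dependencies field for our (hypothetical) library.
+        return "mylibrary" in (flow.dependencies or "")
+
+    @classmethod
+    def can_handle_model(cls, model) -> bool:
+        # Check whether the model object comes from our library.
+        return type(model).__module__.startswith("mylibrary")
+
+    # ... flow_to_model, model_to_flow, _run_model_on_fold, and the other
+    # Extension methods described below still need to be implemented.
+
+
+# Make OpenML-Python aware of the new extension.
+register_extension(MyLibraryExtension)
+```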
+
+The following methods should be implemented. Although the documentation
+in the [Extension]{.title-ref} interface should always be leading, here
+we list some additional information and best practices. The
+`openml.extensions.sklearn.SklearnExtension` API is
+a good example to follow. Note that most methods are relatively simple
+and can be implemented in several lines of code.
+
+- General setup (required)
+ - `can_handle_flow`{.interpreted-text role="meth"}: Takes as
+ argument an OpenML flow, and checks whether this can be handled
+ by the current extension. The OpenML database consists of many
+ flows, from various workbenches (e.g., scikit-learn, Weka, mlr).
+ This method is called before a model is being deserialized.
+ Typically, the flow-dependency field is used to check whether
+ the specific library is present, and no unknown libraries are
+ present there.
+    - `can_handle_model`{.interpreted-text role="meth"}: Similar to
+ `can_handle_flow`{.interpreted-text role="meth"}, except that in
+ this case a Python object is given. As such, in many cases, this
+ method can be implemented by checking whether this adheres to a
+ certain base class.
+- Serialization and De-serialization (required)
+ - `flow_to_model`{.interpreted-text role="meth"}: deserializes the
+ OpenML Flow into a model (if the library can indeed handle the
+ flow). This method has an important interplay with
+ `model_to_flow`{.interpreted-text role="meth"}. Running these
+ two methods in succession should result in exactly the same
+ model (or flow). This property can be used for unit testing
+ (e.g., build a model with hyperparameters, make predictions on a
+ task, serialize it to a flow, deserialize it back, make it
+ predict on the same task, and check whether the predictions are
+ exactly the same.) The example in the scikit-learn interface
+ might seem daunting, but note that here some complicated design
+ choices were made, that allow for all sorts of interesting
+ research questions. It is probably good practice to start easy.
+ - `model_to_flow`{.interpreted-text role="meth"}: The inverse of
+ `flow_to_model`{.interpreted-text role="meth"}. Serializes a
+ model into an OpenML Flow. The flow should preserve the class,
+ the library version, and the tunable hyperparameters.
+ - `get_version_information`{.interpreted-text role="meth"}: Return
+ a tuple with the version information of the important libraries.
+ - `create_setup_string`{.interpreted-text role="meth"}: No longer
+ used, and will be deprecated soon.
+- Performing runs (required)
+ - `is_estimator`{.interpreted-text role="meth"}: Gets as input a
+ class, and checks whether it has the status of estimator in the
+ library (typically, whether it has a train method and a predict
+ method).
+ - `seed_model`{.interpreted-text role="meth"}: Sets a random seed
+ to the model.
+ - `_run_model_on_fold`{.interpreted-text role="meth"}: One of the
+ main requirements for a library to generate run objects for the
+ OpenML server. Obtains a train split (with labels) and a test
+ split (without labels) and the goal is to train a model on the
+ train split and return the predictions on the test split. On top
+ of the actual predictions, also the class probabilities should
+ be determined. For classifiers that do not return class
+      probabilities, this can just be the one-hot encoded predicted label.
+ The predictions will be evaluated on the OpenML server. Also,
+ additional information can be returned, for example,
+ user-defined measures (such as runtime information, as this can
+ not be inferred on the server). Additionally, information about
+ a hyperparameter optimization trace can be provided.
+ - `obtain_parameter_values`{.interpreted-text role="meth"}:
+ Obtains the hyperparameters of a given model and the current
+ values. Please note that in the case of a hyperparameter
+      optimization procedure (e.g., random search), you should only
+      return the hyperparameters of this procedure (e.g., the
+      hyperparameter grid, budget, etc.); the chosen model will
+      be inferred from the optimization trace.
+ - `check_if_model_fitted`{.interpreted-text role="meth"}: Check
+ whether the train method of the model has been called (and as
+ such, whether the predict method can be used).
+- Hyperparameter optimization (optional)
+ - `instantiate_model_from_hpo_class`{.interpreted-text
+ role="meth"}: If a given run has recorded the hyperparameter
+ optimization trace, then this method can be used to
+ reinstantiate the model with hyperparameters of a given
+ hyperparameter optimization iteration. Has some similarities
+ with `flow_to_model`{.interpreted-text role="meth"} (as this
+ method also sets the hyperparameters of a model). Note that
+ although this method is required, it is not necessary to
+ implement any logic if hyperparameter optimization is not
+ implemented. Simply raise a [NotImplementedError]{.title-ref}
+ then.
+
+### Hosting the library
+
+Each extension created should be a stand-alone repository, compatible
+with the [OpenML-Python
+repository](https://github.com/openml/openml-python). The extension
+repository should work off-the-shelf with *OpenML-Python* installed.
+
+Create a [public Github
+repo](https://docs.github.com/en/github/getting-started-with-github/create-a-repo)
+with the following directory structure:
+
+    [repo name]
+    |-- [extension name]
+    |   |-- __init__.py
+    |   |-- extension.py
+    |   |-- config.py (optionally)
+
+### Recommended
+
+- Test cases to keep the extension up to date with the
+ [openml-python]{.title-ref} upstream changes.
+- Documentation of the extension API, especially if any new
+  functionality is added to OpenML-Python's extension design.
+- Examples to show how the new extension interfaces and works with
+ OpenML-Python.
+- Create a PR to add the new extension to the OpenML-Python API
+ documentation.
+
+Happy contributing!
+
diff --git a/docs/integrations/getting_started.ipynb b/docs/integrations/getting_started.ipynb
new file mode 100644
index 00000000..81fbb561
--- /dev/null
+++ b/docs/integrations/getting_started.ipynb
@@ -0,0 +1,222 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " \n",
+ ""
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/markdown": [
+       "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/SubhadityaMukherjee/openml_docs/HEAD?labpath=Scikit-learn%2Fdatasets_tutorial)"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from IPython.display import display, HTML, Markdown\n",
+ "import os\n",
+ "import yaml\n",
+ "with open(\"../../mkdocs.yml\", \"r\") as f:\n",
+ " load_config = yaml.safe_load(f)\n",
+ "repo_url = load_config[\"repo_url\"].replace(\"https://github.com/\", \"\")\n",
+ "binder_url = load_config[\"binder_url\"]\n",
+ "relative_file_path = \"integrations/getting_started.ipynb\"\n",
+ "display(HTML(f\"\"\"\n",
+ " \n",
+ "\"\"\"))\n",
+    "display(Markdown(\"[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/SubhadityaMukherjee/openml_docs/HEAD?labpath=Scikit-learn%2Fdatasets_tutorial)\"))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Getting Started\n",
+ "\n",
+ "This page will guide you through the process of getting started with OpenML. While this page is a good starting point, for more detailed information, please refer to the [integrations section](Scikit-learn/index.md) and the rest of the documentation.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Authentication\n",
+    "- If you are using the OpenML API to download datasets, upload results, or create tasks, you will need to authenticate. You can do this by creating an account on the OpenML website and using your API key.\n",
+    "- You can find detailed instructions on how to authenticate in the [authentication section](apikey.md)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install openml"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## EEG Eye State example\n",
+    "Download the OpenML task for the eeg-eye-state dataset.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# License: BSD 3-Clause\n",
+ "\n",
+ "import openml\n",
+ "from sklearn import neighbors"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "**Warning**\n",
+    "\n",
+    ".. include:: ../../test_server_usage_warning.txt\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "openml.config.start_using_configuration_for_example()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "When using the main server instead, make sure your API key is configured.\n",
+    "This can be done with the following line of code (uncomment it first).\n",
+    "Never share your API key with others.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# openml.config.apikey = 'YOURKEY'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Caching\n",
+    "Downloaded datasets, tasks, runs, and flows are cached locally so that\n",
+    "they can be retrieved later without calling the server again. As with\n",
+    "the API key, the cache directory can be specified either through the\n",
+    "config file or through the API:\n",
+ "\n",
+ "* Add the line **cachedir = 'MYDIR'** to the config file, replacing\n",
+ " 'MYDIR' with the path to the cache directory. By default, OpenML\n",
+ " will use **~/.openml/cache** as the cache directory.\n",
+ "* Run the code below, replacing 'YOURDIR' with the path to the cache directory.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# Uncomment and set your OpenML cache directory\n",
+ "# import os\n",
+ "# openml.config.cache_directory = os.path.expanduser('YOURDIR')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "task = openml.tasks.get_task(403)\n",
+ "data = openml.datasets.get_dataset(task.dataset_id)\n",
+ "clf = neighbors.KNeighborsClassifier(n_neighbors=5)\n",
+ "run = openml.runs.run_model_on_task(clf, task, avoid_duplicate_runs=False)\n",
+ "# Publish the experiment on OpenML (optional, requires an API key).\n",
+ "# For this tutorial, our configuration publishes to the test server\n",
+    "# so as not to crowd the main server with runs created by examples.\n",
+ "myrun = run.publish()\n",
+ "print(f\"kNN on {data.name}: {myrun.openml_url}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "openml.config.stop_using_configuration_for_example()"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.19"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/docs/integrations/index.md b/docs/integrations/index.md
new file mode 100644
index 00000000..16b0f68e
--- /dev/null
+++ b/docs/integrations/index.md
@@ -0,0 +1,3 @@
+# Integrations
+
+- Overview
\ No newline at end of file
diff --git a/docs/mlr.md b/docs/integrations/mlr.md
similarity index 100%
rename from docs/mlr.md
rename to docs/integrations/mlr.md
diff --git a/docs/scripts/github_scraper.py b/docs/scripts/github_scraper.py
new file mode 100644
index 00000000..eaedb69e
--- /dev/null
+++ b/docs/scripts/github_scraper.py
@@ -0,0 +1,132 @@
+"""
+Scrape the GitHub repositories listed in showcase_urls.txt and generate a markdown file with a grid of cards summarizing each repository.
+
+Does not rely on the GitHub API, so it is limited to the information that can be scraped from the GitHub website.
+
+Inspired in part by https://brightdata.com/blog/how-tos/how-to-scrape-github-repositories-in-python
+"""
+
+import requests
+from bs4 import BeautifulSoup
+from tqdm import tqdm
+
+with open("scripts/showcase_urls.txt", "r") as file:
+ target_urls = file.readlines()
+ target_urls = [url.strip() for url in target_urls]
+main_info = """# Showcase
+
+This page is a showcase of some projects and research done using the OpenML library. Did you use OpenML in your work and want to share it with the community? We would love to have you!
+
+Simply create a pull request with the necessary information and we will add it to this page.\n"""
+
+
+def get_github_info(target_url):
+ """
+ Get the name, description and number of stars of a GitHub repository from its URL.
+ """
+ print(target_url)
+ page = requests.get(target_url)
+ soup = BeautifulSoup(page.text, "html.parser")
+ name_html_element = soup.select_one('[itemprop="name"]')
+ name = name_html_element.text.strip()
+
+ bordergrid_html_element = soup.select_one(".BorderGrid")
+ about_html_element = bordergrid_html_element.select_one("h2")
+ description_html_element = about_html_element.find_next_sibling("p")
+ description = description_html_element.get_text().strip()
+
+ star_icon_html_element = bordergrid_html_element.select_one(".octicon-star")
+ stars_html_element = star_icon_html_element.find_next_sibling("strong")
+ stars = stars_html_element.get_text().strip().replace(",", "")
+
+ return name, description, stars
+
+
+def return_details(target_urls):
+ """
+ For a list of GitHub URLs, return a dictionary with the name, description and number of stars of the repositories.
+ """
+ target_urls = list(set(target_urls)) # remove duplicates
+ urls = {}
+ for target_url in target_urls:
+ name, description, stars = get_github_info(target_url)
+ if len(name) > 0:
+ urls[target_url] = {
+ "name": name,
+ "description": description,
+ "stars": stars,
+ }
+ # sort by stars
+ urls = dict(
+ sorted(urls.items(), key=lambda item: int(item[1]["stars"]), reverse=True)
+ )
+ return urls
+
+
+def return_div(url, urls):
+ """
+ Return a div element with the information of a GitHub repository. Creates a card with the name, description and number of stars of the repository.
+
+ Example CSS
+
+ .card-container {
+ display: flex;
+ flex-wrap: wrap;
+ gap: 20px;
+ justify-content: center;
+ }
+
+ .card {
+ border: 1px solid #ccc;
+ border-radius: 5px;
+ padding: 20px;
+ width: 300px;
+ box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);
+ }
+
+ .card h2 {
+ margin-top: 0;
+ }
+
+ .card p {
+ margin-bottom: 0;}
+
+ .github-logo {
+ height: 15px;
+ width: 13px;
+ margin-left: 10px;
+ }
+
+ iframe[seamless] {
+ border: none;
+ }
+ """
+ info = urls[url]
+    # Card markup reconstructed from the CSS classes documented in the
+    # docstring above (.card); the exact original markup may have differed.
+    return f"""
+    <div class="card">
+        <h2><a href="{url}">{info['name']}</a></h2>
+        <p>{info['description']}</p>
+        <p>Stars: {info['stars']}</p>
+    </div>\n
+    """
+
+
+def generate_page(info):
+ """
+ Generate a page with a grid of cards with the information of the repositories.
+ """
+
+    # Container markup reconstructed from the .card-container CSS class
+    # documented in return_div's docstring.
+    page = """<div class="card-container">\n"""
+    for target_url in tqdm(info.keys(), total=len(info)):
+        page += return_div(target_url, info)
+    page += "</div>"
+    return page
+
+
+info = return_details(target_urls)
+# print(generate_page(info))
+with open("showcase.md", "w") as file:
+ file.write(main_info)
+ file.write(generate_page(info))
+
+# test = ["https://github.com/openml/openml-python"]
+# print(return_details(test))
diff --git a/docs/scripts/showcase_urls.txt b/docs/scripts/showcase_urls.txt
new file mode 100644
index 00000000..8c344fbc
--- /dev/null
+++ b/docs/scripts/showcase_urls.txt
@@ -0,0 +1,17 @@
+https://github.com/openml/openml-croissant
+https://github.com/openml/flow-visualization
+https://github.com/openml/continual-automl
+https://github.com/openml/openml-mxnet
+https://github.com/openml/openml-onnx
+https://github.com/openml/openml-azure
+https://github.com/openml/openml-rapidminer
+https://github.com/openml/OpenmlCortana
+https://github.com/openml/openml-python
+https://github.com/openml/openml-pytorch
+https://github.com/openml/openml-keras
+https://github.com/openml/openml-tensorflow
+https://github.com/openml/openml-R
+https://github.com/JuliaAI/OpenML.jl/tree/master
+https://github.com/mbillingr/openml-rust
+https://github.com/openml/openml-dotnet
+https://github.com/openml/automlbenchmark/
\ No newline at end of file
diff --git a/docs/showcase/index.md b/docs/showcase/index.md
new file mode 100644
index 00000000..1f763e88
--- /dev/null
+++ b/docs/showcase/index.md
@@ -0,0 +1,119 @@
+# Showcase
+
+This page is a showcase of some projects and research done using the OpenML library. Did you use OpenML in your work and want to share it with the community? We would love to have you!
+
+Simply create a pull request with the necessary information and we will add it to this page.
+