Docs update #26

Merged
merged 6 commits into from
Sep 17, 2023

12 changes: 0 additions & 12 deletions DOCS.md

This file was deleted.

71 changes: 39 additions & 32 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,15 @@
*///////////////.
//////////////////////*
*/////////////////////////.
////////////// */////////////
/////////* /////////
////// ///// ////, /////
//////// /// /////////
///// ///// .///// ////*
,//// ////
*//// ////.
///////*///////


░█████╗░██╗░░░██╗████████╗██████╗░░█████╗░███╗░░██╗██╗░░██╗
██╔══██╗██║░░░██║╚══██╔══╝██╔══██╗██╔══██╗████╗░██║██║░██╔╝
@@ -6,49 +18,44 @@
╚█████╔╝╚██████╔╝░░░██║░░░██║░░██║██║░░██║██║░╚███║██║░╚██╗
░╚════╝░░╚═════╝░░░░╚═╝░░░╚═╝░░╚═╝╚═╝░░╚═╝╚═╝░░╚══╝╚═╝░░╚═╝


[![CI - package](https://github.com/outbrain/outrank/actions/workflows/python-package.yml/badge.svg)](https://github.com/outbrain/outrank/actions/workflows/python-package.yml) [![CI - benchmark](https://github.com/outbrain/outrank/actions/workflows/benchmarks.yml/badge.svg)](https://github.com/outbrain/outrank/actions/workflows/benchmarks.yml) [![CI - selftest](https://github.com/outbrain/outrank/actions/workflows/selftest.yml/badge.svg)](https://github.com/outbrain/outrank/actions/workflows/selftest.yml)
# Feature interaction module

This tool enables fast screening of feature-feature interactions. Its purpose is to give the user fast insight into potential redundancies/anomalies in the data.
It operates in _mini batches_: it traverses the `raw data` incrementally, refining the rankings as it goes along.
The interaction ranking outputs triplets which look as follows:

```
featureA featureB 0.512
featureA featureC 0.125
```
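The mini-batch traversal described above can be sketched as follows. This is a minimal illustration, not OutRank's implementation: it assumes rows are already parsed into tuples of categorical values, uses plain empirical mutual information as the pairwise score, and averages the per-batch scores. The names `mutual_information` and `rank_in_batches` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two categorical columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of raw counts.
    return sum(
        (c / n) * log((c * n) / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def rank_in_batches(rows, feature_names, batch_size):
    """Traverse rows batch by batch, keeping a running mean score per feature pair.

    Hypothetical helper: returns (featureA, featureB, score) triplets sorted
    from the highest to the lowest score.
    """
    totals = defaultdict(float)
    n_batches = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # Turn the batch of rows into one column per feature.
        columns = {
            name: [row[i] for row in batch]
            for i, name in enumerate(feature_names)
        }
        for a, b in combinations(feature_names, 2):
            totals[(a, b)] += mutual_information(columns[a], columns[b])
        n_batches += 1
    triplets = [(a, b, total / n_batches) for (a, b), total in totals.items()]
    return sorted(triplets, key=lambda t: t[2], reverse=True)
```

Calling `rank_in_batches(rows, ["A", "B", "C"], batch_size=2)` on rows where `A` and `B` always agree and `C` is constant places the `(A, B)` pair first, since a constant column carries no information about the others.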


# Use - CLI
```bash
pip install outrank
```

and test a minimal cycle with

```bash
outrank --task selftest
```

If this passes, you can be reasonably confident that OutRank will perform as intended.

OutRank is primarily a CLI tool; begin exploring with

```bash
outrank --help
```
# TLDR
> The design of modern recommender systems relies on understanding which parts of the feature space are relevant for solving a given recommendation task. However, real-world data sets in this domain are often characterized by their large size, sparsity, and noise, making it challenging to identify meaningful signals. Feature ranking represents an efficient branch of algorithms that can help address these challenges by identifying the most informative features and facilitating the automated search for more compact and better-performing models (AutoML). We introduce OutRank, a system for versatile feature ranking and data quality-related anomaly detection. OutRank was built with categorical data in mind, utilizing a variant of mutual information that is normalized with regard to the noise produced by features of the same cardinality. We further extend the similarity measure by incorporating information on feature similarity and combined relevance.
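
To make the idea of cardinality-aware normalization in the abstract above more concrete, here is a rough sketch. It is an assumption-laden illustration rather than the formula used by OutRank or the paper: the noise level is estimated as the mean mutual information between the target and random features of the same cardinality, and it is simply subtracted from the observed score. The names `mutual_information` and `noise_adjusted_mi` are hypothetical.

```python
import random
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two categorical columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log((c * n) / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def noise_adjusted_mi(feature, target, n_random=20, seed=0):
    """MI(feature, target) minus the mean MI of random same-cardinality features.

    Assumption: subtraction is used here for simplicity; the actual
    normalization in OutRank may differ.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    cardinality = len(set(feature))
    observed = mutual_information(feature, target)
    noise = 0.0
    for _ in range(n_random):
        # A random feature with as many possible values as the real one.
        random_feature = [rng.randrange(cardinality) for _ in feature]
        noise += mutual_information(random_feature, target)
    return observed - noise / n_random
```

The motivation is that a high-cardinality feature, such as a unique row identifier, reaches the maximum plain mutual information with any target purely by chance. Comparing against random features of the same cardinality penalizes it, while a genuinely informative low-cardinality feature keeps most of its score.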

A minimal showcase is demonstrated with [this example](./scripts/run_minimal.sh)
# Getting started
Minimal examples and an interface to explore OutRank's functionality are available as [the docs](https://outbrain.github.io/outrank).

# Contributing
1. Make sure the functionality is not already implemented!
2. Decide where the functionality would fit best (is it an algorithm? A parser?)
3. Open a PR with rationale


# Bugs and other reports
Feel free to open a PR that contains:
1. Issue overview
2. Minimal example useful for replicating the issue on our end
3. Possible solution

# Citing this work
If you use or build on top of OutRank, feel free to cite:

```
@inproceedings{10.1145/3604915.3610636,
author = {Skrlj, Blaz and Mramor, Bla\v{z}},
title = {OutRank: Speeding up AutoML-Based Model Search for Large Sparse Data Sets with Cardinality-Aware Feature Ranking},
year = {2023},
isbn = {9798400702419},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3604915.3610636},
doi = {10.1145/3604915.3610636},
abstract = {The design of modern recommender systems relies on understanding which parts of the feature space are relevant for solving a given recommendation task. However, real-world data sets in this domain are often characterized by their large size, sparsity, and noise, making it challenging to identify meaningful signals. Feature ranking represents an efficient branch of algorithms that can help address these challenges by identifying the most informative features and facilitating the automated search for more compact and better-performing models (AutoML). We introduce OutRank, a system for versatile feature ranking and data quality-related anomaly detection. OutRank was built with categorical data in mind, utilizing a variant of mutual information that is normalized with regard to the noise produced by features of the same cardinality. We further extend the similarity measure by incorporating information on feature similarity and combined relevance. The proposed approach’s feasibility is demonstrated by speeding up the state-of-the-art AutoML system on a synthetic data set with no performance loss. Furthermore, we considered a real-life click-through-rate prediction data set where it outperformed strong baselines such as random forest-based approaches. The proposed approach enables exploration of up to 300\% larger feature spaces compared to AutoML-only approaches, enabling faster search for better models on off-the-shelf hardware.},
booktitle = {Proceedings of the 17th ACM Conference on Recommender Systems},
pages = {1078–1083},
numpages = {6},
keywords = {Feature ranking, massive data sets, AutoML, recommender systems},
location = {Singapore, Singapore},
series = {RecSys '23}
}
```
35 changes: 35 additions & 0 deletions docs/DOCSMAIN.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# Welcome to OutRank's documentation!

All functions and methods are searchable via the search bar on the left.

This tool enables fast screening of feature-feature interactions. Its purpose is to give the user fast insight into potential redundancies/anomalies in the data.
It operates in _mini batches_: it traverses the `raw data` incrementally, refining the rankings as it goes along. The core operation, interaction ranking, outputs triplets which look as follows:

```
featureA featureB 0.512
featureA featureC 0.125
```
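Triplets in this shape are easy to post-process. The sketch below assumes whitespace-separated lines exactly like the sample above; the delimiter and file layout of OutRank's real output are not confirmed here, and `Interaction` and `parse_triplets` are hypothetical names.

```python
from typing import NamedTuple


class Interaction(NamedTuple):
    """One ranked feature pair (hypothetical container for this sketch)."""
    feature_a: str
    feature_b: str
    score: float


def parse_triplets(text):
    """Parse 'featureA featureB score' lines and sort by descending score."""
    interactions = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        interactions.append(Interaction(parts[0], parts[1], float(parts[2])))
    return sorted(interactions, key=lambda item: item.score, reverse=True)
```

For example, `parse_triplets("featureA featureB 0.512\nfeatureA featureC 0.125")` yields two `Interaction` records with the `featureA`/`featureB` pair first.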


# Use and installation - first steps (OutRank as a CLI)
```bash
pip install outrank
```

and test a minimal cycle with

```bash
outrank --task selftest
```

If this passes, you can be reasonably confident that OutRank will perform as intended. OutRank is primarily a CLI tool; begin exploring with

```bash
outrank --help
```


# Example use cases
* A minimal showcase of performing feature ranking on a generic CSV is demonstrated with [this example](../scripts/run_minimal.sh)

* [More examples](../scripts/) demonstrating OutRank's capabilities are also available.
3 changes: 2 additions & 1 deletion docs/build_docs.sh
Original file line number Diff line number Diff line change
@@ -1 +1,2 @@
cd ..; rm -rf docs; pdoc ./outrank -o docs;
# Note: this requires pdoc>=14.1.0 to run
rm -rvf index.html outrank outrank.html search.js; cd ..; pdoc ./outrank -o docs;
61 changes: 40 additions & 21 deletions docs/outrank.html
Original file line number Diff line number Diff line change
@@ -24,6 +24,8 @@
<h2>Contents</h2>
<ul>
<li><a href="#welcome-to-outranks-documentation">Welcome to OutRank's documentation!</a></li>
<li><a href="#use-and-installation-first-steps-outrank-as-a-cli">Use and installation - first steps (OutRank as a CLI)</a></li>
<li><a href="#example-use-cases">Example use cases</a></li>
</ul>


@@ -56,36 +58,53 @@ <h2>Submodules</h2>
<h1 class="modulename">
outrank </h1>

<div class="docstring"><pre><code>░█████╗░██╗░░░██╗████████╗██████╗░░█████╗░███╗░░██╗██╗░░██╗
██╔══██╗██║░░░██║╚══██╔══╝██╔══██╗██╔══██╗████╗░██║██║░██╔╝
██║░░██║██║░░░██║░░░██║░░░██████╔╝███████║██╔██╗██║█████═╝░
██║░░██║██║░░░██║░░░██║░░░██╔══██╗██╔══██║██║╚████║██╔═██╗░
╚█████╔╝╚██████╔╝░░░██║░░░██║░░██║██║░░██║██║░╚███║██║░╚██╗
░╚════╝░░╚═════╝░░░░╚═╝░░░╚═╝░░╚═╝╚═╝░░╚═╝╚═╝░░╚══╝╚═╝░░╚═╝
<div class="docstring"><h1 id="welcome-to-outranks-documentation">Welcome to OutRank's documentation!</h1>

<p>All functions and methods are searchable via the search bar on the left.</p>

<p>This tool enables fast screening of feature-feature interactions. Its purpose is to give the user fast insight into potential redundancies/anomalies in the data.
It operates in _mini batches_: it traverses the <code>raw data</code> incrementally, refining the rankings as it goes along. The core operation, interaction ranking, outputs triplets which look as follows:</p>

<pre><code>featureA featureB 0.512
featureA featureC 0.125
</code></pre>

<h1 id="welcome-to-outranks-documentation">Welcome to OutRank's documentation!</h1>
<h1 id="use-and-installation-first-steps-outrank-as-a-cli">Use and installation - first steps (OutRank as a CLI)</h1>

<p>All functions and methods are searchable via the search bar on the left.</p>
<div class="pdoc-code codehilite">
<pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>outrank
</code></pre>
</div>

<p>and test a minimal cycle with</p>

<div class="pdoc-code codehilite">
<pre><span></span><code>outrank<span class="w"> </span>--task<span class="w"> </span>selftest
</code></pre>
</div>

<p>If this passes, you can be reasonably confident that OutRank will perform as intended. OutRank is primarily a CLI tool; begin exploring with</p>

<div class="pdoc-code codehilite">
<pre><span></span><code>outrank<span class="w"> </span>--help
</code></pre>
</div>

<h1 id="example-use-cases">Example use cases</h1>

<ul>
<li><p>A minimal showcase of performing feature ranking on a generic CSV is demonstrated with <a href="../scripts/run_minimal.sh">this example</a></p></li>
<li><p><a href="../scripts/">More examples</a> demonstrating OutRank's capabilities are also available.</p></li>
</ul>
</div>

<input id="mod-outrank-view-source" class="view-source-toggle-state" type="checkbox" aria-hidden="true" tabindex="-1">

<label class="view-source-button" for="mod-outrank-view-source"><span>View Source</span></label>

<div class="pdoc-code codehilite"><pre><span></span><span id="L-1"><a href="#L-1"><span class="linenos"> 1</span></a><span class="sd">&quot;&quot;&quot;</span>
</span><span id="L-2"><a href="#L-2"><span class="linenos"> 2</span></a><span class="sd">.. include:: ../DOCS.md</span>
</span><span id="L-3"><a href="#L-3"><span class="linenos"> 3</span></a><span class="sd">&quot;&quot;&quot;</span>
</span><span id="L-4"><a href="#L-4"><span class="linenos"> 4</span></a>
</span><span id="L-5"><a href="#L-5"><span class="linenos"> 5</span></a><span class="kn">from</span> <span class="nn">__future__</span> <span class="kn">import</span> <span class="n">annotations</span>
</span><span id="L-6"><a href="#L-6"><span class="linenos"> 6</span></a>
</span><span id="L-7"><a href="#L-7"><span class="linenos"> 7</span></a><span class="kn">import</span> <span class="nn">logging</span>
</span><span id="L-8"><a href="#L-8"><span class="linenos"> 8</span></a>
</span><span id="L-9"><a href="#L-9"><span class="linenos"> 9</span></a><span class="n">logging</span><span class="o">.</span><span class="n">basicConfig</span><span class="p">(</span>
</span><span id="L-10"><a href="#L-10"><span class="linenos">10</span></a> <span class="nb">format</span><span class="o">=</span><span class="s1">&#39;</span><span class="si">%(asctime)s</span><span class="s1"> - </span><span class="si">%(message)s</span><span class="s1">&#39;</span><span class="p">,</span>
</span><span id="L-11"><a href="#L-11"><span class="linenos">11</span></a> <span class="n">datefmt</span><span class="o">=</span><span class="s1">&#39;</span><span class="si">%d</span><span class="s1">-%b-%y %H:%M:%S&#39;</span><span class="p">,</span>
</span><span id="L-12"><a href="#L-12"><span class="linenos">12</span></a><span class="p">)</span>
</span><span id="L-13"><a href="#L-13"><span class="linenos">13</span></a><span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="vm">__name__</span><span class="p">)</span><span class="o">.</span><span class="n">setLevel</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">INFO</span><span class="p">)</span>
<div class="pdoc-code codehilite"><pre><span></span><span id="L-1"><a href="#L-1"><span class="linenos">1</span></a><span class="sd">&quot;&quot;&quot;</span>
</span><span id="L-2"><a href="#L-2"><span class="linenos">2</span></a><span class="sd">.. include:: ../docs/DOCSMAIN.md</span>
</span><span id="L-3"><a href="#L-3"><span class="linenos">3</span></a><span class="sd">&quot;&quot;&quot;</span>
</span></pre></div>

