Skip to content

Commit

Permalink
Initial version of documentation
Browse files Browse the repository at this point in the history
  • Loading branch information
bartvanb committed Apr 5, 2024
1 parent 7ecae18 commit 0f7bb7f
Show file tree
Hide file tree
Showing 9 changed files with 217 additions and 32 deletions.
10 changes: 9 additions & 1 deletion docs/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,8 @@
# documentation root, use os.path.abspath to make it absolute, like shown here.
#

# import os
import os

# import sys
# sys.path.insert(0, os.path.abspath('../package'))

Expand All @@ -29,6 +30,7 @@
"sphinx.ext.autosectionlabel",
"sphinx.ext.intersphinx",
"sphinx_click.ext",
"sphinxcontrib.plantuml",
]


Expand All @@ -54,3 +56,9 @@
pygments_style = None

numfig = False

plantuml_output_format = "svg_img"
local_plantuml_path = os.path.join(
os.path.dirname(__file__), "static", "java", "plantuml.jar"
)
plantuml = f"java -Djava.awt.headless=true -jar {local_plantuml_path}"
10 changes: 7 additions & 3 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,17 +4,21 @@ Overview
Description
-----------

.. Give short description of the algorithm on this main page of the docs
This algorithm computes a cross-table a.ka.
`contingency table <https://en.wikipedia.org/wiki/Contingency_table>`_ for two or more
categorical variables. The algorithm takes categorical variables as input and returns a
table with the counts of the number of occurrences of each combination of categories.

Authors
-------

.. List authors.
Bart van Beusekom, Frank Martin, Hasan Alradhi, *Netherlands Comphrensive Cancer Organisation (IKNL).*

Source code
-----------

.. Describe where to find source code and docker files
Source code is available in the following
`GitHub repository <https://github.com/vantage6/v6-crosstab-py>`_.


Contents
Expand Down
1 change: 1 addition & 0 deletions docs/requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -3,4 +3,5 @@ sphinx==5.3.0
sphinx-autobuild
sphinx-autodoc-typehints
sphinx-click==4.4.0
sphinxcontrib-plantuml
furo==2022.12.7
Binary file added docs/static/java/plantuml.jar
Binary file not shown.
68 changes: 62 additions & 6 deletions docs/v6-crosstab-py/implementation.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,20 +4,76 @@ Implementation
Overview
--------

Central (``central_crosstab``)
-----------------
The central part is responsible for the orchestration and aggregation of the algorithm.
The implementation is rather straightforward. The central part requests the partial
contingency table from each node, which they compute in one go. The central part then
aggregates the partial contingency tables to the final table.

.. uml::

!theme superhero-outline

caption The central part of the algorithm is responsible for the \
orchestration and aggregation\n of the algorithm. The partial \
parts are executed on each node.

|client|
:request analysis;

|central|
:Collect organizations
in collaboration;
:Create partial tasks;

|partial|
:Partial_crosstab creates
partial contingency tables;

|partial|
:Mask values below
privacy threshold;

|central|
:Combine contingency
tables;

|client|
:Receive results;

.. Describe the central function here.

Partials
--------

Partials are the computations that are executed on each node. The partials have access
to the data that is stored on the node. The partials are executed in parallel on each
node.

``partial_crosstab``
~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~

The partial function computes the local contingency table. Any values below the privacy threshold
are converted to a range. For example, if the privacy threshold is 5, then all values
below 5 are converted to '0-4'. The local contingency table is sent to the central part.

The partial function includes several privacy checks - see the
:ref:`privacy guards <privacy-guards>` section for more information.

Central
-------

The central part is responsible for the aggregation of the cross-tables of individual
nodes.

``central_crosstab``
~~~~~~~~~~~~~~~~~~~~

The central part sums up the local contingency tables resulting in the global
contingency table.

.. Describe the partial function.
- If the local contingency table contains values below the privacy threshold, the
central part will show a range instead of the actual value. For example, if the
privacy threshold is 5, and the two nodes included in an analysis report back 0-4
and 11, the central part will show 11-15 as the value for the corresponding cell.
- If one node does not contain a cell that another node do have, the central part will
simply count zero for the missing cell, i.e. the central part will simply show the
sum of the values that *are* reported.

65 changes: 58 additions & 7 deletions docs/v6-crosstab-py/privacy.rst
Original file line number Diff line number Diff line change
@@ -1,17 +1,67 @@
Privacy
=======

.. _privacy-guards:

Guards
------

.. What have you done to protect your users' privacy? E.g. threshold on low counts,
.. noise addition, etc.
There are several guards in place to protect sharing too much information on individual
records:

- **Thresholding**: The system will only share information if there are at least `n`
records in the group. This is to prevent sharing information on individual records.
By default, the threshold is set to 5 records. Node administrators can change this
threshold by adding the following to their node configuration file:

.. code:: yaml
algorithm_env:
CROSSTAB_PRIVACY_THRESHOLD: 5
and setting the value to the desired threshold. This configuration will ensure that
an environment variable `CROSSTAB_PRIVACY_THRESHOLD` is set to the desired threshold
and passed to the algorithm container.

Note that the algorithm also requires at least one field of the contingency table to
pass the threshold. This is to prevent that if a task is created for a column that
contains only unique values, the result would reveal which unique values are present
in the column.

- **Minimum number of data rows to participate**: A node will only participate if it
contains at least `n` data rows. This is to prevent nodes with very little data from
participating in the computation. By default, the minimum number of data rows is set
to 5. Node administrators can change this minimum by adding the following to their
node configuration file:

.. code:: yaml
algorithm_env:
CROSSTAB_MINIMUM_ROWS_TOTAL: 5
- **Not allowing zero values**: By default, the system will not values of zero to be shared.
In principle, it should be OK to share zero values, since this only confirms an
absence of certain combinations of values. However, it may be possible to infer
information from zero values. For example, for a rather sparse contingency table,
the information which combinations exist at a certain data is more valuable than for
a dense table. It is therefore possible not to share this information by not sharing
zero values.

If you do wish to share zero values, you can add the following to your node
configuration:

.. code:: yaml
algorithm_env:
CROSSTAB_ALLOW_ZERO: true
Data sharing
------------

.. which data is shared between the parties? E.g. for an average, sum and total count
.. are shared.
The only intermediate data that is shared, are the local contingency tables. These
are formatted in the same way as the final result, but contain only the data from
the local node. The risk of sharing this data is low, as it concerns aggregated data.

Vulnerabilities to known attacks
--------------------------------
Expand All @@ -27,11 +77,12 @@ Vulnerabilities to known attacks
- Risk eliminated?
- Risk analysis
* - Reconstruction
-
- May happen if ...
-
-
* - Differencing
- ❌
- Possible by doing A then B...
- May be possible by making smart selection with preprocessing, or by sending
multiple tasks before and after data is updated.
* - Deep Leakage from Gradients (DLG)
- ✔
-
Expand Down
9 changes: 7 additions & 2 deletions docs/v6-crosstab-py/references.rst
Original file line number Diff line number Diff line change
@@ -1,5 +1,10 @@
References
==========

.. If applicable, include references to papers, books, or other documents
for further reading.
This particular algorithm has not been published yet. If you use this code in your
research, please cite the following paper:

1. Moncada-Torres, Arturo, et al. "VANTAGE6: an open source priVAcy preserviNg federaTed
leArninG infrastructurE for Secure Insight eXchange." *AMIA annual symposium proceedings.*
Vol. 2020. American Medical Informatics Association, 2020.
`[link] <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8075508/>`_
76 changes: 65 additions & 11 deletions docs/v6-crosstab-py/usage.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,22 @@ How to use
Input arguments
---------------

.. describe the input arguments:
.. ['organizations_to_include']
Input arguments
---------------

.. list-table::
:widths: 20 80
:header-rows: 1

* - Argument
- Description
* - ``results_col``
- The column whose categories will be the columns of the contingency table.
* - ``group_cols``
- One or more columns whose categories, or combinations of categories, will be the
rows of the contingency table.
* - ``organizations_to_include``
- Which organizations to include in the computation.

Python client example
---------------------
Expand All @@ -15,7 +29,47 @@ framework. If you are not, please read the `documentation <https://docs.vantage6
first, especially the part about the
`Python client <https://docs.vantage6.ai/en/main/user/pyclient.html>`_.

.. TODO Some explanation of the code below
Let's say you want to know how many males and females are overweight in different age
groups, e.g. something like:

.. list-table::
:widths: 20 20 20 20
:header-rows: 1

* - AgeGroup
- isOverweight
- Male
- Female
* - 0-18
- True
- 11
- 6
* - 0-18
- False
- 30
- 29
* - 18-65
- True
- 55
- 44
* - 18-65
- False
- 50
- 56
* - 65+
- True
- 5
- 10
* - 65+
- False
- 15
- 14


Such a result could be obtained by running the following Python client code. Note that
``AgeGroup``, ``isOverweight``, and ``Gender`` should be categorical values in your
dataset, and that you should replace the values at the top to authenticate with your
vantage6 server.

.. code-block:: python
Expand All @@ -34,23 +88,23 @@ first, especially the part about the
client.authenticate(username, password)
input_ = {
'master': True,
'method': 'central_crosstab',
'args': [],
'kwargs': {
'organizations_to_include': 'my_value',
},
'output_format': 'json'
'results_col': 'Gender',
'group_cols': ["AgeGroup", "isOverweight"]
}
}
my_task = client.task.create(
collaboration=1,
organizations=[1],
name='v6-crosstab-py',
name='Compute contingency table',
description='Create a contingency table showing the relationship between two or more variables',
image='harbor2.vantage6.ai/algorithms/v6-crosstab-py:v4',
image='harbor2.vantage6.ai/algorithms/v6-crosstab-py:latest',
input=input_,
data_format='json'
databases=[
{'label': 'default'}
]
)
task_id = my_task.get('id')
Expand Down
10 changes: 8 additions & 2 deletions docs/v6-crosstab-py/validation.rst
Original file line number Diff line number Diff line change
@@ -1,5 +1,11 @@
Validation
==========

.. Describe how the algorithm has been tested and how a user may test the algorithm
.. themselves (if applicable).
A `test script <https://github.com/vantage6/v6-crosstab-py/blob/main/test/test.py>`_ is
available in the `test` directory. It can be run with the following command:

.. code-block:: bash
python test/test.py
The script will run the crosstab algorithm via the vantage6 ``MockAlgorithmClient``.

0 comments on commit 0f7bb7f

Please sign in to comment.