This repository contains an implementation of the Kaplan-Meier curve calculation designed for federated learning environments via the vantage6 framework. It allows survival probabilities to be estimated across distributed datasets without sharing patient-specific information. This method supports privacy-enhancing data analysis in medical research and other fields where time-to-event analysis is critical.
TODO: Add a link to the documentation.
The initial idea was based on contributions from Benedetta Gottardelli. It was further developed by Medical Data Works and finally adapted by the members of the BlueBerry project.
In order to minimize the risk of reconstruction, the number of organizations should be at least 3. The value of this threshold can be changed by setting KAPLAN_MEIER_MINIMUM_ORGANIZATIONS. Note that this threshold can be set by the aggregator party only!
Important
In case the aggregator node does not supply this environment variable, the default value of 3 will be used.
The algorithm will only share information if the local dataset contains at least a minimum number of records. This minimum can be set using the variable KAPLAN_MEIER_MINIMUM_NUMBER_OF_RECORDS.
Important
In case the node does not supply this environment variable, the default value of 3 will be used.
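The two thresholds described above could be read and enforced roughly as follows. This is a minimal sketch and not the repository's actual code: the helper names (`read_threshold`, `enough_organizations`, `enough_records`) are hypothetical, and only the environment variable names and the default of 3 come from this README.

```python
import os

DEFAULT_MINIMUM = 3  # default stated in this README for both thresholds


def read_threshold(variable_name, default=DEFAULT_MINIMUM):
    """Read an integer threshold from the environment, falling back to the default."""
    raw = os.environ.get(variable_name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


def enough_organizations(organization_ids):
    """Aggregator-side check (hypothetical helper): count distinct organizations."""
    minimum = read_threshold("KAPLAN_MEIER_MINIMUM_ORGANIZATIONS")
    return len(set(organization_ids)) >= minimum


def enough_records(number_of_records):
    """Node-side check (hypothetical helper): refuse to share below the minimum."""
    minimum = read_threshold("KAPLAN_MEIER_MINIMUM_NUMBER_OF_RECORDS")
    return number_of_records >= minimum
```

A node would call something like `enough_records(len(local_data))` before returning any result, and the aggregator would call `enough_organizations(...)` before starting the computation.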
In order to limit the options the user has for selecting the event time column, KAPLAN_MEIER_ALLOWED_EVENT_TIME_COLUMNS_REGEX can be set to a comma-separated list. Each element in the list can be a regex pattern.
Important
In case the node does not supply this environment variable, the default value of .* will be used, which means that any column can be used. To restrict the choice, set the regex to a more specific value. E.g. ^event_time$ will only allow the column named event_time to be used.
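The column restriction above can be sketched as follows. This is a minimal illustration rather than the repository's own code: the helper names are hypothetical, and it assumes each pattern is applied with `re.search`, which is why the example pattern carries explicit `^` and `$` anchors. Splitting on commas also means a pattern that itself contains a comma would not work in this sketch.

```python
import os
import re


def allowed_patterns():
    """Return the configured patterns, defaulting to '.*' (hypothetical helper)."""
    raw = os.environ.get("KAPLAN_MEIER_ALLOWED_EVENT_TIME_COLUMNS_REGEX", ".*")
    # The variable holds a comma-separated list; each element is a regex pattern.
    return [pattern.strip() for pattern in raw.split(",") if pattern.strip()]


def is_allowed_event_time_column(column_name):
    """True if the column matches at least one allowed pattern.

    Assumption: patterns are applied with re.search.
    """
    return any(re.search(pattern, column_name) for pattern in allowed_patterns())
```

With the variable unset, every column name is accepted. With `^event_time$` only the column named `event_time` passes.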
In order to protect the individual event times, noise can be added to this column. The column is user defined; see the “Fixed event time column” section.
The type of noise can be set through KAPLAN_MEIER_TYPE_NOISE. This can be one of the following:
NONE: no noise will be added to the event time column.
GAUSSIAN: Gaussian noise will be added. The amount of noise is controlled through a signal-to-noise ratio (SNR), set with KAPLAN_MEIER_PRIVACY_SNR_EVENT_TIME. The SNR is defined as the amount of noise compared to the standard deviation of the original signal.
POISSON: Poisson noise will be applied.
Important
In case the node does not supply this environment variable, the default value of POISSON will be used.
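The noise options above can be illustrated with a standard-library sketch. It is not the repository's implementation, and several details are assumptions: the function names are hypothetical, the Gaussian noise standard deviation is taken to be the column's standard deviation divided by the SNR (the repository may define the ratio differently), Poisson noise is taken to mean replacing each event time with a Poisson draw whose mean is that event time, and no default is assumed for the SNR variable.

```python
import os
import random
import statistics


def poisson_draw(rng, mean):
    """Draw from Poisson(mean) by counting unit-rate arrivals in [0, mean]."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed <= mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def add_noise(event_times, rng=None):
    """Return a noisy copy of the event times (hypothetical helper)."""
    rng = rng or random.Random()
    noise_type = os.environ.get("KAPLAN_MEIER_TYPE_NOISE", "POISSON").upper()
    if noise_type == "NONE":
        return list(event_times)
    if noise_type == "GAUSSIAN":
        # Assumption: sigma = standard deviation of the column / SNR.
        # No default SNR is assumed, so a missing variable raises KeyError.
        snr = float(os.environ["KAPLAN_MEIER_PRIVACY_SNR_EVENT_TIME"])
        sigma = statistics.pstdev(event_times) / snr
        return [t + rng.gauss(0.0, sigma) for t in event_times]
    if noise_type == "POISSON":
        # Assumption: each event time is used as the mean of a Poisson draw.
        return [poisson_draw(rng, t) for t in event_times]
    raise ValueError(f"Unknown noise type: {noise_type}")
```

Under these assumptions a higher SNR gives less Gaussian noise, while the Poisson option always returns non-negative integer event times.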
To build the image, it is best to use the Makefile:
make image VANTAGE6_VERSION=4.5.5