
Add predict helpers such as: predict_expectation, predict_percentile (similar to lifelines) #190

@konradsemsch

Hi! I really like your package and its compatibility with sklearn, but one thing keeping us from using it in our solution is the lack of the predict functions named above (particularly predict_expectation). In our case, we need the expected number of days a given client will remain a client. Ranking observations by the "arbitrary" score the model returns will not work.

Since predict_expectation is based on the estimated survival functions, it should be possible to implement (assuming I read the lifelines code correctly):

    def predict_expectation(self, X: DataFrame, conditional_after: Optional[ndarray] = None) -> pd.Series:
        r"""
        Compute the expected lifetime, :math:`E[T]`, using covariates X. The algorithm uses the fact that
        :math:`E[T] = \int_0^\infty P(T > t) dt = \int_0^\infty S(t) dt`. The integral is approximated with the trapezoidal rule.
        Caution
        --------
        If the survival function doesn't converge to 0, then the expectation is really infinity and the returned
        values are meaningless/too large. In that case, using ``predict_median`` or ``predict_percentile`` would be better.
        Parameters
        ----------
        X: numpy array or DataFrame
            a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns
            can be in any order. If a numpy array, columns must be in the
            same order as the training data.
        conditional_after: iterable, optional
            Must be equal in size to X.shape[0] (denoted `n` above).  An iterable (array, list, series) of possibly non-zero values that represent how long the
            subject has already lived for. Ex: if :math:`T` is the unknown event time, then this represents :math:`s` in
            :math:`T | T > s`. This is useful for knowing the *remaining* hazard/survival of censored subjects.
            The new timeline is the remaining duration of the subject, i.e. normalized back to starting at 0.
        Notes
        -----
        If X is a DataFrame, the order of the columns does not matter. But
        if X is an array, then the column ordering is assumed to be the
        same as the training dataset.
        See Also
        --------
        predict_median
        predict_percentile
        """
        subjects = utils._get_index(X)
        v = self.predict_survival_function(X, conditional_after=conditional_after)[subjects]
        return pd.Series(trapz(v.values.T, v.index), index=subjects)

Reference to the lifelines implementation:

  1. https://lifelines.readthedocs.io/en/latest/fitters/regression/CoxPHFitter.html#lifelines.fitters.coxph_fitter.SemiParametricPHFitter.predict_expectation
  2. https://github.com/CamDavidsonPilon/lifelines/blob/master/lifelines/fitters/coxph_fitter.py
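
To make the request concrete, here is a minimal sketch of the two helpers in plain NumPy, independent of any particular library. It assumes the survival functions are already available as a matrix of probabilities evaluated on a shared, increasing time grid. The names `expected_lifetime` and `percentile_time` and the toy numbers are hypothetical and only for illustration; this is not the lifelines implementation, although the first function follows the same trapezoidal-rule idea as the docstring above.

```python
import numpy as np


def expected_lifetime(times, surv):
    """Approximate E[T] = integral of S(t) dt with the trapezoidal rule.

    times: 1-D increasing array of time points, shape (m,)
    surv:  survival probabilities S(t), shape (n_subjects, m)

    The integral only covers [times[0], times[-1]], so the result
    underestimates E[T] when S(t) has not reached 0 at times[-1].
    """
    times = np.asarray(times, dtype=float)
    surv = np.atleast_2d(np.asarray(surv, dtype=float))
    widths = np.diff(times)                        # interval lengths
    heights = (surv[:, 1:] + surv[:, :-1]) / 2.0   # mean of the two endpoints
    return (widths * heights).sum(axis=1)


def percentile_time(times, surv, p=0.5):
    """First grid time at which S(t) <= p; inf if the curve never gets there."""
    times = np.asarray(times, dtype=float)
    surv = np.atleast_2d(np.asarray(surv, dtype=float))
    reached = surv <= p
    first = reached.argmax(axis=1)                 # index of first True per row
    return np.where(reached.any(axis=1), times[first], np.inf)


# Toy curves on a grid of days (illustrative numbers only).
days = np.array([0.0, 10.0, 20.0, 30.0])
curves = np.array([
    [1.0, 0.5, 0.2, 0.0],   # reaches 0: the integral is meaningful
    [1.0, 0.9, 0.8, 0.8],   # plateaus at 0.8: the integral is truncated
])

print(expected_lifetime(days, curves))      # expected days per subject
print(percentile_time(days, curves, p=0.5)) # median survival time per subject
```

The second toy curve illustrates the caution in the docstring: when the survival function plateaus above 0, the trapezoidal estimate reflects only the observed time range, and the median comes back as infinity because the curve never drops to 0.5.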
