Hi! I really like your package implementation and its compatibility with sklearn, but one thing that is holding us back from using it in our solution is the lack of the above predict functions (particularly predict_expectation). In our case, we need to know the number of days a given client is expected to remain a client. Ranking observations by the "arbitrary" risk score will not work for that.
Since the function above is derived from the estimated survival functions, it should be possible to implement (assuming I have read the lifelines code correctly):
def predict_expectation(self, X: DataFrame, conditional_after: Optional[ndarray] = None) -> pd.Series:
    r"""
    Compute the expected lifetime, :math:`E[T]`, using covariates X. The algorithm uses the fact that
    :math:`E[T] = \int_0^\infty P(T > t) dt = \int_0^\infty S(t) dt`. The integral is approximated with the trapezoidal rule.

    Caution
    --------
    If the survival function doesn't converge to 0, then the expectation is really infinity and the returned
    values are meaningless/too large. In that case, using ``predict_median`` or ``predict_percentile`` would be better.

    Parameters
    ----------
    X: numpy array or DataFrame
        a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns
        can be in any order. If a numpy array, columns must be in the
        same order as the training data.
    conditional_after: iterable, optional
        Must be equal in size to X.shape[0] (denoted `n` above). An iterable (array, list, series) of possibly non-zero values that represent how long the
        subject has already lived for. Ex: if :math:`T` is the unknown event time, then this represents :math:`s` in
        :math:`T | T > s`. This is useful for knowing the *remaining* hazard/survival of censored subjects.
        The new timeline is the remaining duration of the subject, i.e. normalized back to starting at 0.

    Notes
    -----
    If X is a DataFrame, the order of the columns does not matter. But
    if X is an array, then the column ordering is assumed to be the
    same as the training dataset.

    See Also
    --------
    predict_median
    predict_percentile
    """
    # One survival curve per subject: rows are time points, columns are subjects.
    subjects = utils._get_index(X)
    v = self.predict_survival_function(X, conditional_after=conditional_after)[subjects]
    # Integrate each curve over the time index with the trapezoidal rule.
    return pd.Series(trapz(v.values.T, v.index), index=subjects)
Reference to the lifelines implementation:
- https://lifelines.readthedocs.io/en/latest/fitters/regression/CoxPHFitter.html#lifelines.fitters.coxph_fitter.SemiParametricPHFitter.predict_expectation
- https://github.com/CamDavidsonPilon/lifelines/blob/master/lifelines/fitters/coxph_fitter.py
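To show that the core of this is small, here is a minimal NumPy-only sketch of the integration step. It is not taken from lifelines or from this package: the function name `expected_lifetime` and the assumed input layout (one survival curve per row, evaluated on a shared, increasing time grid) are my own placeholders, so the real implementation would need to adapt it to whatever the estimator's survival function output looks like.

```python
import numpy as np


def expected_lifetime(times, survival):
    """Approximate E[T] = integral of S(t) dt with the trapezoidal rule.

    times    : 1-D increasing array of shape (m,), the shared time grid
    survival : array of shape (n, m), one survival curve per row
    (Both the name and the input layout are assumptions for illustration.)
    """
    times = np.asarray(times, dtype=float)
    survival = np.atleast_2d(np.asarray(survival, dtype=float))
    widths = np.diff(times)                                  # grid spacing, shape (m-1,)
    heights = (survival[:, 1:] + survival[:, :-1]) / 2.0     # mean of adjacent values
    return (heights * widths).sum(axis=1)                    # one value per subject


# Toy check with exponential survival curves S(t) = exp(-rate * t),
# whose true expectation is 1 / rate.
grid = np.linspace(0.0, 50.0, 5001)
rates = np.array([0.5, 1.0])
curves = np.exp(-rates[:, None] * grid[None, :])
print(expected_lifetime(grid, curves))
```

As in the lifelines docstring, the result is only meaningful when the curves get close to 0 within the time grid; otherwise the integral is truncated at the last observed time and underestimates the expectation.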