LightGBM and 5-fold CV

honey bee 🐝

Validation

5-fold CV with folds generated by a custom implementation: the data is sorted by target value, and the observations are then assigned to the folds one by one, in round-robin order. It is implemented in utils.py:L111

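import numpy as np
from sklearn.model_selection import BaseCrossValidator


class KFoldByTargetValue(BaseCrossValidator):
    def __init__(self, n_splits=3, shuffle=False, random_state=None):
        self.n_splits = n_splits
        # shuffle and random_state are stored but not used by the splitting logic
        self.shuffle = shuffle
        self.random_state = random_state

    def _iter_test_indices(self, X, y=None, groups=None):
        # Sorting uses X, so pass the target values as X to sort by target
        n_samples = X.shape[0]
        indices = np.arange(n_samples)

        # Order the sample indices by ascending value
        sorted_idx_vals = sorted(zip(indices, X), key=lambda x: x[1])
        indices = [idx for idx, val in sorted_idx_vals]

        # Deal the sorted indices out to the folds one by one
        for split_start in range(self.n_splits):
            split_indices = indices[split_start::self.n_splits]
            yield split_indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits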
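
To illustrate why sorting by target before dealing observations into folds can help, here is a minimal, self-contained sketch of the same round-robin idea using only NumPy. The function name `folds_by_sorted_target` and the synthetic skewed target are assumptions made for this illustration; they are not taken from utils.py.

```python
import numpy as np


def folds_by_sorted_target(y, n_splits=5):
    """Return one array of test indices per fold.

    Hypothetical helper: sorts samples by target value and deals them
    out to the folds one by one, mirroring the description above.
    """
    order = np.argsort(y, kind="stable")  # indices sorted by ascending target
    return [order[start::n_splits] for start in range(n_splits)]


# Synthetic, heavily skewed target (illustrative data only)
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=2.0, size=1000)

folds = folds_by_sorted_target(y, n_splits=5)

# Every sample lands in exactly one fold
assert sorted(np.concatenate(folds).tolist()) == list(range(len(y)))

# Each fold samples the whole range of the sorted target,
# so the per-fold medians end up close to each other
fold_medians = [float(np.median(y[idx])) for idx in folds]
print(fold_medians)
```

Because every fold receives every fifth observation of the sorted target, each fold sees roughly the same target distribution, which tends to make per-fold scores more comparable when the target is skewed.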

Preprocessing

drop constant columns
drop duplicate columns
drop columns that are zero more than a given % of the time
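
The three steps above could be sketched with pandas roughly as follows. This is an illustration under assumptions, not the code used in the solution: the helper name `drop_uninformative_columns` is hypothetical, and `max_zero_fraction` is a placeholder because the threshold is not stated here.

```python
import pandas as pd


def drop_uninformative_columns(df, max_zero_fraction=0.99):
    """Hypothetical helper illustrating the three preprocessing steps.

    max_zero_fraction is an assumed placeholder; the source does not
    state the threshold that was actually used.
    """
    # 1. Drop constant columns (a single distinct value)
    df = df.loc[:, df.nunique(dropna=False) > 1]

    # 2. Drop duplicate columns, keeping the first occurrence
    df = df.loc[:, ~df.T.duplicated()]

    # 3. Drop columns that are zero in more than max_zero_fraction of rows
    zero_fraction = (df == 0).mean(axis=0)
    return df.loc[:, zero_fraction <= max_zero_fraction]


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],   # informative
    "b": [5.0, 5.0, 5.0, 5.0],   # constant
    "c": [1.0, 2.0, 3.0, 4.0],   # duplicate of "a"
    "d": [0.0, 0.0, 0.0, 7.0],   # zero in 75% of rows
})

cleaned = drop_uninformative_columns(toy, max_zero_fraction=0.5)
print(list(cleaned.columns))  # ['a']
```

Transposing to find duplicate columns is simple but can be slow on very wide tables; hashing each column is a common alternative in that case.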

Feature Extraction

as is (taken directly from competition data)

Model

LightGBM on raw features: 1.39 CV, 1.43 Public LB
zero treated as missing
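
One way to treat zeros as missing, sketched under assumptions: replace zeros in the feature matrix with NaN before training, since LightGBM handles NaN as a missing value by default. LightGBM also documents a `zero_as_missing` parameter. Which route was taken in this solution is not stated here, and the helper name and toy data below are hypothetical.

```python
import numpy as np
import pandas as pd


def zeros_to_nan(features):
    """Return a float copy of `features` with every zero replaced by NaN."""
    return features.astype(float).mask(features == 0, np.nan)


toy = pd.DataFrame({"f1": [0, 3, 0], "f2": [1.5, 0.0, 2.0]})
toy_missing = zeros_to_nan(toy)
print(toy_missing.isna().sum().to_dict())  # {'f1': 2, 'f2': 1}

# Assumed alternative: leave the data untouched and pass this flag in the
# LightGBM parameters instead (parameter name taken from LightGBM's docs).
lgb_params = {"objective": "regression", "zero_as_missing": True}
```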

Pipeline diagram

[Image: pipeline-solution-1]

Check our GitHub organization https://github.com/neptune-ml for more cool stuff 😃

Kamil & Kuba, core contributors

Open solutions

honey bee 🐝 LightGBM and 5-fold CV
beetle 🪲 LightGBM on binarized dataset
dromedary camel 🐪 LightGBM with row aggregations
whale 🐳 LightGBM on dimension reduced dataset
water buffalo 🐃 Exploring various dimension reduction techniques
blowfish 🐡 bucketing row aggregations
