This repository has been archived by the owner on Jun 22, 2022. It is now read-only.

LightGBM and 5-fold CV

Kamil A. Kaczmarek edited this page Jul 10, 2018 · 2 revisions

honey bee 🐝

Validation

5-fold CV with custom fold generation: the data is sorted by target value and the observations are then assigned to the folds one by one, in round-robin order, so every fold covers the full range of target values. It is implemented in utils.py:L111

import numpy as np
from sklearn.model_selection import BaseCrossValidator


class KFoldByTargetValue(BaseCrossValidator):
    def __init__(self, n_splits=3, shuffle=False, random_state=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state

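    def _iter_test_indices(self, X, y=None, groups=None):
        # The values passed as X are used as the sort key, so the target
        # vector is expected here.
        n_samples = X.shape[0]
        indices = np.arange(n_samples)

        # Order the sample indices by ascending value.
        sorted_idx_vals = sorted(zip(indices, X), key=lambda x: x[1])
        indices = [idx for idx, val in sorted_idx_vals]

        # Fold k takes every n_splits-th sorted index, starting at position k.
        for split_start in range(self.n_splits):
            split_indices = indices[split_start::self.n_splits]
            yield split_indices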

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits
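
To make the fold assignment concrete, here is a minimal standalone sketch of the same idea in plain Python. The function name `folds_by_target` and the toy target values are illustrative assumptions, not part of the repository.

```python
def folds_by_target(y, n_splits=5):
    """Return one list of test indices per fold (hypothetical helper)."""
    # Sample indices ordered by ascending target value.
    order = sorted(range(len(y)), key=lambda i: y[i])
    # Fold k takes every n_splits-th sorted index, starting at position k.
    return [order[k::n_splits] for k in range(n_splits)]


# Toy target vector, made up for illustration only.
y = [10.0, 0.5, 7.0, 3.0, 200.0, 1.0, 55.0, 4.0, 90.0, 2.0]
folds = folds_by_target(y, n_splits=5)
for k, test_idx in enumerate(folds):
    # Each fold mixes small and large targets.
    print(k, test_idx, [y[i] for i in test_idx])
```

Because neighbours in the sorted order always land in different folds, each fold sees a similar spread of target values, which is presumably the motivation for this scheme on a heavily skewed target.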

Preprocessing

  • drop constant columns
  • drop duplicate columns
  • drop columns that are zero more than a given % of the time
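
The three filters above can be sketched roughly as follows. This is an assumption-laden illustration in plain Python: the function name, the dict-of-lists column layout, and the `zero_fraction_threshold` value are all hypothetical, and the page does not state the threshold actually used.

```python
def drop_uninformative_columns(columns, zero_fraction_threshold=0.99):
    """Filter a dict mapping column name -> list of numeric values.

    zero_fraction_threshold is a hypothetical default; the actual
    cut-off used in the solution is not given on this page.
    """
    kept = {}
    seen = set()
    for name, values in columns.items():
        # 1. Constant column: at most one distinct value.
        if len(set(values)) <= 1:
            continue
        # 2. Duplicate column: same values as a column already kept.
        key = tuple(values)
        if key in seen:
            continue
        # 3. Mostly-zero column: fraction of zeros above the threshold.
        zero_fraction = sum(v == 0 for v in values) / len(values)
        if zero_fraction > zero_fraction_threshold:
            continue
        seen.add(key)
        kept[name] = values
    return kept


# Toy data, made up for illustration only.
sample = {
    "a": [1, 1, 1, 1],   # constant
    "b": [1, 2, 3, 4],
    "c": [1, 2, 3, 4],   # duplicate of "b"
    "d": [0, 0, 0, 5],   # 75% zeros
    "e": [0, 2, 0, 4],   # 50% zeros
}
filtered = drop_uninformative_columns(sample, zero_fraction_threshold=0.5)
print(list(filtered))
```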

Feature Extraction

  • as is (taken directly from competition data)

Model

  • LightGBM on raw features: 1.39 CV, 1.43 Public LB
  • zero treated as missing
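
The page does not say how zeros were marked as missing. Two common options are sketched below as assumptions: LightGBM's own `zero_as_missing` parameter, or replacing zeros with NaN before the data reaches the model. The parameter dict and the helper `zeros_to_nan` are illustrative, and every value other than `zero_as_missing` is a placeholder.

```python
import math

# Hypothetical parameter set; only zero_as_missing relates to the note above.
LGBM_PARAMS = {
    "objective": "regression",
    "use_missing": True,       # LightGBM default: handle missing values
    "zero_as_missing": True,   # treat all zeros as missing values
}


def zeros_to_nan(rows):
    """Alternative: replace exact zeros with NaN before training."""
    return [[math.nan if v == 0 else float(v) for v in row] for row in rows]


# Toy rows, made up for illustration only.
converted = zeros_to_nan([[0, 1.5], [2, 0]])
print(converted)
```

Either route tells the tree learner to treat a zero as "no value" rather than as a number to split on.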

Pipeline diagram

pipeline-solution-1