Complex Synthetic Feature Generation #70

Merged

SkBlaz merged 8 commits into outbrain:main from 98MM:origin/mm_generators

Jul 15, 2024

Contributor

98MM commented Jul 4, 2024 •

edited by devxappuser

Added a suite of functions related to synthetic feature generation and corresponding tests.

Located in outrank/algorithms/synthetic_data_generators

https://jira.outbrain.com/browse/mm_generators


          added complex synthetic feature generators

1118ad1

Added a suite of functions related to synthetic feature generation.

Contributor Author

98MM commented Jul 4, 2024

HTML docs are yet to be written.

SkBlaz requested changes

Collaborator

SkBlaz left a comment

inspectionProfiles -> we don't need that. Please remove all files that are not relevant to the actual PR's content. Also, ccgen_tests.py and cc_generator.py seem same?

SkBlaz requested review from miha-jenko and bmramor

July 5, 2024 07:17

98MM closed this

98MM reopened this


          removed .idea

cb04d4d

Contributor Author

98MM commented Jul 5, 2024

> inspectionProfiles -> we don't need that. Please remove all files that are not relevant to the actual PR's content. Also, ccgen_tests.py and cc_generator.py seem same?

Removed .idea. cc_generator and cc_generator_tests are different files; I've checked.

98MM requested a review from SkBlaz

July 5, 2024 08:10

SkBlaz reviewed

outrank/algorithms/synthetic_data_generators/cc_generator.py (resolved)

miha-jenko requested changes

Collaborator

miha-jenko left a comment

I'll continue review later. Some comments first.

outrank/algorithms/synthetic_data_generators/cc_generator.py (outdated, resolved)

outrank/algorithms/synthetic_data_generators/cc_generator.py (outdated, resolved)

outrank/algorithms/synthetic_data_generators/cc_generator.py (outdated, resolved)

outrank/algorithms/synthetic_data_generators/cc_generator.py (resolved)

outrank/algorithms/synthetic_data_generators/cc_generator.py (outdated, resolved)

outrank/algorithms/synthetic_data_generators/cc_generator.py (outdated, resolved)

98MM added 3 commits

July 8, 2024 13:37


          pre-commit, code review changes

30549a4

pre-commit,
code review changes:
- added _feature_builder method to avoid duplicate code blocks
- added some new parameters to enable random value domains for features


          Rewrote tests with unittest instead of pytest

d0d5097


          removed if __name__ == '__main__' from file, small fix in cluster test

79eb4de

SkBlaz requested a review from miha-jenko

July 11, 2024 12:23

Contributor Author

98MM commented Jul 11, 2024

> I'll continue review later. Some comments first.

Sorry for not replying earlier. I had implemented the necessary changes and put the changelog in my commit description, but I should have replied to the comments as well. Thank you for all the feedback.

miha-jenko requested changes

outrank/algorithms/synthetic_data_generators/cc_generator.py (outdated, resolved)

outrank/algorithms/synthetic_data_generators/cc_generator.py (outdated, resolved)

outrank/algorithms/synthetic_data_generators/cc_generator.py (outdated, resolved)

outrank/algorithms/synthetic_data_generators/cc_generator.py (outdated, resolved)

outrank/algorithms/synthetic_data_generators/cc_generator.py (outdated, resolved)


          code review fixes

1e50ee7

renamed _feature_builder -> _configure_generate_featuer

Replace np.ndarray typing with ArrayLike from numpy typing, other typing fixes

miha-jenko reviewed

outrank/algorithms/synthetic_data_generators/cc_generator.py (outdated, resolved)

outrank/algorithms/synthetic_data_generators/cc_generator.py (outdated, resolved)

outrank/algorithms/synthetic_data_generators/cc_generator.py (resolved)


          Fixes with Literal, added ValueError to generate_noise, small fix in summarize

c387b51

miha-jenko previously approved these changes

SkBlaz requested changes

Collaborator

SkBlaz left a comment

@98MM a few smaller changes and we should be good to go. I'll add the unit tests to CI soon too.

outrank/algorithms/synthetic_data_generators/cc_generator.py

+                          Xn_T = X_noise.T
+                          n = Xn_T.shape[1]
+                          n_missing = int(n * p)
+                          #print("n to delete:", n_missing)

Collaborator

SkBlaz Jul 12, 2024

Redundant comment

Contributor Author

98MM Jul 13, 2024

Removed redundant comments.

outrank/algorithms/synthetic_data_generators/cc_generator.py

+                          #print("n to delete:", n_missing)
+                          for feature in Xn_T:
+                              ixs = np.random.choice(n, n_missing, replace=False)

Collaborator

SkBlaz Jul 12, 2024

Make sure seeds are set for numpy

Contributor Author

98MM Jul 13, 2024

Seeds are now set when the class is instantiated with cc = CategoricalClassification(). The default is 42; pass the seed parameter to change it.
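A minimal sketch of the seeding pattern described in this reply, assuming the constructor simply seeds NumPy's global generator. `SeededGenerator` and `sample_indices` are hypothetical stand-ins for illustration, not the PR's actual class or methods.

```python
import numpy as np


class SeededGenerator:
    """Hypothetical stand-in for the generator class discussed in this thread."""

    def __init__(self, seed: int = 42):
        # Assumption: the seed is applied to NumPy's global RNG at construction,
        # so every later np.random call in the class is reproducible.
        self.seed = seed
        np.random.seed(seed)

    def sample_indices(self, n: int, n_missing: int) -> np.ndarray:
        # Same call as in the reviewed diff: draw n_missing distinct indices.
        return np.random.choice(n, n_missing, replace=False)


first = SeededGenerator().sample_indices(10, 3)
second = SeededGenerator().sample_indices(10, 3)
# Both draws come from a freshly seeded RNG, so they are identical.
print(np.array_equal(first, second))
```

One trade-off worth noting: seeding the global generator also affects any other code that uses np.random in the same process. Holding a per-instance np.random.default_rng(seed) would avoid that, at the cost of passing the generator around.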

outrank/algorithms/synthetic_data_generators/cc_generator.py

+                              ixs = np.random.choice(n, n_missing, replace=False)
+                              for ix in ixs:
+                                  feature[ix] = missing_val

Collaborator

SkBlaz Jul 12, 2024

In general you do a lot of for-loop-based indexing. This is really not the optimal way; it can all be vectorized. Not in this PR though, let's go over it separately sometime.
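One way the loop in the diff above could be vectorized is sketched here, under stated assumptions. The function name `mask_missing_vectorized`, the default `missing_val=-1`, and the use of `np.random.default_rng` are illustrative choices, not code from the PR. Unlike the reviewed loop, which writes through a transposed view, this sketch returns a copy.

```python
import numpy as np


def mask_missing_vectorized(X, p, missing_val=-1, seed=42):
    """Set a fraction p of entries in every column of X to missing_val."""
    rng = np.random.default_rng(seed)
    X_out = np.array(X, copy=True)
    n_samples, n_features = X_out.shape
    n_missing = int(n_samples * p)

    # Sorting random keys column-wise yields an independent random
    # permutation of row indices for every feature.
    order = rng.random((n_samples, n_features)).argsort(axis=0)

    # The first n_missing entries of each permutation are distinct rows,
    # which matches sampling without replacement per feature.
    rows_to_mask = order[:n_missing, :]
    np.put_along_axis(X_out, rows_to_mask, missing_val, axis=0)
    return X_out


X = np.arange(1, 41).reshape(10, 4)
X_masked = mask_missing_vectorized(X, p=0.3)
# Count of masked entries per feature column.
print((X_masked == -1).sum(axis=0))
```

The argsort trick costs O(n log n) per feature instead of O(n_missing), so it trades some asymptotic efficiency for removing both Python-level loops.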

outrank/algorithms/synthetic_data_generators/cc_generator.py Outdated

+                          print(f"Number of classes: {self.dataset_info['labels']['n_class']}")
+                          print(f"Class relation: {self.dataset_info['labels']['class_relation']}")
+                      print('-------------------------------------')

Collaborator

SkBlaz Jul 12, 2024

Let's leave this for now, but outrank has proper logging, which should be used when the time comes. Please add a TODO.

Contributor Author

98MM Jul 13, 2024

Removed summarize function, added TODO
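For reference, a rough sketch of what the print calls in the diff above might look like with logging instead. It uses Python's standard `logging` module as a stand-in, since outrank's own logging setup is not shown in this thread. The logger name and the helper `log_label_summary` are hypothetical.

```python
import logging

# Assumption: a plain stdlib logger; the project's real logger may differ.
logger = logging.getLogger("synthetic_data_generators")


def log_label_summary(dataset_info: dict) -> list[str]:
    """Log the label summary lines that the reviewed code printed."""
    labels = dataset_info["labels"]
    lines = [
        f"Number of classes: {labels['n_class']}",
        f"Class relation: {labels['class_relation']}",
    ]
    for line in lines:
        logger.info(line)
    # Returning the lines keeps the helper easy to test without log capture.
    return lines


logging.basicConfig(level=logging.INFO)
log_label_summary({"labels": {"n_class": 3, "class_relation": "linear"}})
```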

outrank/algorithms/synthetic_data_generators/cc_generator.py Outdated

+                              adjustment = samples_per_cluster[i] - cluster_size
+                              adjustments.append(adjustment)
+                              if adjustment < 0:  # Cluter is too large

Collaborator

SkBlaz Jul 12, 2024

Comments always above lines, not after pls

Contributor Author

98MM Jul 13, 2024

Moved all comments above the code blocks they reference.

outrank/algorithms/synthetic_data_generators/cc_generator.py Outdated

+                      :return: array of labels, corresponding to dataset X
+                      """
+                      kmeans = KMeans(n_clusters=n)

Collaborator

SkBlaz Jul 12, 2024

Seed for kmeans set?

Contributor Author

98MM Jul 13, 2024

Added random_state parameter for setting clustering seed.

outrank/algorithms/synthetic_data_generators/cc_generator.py

+                                  for feature_ix in feature_ixs:
+                                      # Filling out the dataset up to feature_ix
+                                      if ix < feature_ix:

Collaborator

SkBlaz Jul 12, 2024

Also a general comment - it's not math source, you can name variables with fuller/more idiomatic names, not this PR though

outrank/algorithms/synthetic_data_generators/cc_generator.py

+                      X: ArrayLike,
+                      feature_indices: list[int] | ArrayLike,
+                      combination_function: Optional = None,
+                      combination_type: Literal['linear', 'nonlinear'] = 'linear',

Collaborator

SkBlaz Jul 12, 2024

Looks like str to me - @miha-jenko why Literal?
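To illustrate the trade-off behind this question: a `Literal` annotation lets static type checkers reject unknown strings, but it does nothing at runtime, so an explicit check is still needed there. The sketch below is a hedged illustration only. The alias `CombinationType` and the helper `check_combination_type` are hypothetical names, not part of the PR.

```python
from typing import Literal, get_args

# Hypothetical alias for the two values used in the reviewed signature.
CombinationType = Literal["linear", "nonlinear"]


def check_combination_type(combination_type: CombinationType = "linear") -> str:
    """Validate combination_type at runtime against the Literal's values."""
    allowed = get_args(CombinationType)
    if combination_type not in allowed:
        raise ValueError(
            f"combination_type must be one of {allowed}, got {combination_type!r}"
        )
    return combination_type


print(check_combination_type("nonlinear"))
```

With a plain `str` annotation the signature would accept any string as far as a type checker is concerned, and only the runtime check would catch a typo such as "non-linear".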


          fixes based on code review

0019cbc

Removed unnecessary comments,
added global seed, to be set when instantiating the class,
added seed for KMeans clustering

98MM dismissed miha-jenko’s stale review via

0019cbc

July 13, 2024 12:46

98MM requested a review from SkBlaz

July 15, 2024 09:28

SkBlaz approved these changes

SkBlaz merged commit 147f037 into outbrain:main

5 checks passed
