Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Complex Synthetic Feature Generation #70

Merged
merged 8 commits into from
Jul 15, 2024

Conversation

98MM
Copy link
Contributor

@98MM 98MM commented Jul 4, 2024

Added a suite of functions related to synthetic feature generation and corresponding tests.

Located in outrank/algorithms/synthetic_data_generators

https://jira.outbrain.com/browse/mm_generators

Added a suite of functions related to synthetic feature generation.
@98MM
Copy link
Contributor Author

98MM commented Jul 4, 2024

HTML docs are yet to be written.

Copy link
Collaborator

@SkBlaz SkBlaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inspectionProfiles -> we don't need that. Please remove all files that are not relevant to the actual PR's content. Also, ccgen_tests.py and cc_generator.py seem same?

@SkBlaz SkBlaz requested review from miha-jenko and bmramor July 5, 2024 07:17
@98MM 98MM closed this Jul 5, 2024
@98MM 98MM reopened this Jul 5, 2024
@98MM
Copy link
Contributor Author

98MM commented Jul 5, 2024

inspectionProfiles -> we don't need that. Please remove all files that are not relevant to the actual PR's content. Also, ccgen_tests.py and cc_generator.py seem same?

Removed .idea, cc_generator and cc_generator_tests are differet files, i've checked.

@98MM 98MM requested a review from SkBlaz July 5, 2024 08:10
Copy link
Collaborator

@miha-jenko miha-jenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll continue review later. Some comments first.

98MM added 3 commits July 8, 2024 13:37
pre-commit,
code review changes:
- added _feature_builder method to avoid duplicate code blocks
- added some new parameters to enable random value domains for features
@SkBlaz SkBlaz requested a review from miha-jenko July 11, 2024 12:23
@98MM
Copy link
Contributor Author

98MM commented Jul 11, 2024

I'll continue review later. Some comments first.

Sorry for not replying earlier, had implemented the necessary changes with the changelog in my commit description, should have replied to the comments as well. Thank you for all the feedback.

renamed _feature_builder -> _configure_generate_featuer

Replace np.ndarray typing with ArrayLike from numpy typing, other typing fixes
miha-jenko
miha-jenko previously approved these changes Jul 12, 2024
Copy link
Collaborator

@SkBlaz SkBlaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@98MM few smaller changes, and we should be good to go. I'll add unit tests to CI too soon

Xn_T = X_noise.T
n = Xn_T.shape[1]
n_missing = int(n * p)
#print("n to delete:", n_missing)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remoevd redundant comments

#print("n to delete:", n_missing)

for feature in Xn_T:
ixs = np.random.choice(n, n_missing, replace=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure seeds are set for numpy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeds are now set when class is instantiated with cc = CategoricalClassification(), default is 42, pass seed parameter to change.

ixs = np.random.choice(n, n_missing, replace=False)

for ix in ixs:
feature[ix] = missing_val
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You in general do a lot of for loop-based indexing -- this is really not the optimal way, this can all be vectorized. Not in this PR thought, let's go over separately sometime

print(f"Number of classes: {self.dataset_info['labels']['n_class']}")
print(f"Class relation: {self.dataset_info['labels']['class_relation']}")

print('-------------------------------------')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave this for now, but outrank has proper logging which should be used when time comes, please put as todo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed summarize function, added TODO

adjustment = samples_per_cluster[i] - cluster_size
adjustments.append(adjustment)

if adjustment < 0: # Cluter is too large
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments always above lines, not after pls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved all comments above code block they're referencing.

:return: array of labels, corresponding to dataset X
"""

kmeans = KMeans(n_clusters=n)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seed for kmeans set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added random_state parameter for setting clustering seed.


for feature_ix in feature_ixs:
# Filling out the dataset up to feature_ix
if ix < feature_ix:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a general comment - it's not math source, you can name variables with fuller/more idiomatic names, not this PR though

X: ArrayLike,
feature_indices: list[int] | ArrayLike,
combination_function: Optional = None,
combination_type: Literal['linear', 'nonlinear'] = 'linear',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like str to me - @miha-jenko why Literal?

Removed unnecessary comments,
added global seed, to be set when instantiating the class,
added seed for KMeans clustering
@SkBlaz SkBlaz merged commit 147f037 into outbrain:main Jul 15, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants