Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syn_Seq: python implementation and enhancement of R synthpop #324

Open
wants to merge 99 commits into
base: main
Choose a base branch
from

Conversation

minkeymouse
Copy link

Description

This pull request introduces a Python implementation of a sequential synthesis(syn_seq) plugin inspired by the R package synthpop. The new plugin is fully compatible with the synthcity library and offers the following key features:

  • Sequential Synthesis: Generates synthetic data in a sequential manner.
  • User Modification: Allows users to easily modify synthesis parameters.
  • Pre/Post Processing Utilities: Provides built-in preprocessing and postprocessing support for the syn_seq module (with potential for extension to other utilities).

The motivation behind these changes is to extend synthcity’s functionality while maintaining compatibility with the existing framework and ensuring a user-friendly, customizable synthesis experience.

Affected Dependencies

No new dependencies have been added; the changes build directly on top of the current synthcity framework.

How has this been tested?

  • Unit Testing: We have added comprehensive unit tests using pytest. These tests verify that the plugin correctly sets up its internal configuration, processes data as expected, and returns synthetic outputs that meet the specified constraints.
  • Reproduction: To reproduce the tests, run pytest from the project’s root directory. All necessary details regarding the test configuration (Python version, dependencies, etc.) are provided in the project documentation.
  • Test Configuration: The tests were executed on Python 3.9, and all changes adhere to the synthcity guidelines.

Checklist

@robsdavis
Copy link
Contributor

Thanks so much for all your work! I am reviewing the pull request now - it would be great to include this work in Synthcity!

Firstly, could you please add typing to all the code such that it passes the linter test above? Synthcity currently mandates that all code is typed and this is tested with mypy, which is implemented with the pre-commit.

To check this locally, you can run pre-commit run --all in the root dir. Just make sure you have set up the pre-commit hooks with pre-commit install first.

Many thanks!

@robsdavis robsdavis self-requested a review February 28, 2025 14:03
@minkeymouse
Copy link
Author

Thank you for your kind words and all the things you taught us with your well-structured codebase. We learned a lot from you. Again, many appreciation. We have implemented pre-commit fix and now they seem good to go with typing issues. Please let us know if there's any other issues so we will deal with them asap. Thanks.

@robsdavis
Copy link
Contributor

Hi, I've just started a more thorough review of your code and made a couple of comments. More importantly though, it would be great if you could explain what these plugins add conceptually to synthcity. e.g. what are the pros and cons of generating synthetic data sequentially this way? Could you also add some more information the the README to cover these new plugins? Many thanks!

@robsdavis
Copy link
Contributor

Hi @minkeymouse and @hsrhee97,

Actually, most important of all, I have just looked at the synthpop R library and I see it is licensed under "GPL-2 | GPL-3". This is incompatible with our Apache-2.0 licence (see this copyright licence diagram if you are unfamiliar). In order to to satisfy the copy-right law your implementation must be "clean-room", see this link for details. Are you confident that your implementation does not break the licensing of the synthpop package? Apologies for not spotting this first of all.

@minkeymouse
Copy link
Author

Hi @minkeymouse and @hsrhee97,

Actually, most important of all, I have just looked at the synthpop R library and I see it is licensed under "GPL-2 | GPL-3". This is incompatible with our Apache-2.0 licence (see this copyright licence diagram if you are unfamiliar). In order to to satisfy the copy-right law your implementation must be "clean-room", see this link for details. Are you confident that your implementation does not break the licensing of the synthpop package? Apologies for not spotting this first of all.

We are currently writing two short documentations on

  1. Why we believe our implementation is clean-room -> email
  2. README on pros and cons of sequential data synthesis -> file

Once we are done, we will reply you as soon as possible. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants