Feature/initial example tl regression benchmark #629
base: main
Conversation
Hi @kalama-ai, thanks for the PR. Here's a first high-level view. Once these comments are addressed, I'll go through the actual execution with the debugger and also review the logic of the functions defined toward the end of the module, which I have only skimmed over for now.
First round of comments - thanks already for the work :)
2fedc53 to 0001c8b (Compare)
Looks good :)
…reation to main loop to be able to reuse the function from existing benchmarks.
…lts from ConvergenceBenchmark.
…eation to main loop.
… to separate functions.
The previous implementation would add an additional task column for data passed to the TL models, which contradicted the original task column from the search space and led to no source data being used for training. The task parameter is now handled correctly by the search spaces passed to the models.
d389dc4 to ac3b95b (Compare)
This file is still misplaced in a domain/regression folder. The content should go into domain/direct_arylation/direct_arylation_*_batch or domain/direct_arylation/direct_arylation_regression.py.
While you're at it, I would probably rename the files in domain/direct_arylation, because the prefix direct_arylation in the file names is obsolete (it's already specified by the folder name).
Not sure if I agree here. Currently, the structure is as follows:
- domains/direct_arylation/ contains only single-task convergence benchmarks
- domains/transfer_learning/direct_arylation/ contains transfer learning convergence benchmarks
- domains/regression/ contains transfer learning regression benchmarks
I could move the code to benchmarks/domains/transfer_learning/direct_arylation if you do not like the current structure, but I wouldn't move it into the single-task files. That would, however, mean that future regression benchmarks for other TL scenarios (like aryl halides) result in 6+ files per directory. Alternatively, we could create domains/regression/direct_arylation/ to mirror the transfer learning structure. Just let me know.
Seems we need to put this to vote.
Here is where I'm coming from: this expresses the mess quite well. That's of course not a problem caused by this PR, but at least it should not contribute further.
Points against this rather arbitrary structure:
- As I said in my first comment: a domain describes an application / measurement context. The flavors TL and Regression express how the search space is modelled and how something is judged -> not a topic of domain.
- Putting these flavors at the same height in the folder hierarchy is also bad logically; they do not describe variations of the same thing.
- Each added flavor results in more and more folders if there were a correct logical level for each domain and flavor (it multiplies as well) - that's not good.
Proposal
So if it were up to me, there would only exist domain/direct_arylation. That describes things truly belonging only to the domain topic.
How the files in there are structured is another topic. So what about files named after the flavors inside a single folder domain/direct_arylation? I don't mind having 10 or so files in there expressing different applications of the domain they belong to (expressed by the folder path).
For example, in domain/domain_name/ we'd have:
- convergence.py
- convergence_tl.py
- regression.py
- regression_tl.py
- __init__.py bringing all imports together
This would also get rid of the 100% obsolete filename prefix, which is already present in the folder path.
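To make the proposal concrete, here is a minimal sketch of what such an __init__.py could look like; the module and benchmark object names are placeholders, not the actual ones in this repository.

# domain/direct_arylation/__init__.py -- hypothetical sketch, names are placeholders
"""Direct arylation benchmarks, one module per flavor."""

from domain.direct_arylation.convergence import convergence_benchmark
from domain.direct_arylation.convergence_tl import convergence_tl_benchmark
from domain.direct_arylation.regression import regression_benchmark
from domain.direct_arylation.regression_tl import regression_tl_benchmark

__all__ = [
    "convergence_benchmark",
    "convergence_tl_benchmark",
    "regression_benchmark",
    "regression_tl_benchmark",
]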
Happy either way. Just keep in mind that for the aryl halides, we will end up with many files in the respective directory: convergence.py, regression.py, sou_CT_I_tar_BM_tl_conv.py, sou_CT_I_tar_BM_tl_regr.py, sou_CT_tar_IM_tl_conv.py, sou_CT_tar_IM_tl_regr.py, sout_IP_tar_CP_tl_conv.py, sout_IP_tar_CP_tl_regr.py, __init__.py. This is because we plan to have one TL regression benchmark for each TL convergence benchmark as a proxy. If you prefer this structure, just let me know. I don't have a strong opinion here and would prefer to merge soon.
Let me put that on our agenda for discussion within the team.
Done. @Scienfitz, @AVHopp, please check if you are happy with the new structure.
We still have one last minor inconsistency: domains/aryl_halides/base.py should be renamed to domains/aryl_halides/core.py, because it does not contain any inheritance or even class-based logic.
Other than that, I'm fine with the structure now :) thanks
Sorry, my bad. Renamed the file.
# Collect sampled subsets from each source task
source_subsets: list[pd.DataFrame] = []

for source_task in source_tasks:
What's the idea behind this?
Naively, I would have thought that you just do one sample over the task subset (i.e. the one made with .isin(source_tasks)).
But this logic here is different, can you elaborate?
Sure. I wanted to ensure that we sample the same fraction from each source task independently. If we only do one sample, the results could be very unbalanced. In the extreme case, we might sample mostly from task A and very little from task B. This could change the transfer learning scenario from what we intended. Instead of testing "many sources with little data per source", we might be testing "few sources with good knowledge per source".
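For illustration, a minimal sketch of this per-task sampling idea, assuming a pandas DataFrame with a "Task" column; the column, parameter, and function names are assumptions and not necessarily those used in the PR.

import pandas as pd

def sample_per_source_task(
    data: pd.DataFrame, source_tasks: list[str], fraction: float, seed: int = 0
) -> pd.DataFrame:
    """Sample the same fraction from each source task independently."""
    # Collect sampled subsets from each source task
    source_subsets: list[pd.DataFrame] = []
    for source_task in source_tasks:
        task_data = data[data["Task"] == source_task]
        source_subsets.append(task_data.sample(frac=fraction, random_state=seed))
    return pd.concat(source_subsets, ignore_index=True)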
Hmm, have we specified anywhere that these tests are only for such a specific scenario? Tbh I don't see the problem: if run with enough statistics, your result will simply show the distribution coming from both scenarios. We will be able to see the average result without hardcoding any characteristic of the TL sub-scenario.
If not - where are the benchmarks for "few sources with good knowledge per source"?
Currently, I have implemented this specific regression benchmark to align with the overall structure in this PR before expanding to additional benchmarks. My concern is that the performance of our transfer learning models may vary significantly based on whether they receive a large amount of data from a single source or limited data from multiple sources. This is particularly relevant for the hierarchical models I am testing with Karin, which learn a prior from the source. The quality of this prior can greatly influence performance on the target task.
In my view, distinguishing between the "many sources with low data" and "few sources with high data" scenarios is a standard practice in transfer learning. However, if you believe this distinction is unnecessary, I can also omit it, as it's not included in the TL convergence benchmarks. Just let me know.
This is still open. Any comments from your side, @Scienfitz ?
So I don't disagree that there are these two different flavors of scenarios.
What's not good, though, is that which scenario is tested is kind of hidden and hardcoded in this place in the code. There is no way of configuring it or even just "seeing" that these benchmarks actually only test that sub-scenario.
If you truly believe these two scenarios need to be looked at separately, at least include an option like stratified_source_sampling or similar to make it i) configurable and ii) visible.
That seems to me like a minimal fix we could both live with?
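A minimal sketch of what such an option could look like, with the two sampling behaviors side by side; the parameter, column, and function names here are hypothetical, not the ones ultimately used in the PR.

import pandas as pd

def sample_source_data(
    data: pd.DataFrame,
    source_tasks: list[str],
    fraction: float,
    stratified_source_sampling: bool = True,
    seed: int = 0,
) -> pd.DataFrame:
    """Draw source data either per task (stratified) or from the pooled source subset."""
    source_data = data[data["Task"].isin(source_tasks)]
    if stratified_source_sampling:
        # "many sources with little data per source": same fraction from each task
        return source_data.groupby("Task", group_keys=False).sample(
            frac=fraction, random_state=seed
        )
    # One pooled sample: task proportions may end up unbalanced
    return source_data.sample(frac=fraction, random_state=seed)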
Yes, that works for me. I implemented the new setting in this commit and set the default here. (Sorry for the two commits.) Let me know if you are happy with it.
Co-authored-by: Martin Fitzner <[email protected]>
…mple-tl-regression-benchmark
…mple-tl-regression-benchmark
Main loop and initial example for a TL regression benchmark
New run_tl_regression_benchmark function that compares vanilla GP models against transfer learning approaches across different source data fractions and training set sizes. It evaluates both naive baselines (GP on reduced/full search spaces without source data) and transfer learning models (with source data), using 8 regression metrics including Kendall's Tau and Spearman's Rho for model comparison.
Results are stored in the same format as ConvergenceBenchmark, with a separate scenario column and metric names like "root_mean_squared_error".
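For context, the ranking-based metrics mentioned above could be computed roughly as sketched below using scipy and numpy; the function and metric key names (other than "root_mean_squared_error") are illustrative and not necessarily what the benchmark code uses.

import numpy as np
from scipy.stats import kendalltau, spearmanr

def regression_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict[str, float]:
    """A few of the metrics used to compare model predictions on held-out target data."""
    rmse = float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))
    return {
        "root_mean_squared_error": rmse,
        "kendall_tau": float(kendalltau(y_true, y_pred)[0]),
        "spearman_rho": float(spearmanr(y_true, y_pred)[0]),
    }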
I've also included an example benchmark for direct arylation temperature transfer learning to demonstrate usage. It can be run with
python -m benchmarks --benchmark-list direct_arylation_temperature_tl_regr
(adapt the benchmark configuration settings like number of Monte Carlo iterations and training points for testing).