Feature/initial example tl regression benchmark #629
base: main
Conversation
Hi @kalama-ai, thanks for the PR. Here's a first high-level view. Once these comments are addressed, I'll go through the actual execution with the debugger and also review the logic of the functions defined toward the end of the module, which I have only skimmed over for now.
First round of comments - thanks already for the work :)
2fedc53 to 0001c8b (Compare)
Looks good :)
…reation to main loop to be able to reuse the function from existing benchmarks.
…lts from ConvergenceBenchmark.
…eation to main loop.
… to separate functions.
The previous implementation would add an additional task column for data passed to the TL models, which contradicted the original task column from the search space and led to no source data being used for training. The task parameter is now handled correctly by the search spaces passed to the models.
d389dc4 to ac3b95b (Compare)
This file is still misplaced in a domain/regression folder. The content should go into domain/direct_arylation/direct_arylation_*_batch or domain/direct_arylation/direct_arylation_regression.py.
While you're at it, I would probably rename the files in domain/direct_arylation, because the prefix direct_arylation in the file names is obsolete (it's already specified by the folder name).
Not sure if I agree here. Currently, the structure is as follows:
- domains/direct_arylation/ contains only single-task convergence benchmarks
- domains/transfer_learning/direct_arylation/ contains transfer learning convergence benchmarks
- domains/regression/ contains transfer learning regression benchmarks
I could move the code to benchmarks/domains/transfer_learning/direct_arylation if you do not like the current structure, but I wouldn't move it into the single-task files. That would, however, mean that future regression benchmarks for other TL scenarios (like aryl halides) result in 6+ files per directory. Alternatively, we could create domains/regression/direct_arylation/ to mirror the transfer learning structure. Just let me know.
Seems we need to put this to vote.
Here is where I'm coming from: this expresses the mess quite well. That's of course not a problem caused by this PR, but at least it should not contribute further.
Points against this rather arbitrary structure:
- As I said in my first comment: a domain describes an application / measurement context. The flavors TL and Regression express how the search space is modelled and how something is judged -> not a topic of domain.
- Putting these flavors at the same height in the folder hierarchy is also bad logically; they do not describe variations of the same thing.
- Each added flavor results in more and more folders if there were a correct logical level for each domain and flavor (it multiplies as well) - that's not good.
Proposal
So if it were up to me, there would only exist domain/direct_arylation. That describes things truly belonging only to the domain topic.
How the files in there are structured is another topic. So what about files named after the flavors inside a single folder domain/direct_arylation? I don't mind having 10 or so files in there expressing different applications of the domain they belong to (expressed by the folder path).
For example, in domain/domain_name/ we'd have:
- convergence.py
- convergence_tl.py
- regression.py
- regression_tl.py
- __init__.py bringing all imports together
This would also get rid of the 100% obsolete filename prefix, which is already present in the folder path.
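To make the proposal concrete, here is a minimal sketch of what such an __init__.py could look like; the module and benchmark object names are placeholders, not the actual ones in this repository.

# domain/direct_arylation/__init__.py -- hypothetical sketch, names are placeholders
"""Direct arylation benchmarks, one module per flavor."""

from domain.direct_arylation.convergence import convergence_benchmark
from domain.direct_arylation.convergence_tl import convergence_tl_benchmark
from domain.direct_arylation.regression import regression_benchmark
from domain.direct_arylation.regression_tl import regression_tl_benchmark

__all__ = [
    "convergence_benchmark",
    "convergence_tl_benchmark",
    "regression_benchmark",
    "regression_tl_benchmark",
]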
Happy either way. Just keep in mind that for the aryl halides, we will end up with many files in the respective directory: convergence.py, regression.py, sou_CT_I_tar_BM_tl_conv.py, sou_CT_I_tar_BM_tl_regr.py, sou_CT_tar_IM_tl_conv.py, sou_CT_tar_IM_tl_regr.py, sout_IP_tar_CP_tl_conv.py, sout_IP_tar_CP_tl_regr.py, __init__.py. This is because we plan to have one TL regression benchmark for each TL convergence benchmark as a proxy. If you prefer this structure, just let me know. I don't have a strong opinion here and would prefer to merge soon.
Let me put that on our agenda for discussion within the team.
Done. @Scienfitz, @AVHopp, please check if you are happy with the new structure.
We still have one last minor inconsistency: domains/aryl_halides/base.py should be renamed to domains/aryl_halides/core.py, because it does not contain any inheritance or even class-based logic.
Other than that, I'm fine with the structure now :) thanks
Sorry, my bad. Renamed the file.
# Collect sampled subsets from each source task
source_subsets: list[pd.DataFrame] = []

for source_task in source_tasks:
What's the idea behind this?
Naively, I would have thought that you just do one sample over the task subset (i.e. the one made with .isin(source_tasks)).
But this logic here is different, can you elaborate?
Sure. I wanted to ensure that we sample the same fraction from each source task independently. If we only do one sample, the results could be very unbalanced. In the extreme case, we might sample mostly from task A and very little from task B. This could change the transfer learning scenario from what we intended. Instead of testing "many sources with little data per source", we might be testing "few sources with good knowledge per source".
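For illustration, a minimal sketch of this per-task sampling idea, assuming a pandas DataFrame with a "Task" column; the column, parameter, and function names are assumptions and not necessarily those used in the PR.

import pandas as pd

def sample_per_source_task(
    data: pd.DataFrame, source_tasks: list[str], fraction: float, seed: int = 0
) -> pd.DataFrame:
    """Sample the same fraction from each source task independently."""
    # Collect sampled subsets from each source task
    source_subsets: list[pd.DataFrame] = []
    for source_task in source_tasks:
        task_data = data[data["Task"] == source_task]
        source_subsets.append(task_data.sample(frac=fraction, random_state=seed))
    return pd.concat(source_subsets, ignore_index=True)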
Hmm, have we specified anywhere that these tests are only for such a specific scenario? Tbh I don't see the problem: if run with enough statistics, your result will simply show the distribution coming from both scenarios. We will be able to see the average result without hardcoding any characteristic of the TL sub-scenario.
If not - where are the benchmarks for "few sources with good knowledge per source"?
Currently, I have implemented this specific regression benchmark to align with the overall structure in this PR before expanding to additional benchmarks. My concern is that the performance of our transfer learning models may vary significantly based on whether they receive a large amount of data from a single source or limited data from multiple sources. This is particularly relevant for the hierarchical models I am testing with Karin, which learn a prior from the source. The quality of this prior can greatly influence performance on the target task.
In my view, distinguishing between the "many sources with low data" and "few sources with high data" scenarios is a standard practice in transfer learning. However, if you believe this distinction is unnecessary, I can also omit it, as it's not included in the TL convergence benchmarks. Just let me know.
This is still open. Any comments from your side, @Scienfitz ?
So I don't disagree that there are these two different flavors of scenarios.
What's not good, though, is that which scenario is tested is kind of hidden and hardcoded in this place in the code. There is no way of configuring it or even just "seeing" that these benchmarks actually only test that sub-scenario.
If you truly believe these two scenarios need to be looked at separately, at least include an option like stratified_source_sampling or similar to make it i) configurable and ii) visible.
That seems to me like a minimal fix we could both live with?
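A minimal sketch of what such an option could look like, with the two sampling behaviors side by side; the parameter, column, and function names here are hypothetical, not the ones ultimately used in the PR.

import pandas as pd

def sample_source_data(
    data: pd.DataFrame,
    source_tasks: list[str],
    fraction: float,
    stratified_source_sampling: bool = True,
    seed: int = 0,
) -> pd.DataFrame:
    """Draw source data either per task (stratified) or from the pooled source subset."""
    source_data = data[data["Task"].isin(source_tasks)]
    if stratified_source_sampling:
        # "many sources with little data per source": same fraction from each task
        return source_data.groupby("Task", group_keys=False).sample(
            frac=fraction, random_state=seed
        )
    # One pooled sample: task proportions may end up unbalanced
    return source_data.sample(frac=fraction, random_state=seed)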
Yes, that works for me. I implemented the new setting in this commit and set the default here. (Sorry for the two commits.) Let me know if you are happy with it.
Co-authored-by: Martin Fitzner <[email protected]>
…mple-tl-regression-benchmark
…mple-tl-regression-benchmark
Main loop and initial example for a TL regression benchmark
New run_tl_regression_benchmark function that compares vanilla GP models against transfer learning approaches across different source data fractions and training set sizes. It evaluates both naive baselines (GP on reduced/full search spaces without source data) and transfer learning models (with source data), using 8 regression metrics including Kendall's Tau and Spearman's Rho for model comparison.
Results are stored in the same format as ConvergenceBenchmark, with a separate scenario column and metric names like "root_mean_squared_error".
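For context, the ranking-based metrics mentioned above could be computed roughly as sketched below using scipy and numpy; the function and metric key names (other than "root_mean_squared_error") are illustrative and not necessarily what the benchmark code uses.

import numpy as np
from scipy.stats import kendalltau, spearmanr

def regression_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict[str, float]:
    """A few of the metrics used to compare model predictions on held-out target data."""
    rmse = float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))
    return {
        "root_mean_squared_error": rmse,
        "kendall_tau": float(kendalltau(y_true, y_pred)[0]),
        "spearman_rho": float(spearmanr(y_true, y_pred)[0]),
    }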
I've also included an example benchmark for direct arylation temperature transfer learning to demonstrate usage. It can be run with
python -m benchmarks --benchmark-list direct_arylation_temperature_tl_regr
(adapt the benchmark configuration settings like number of Monte Carlo iterations and training points for testing).