Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RVecs to root dictionaries #750

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

m-fila
Copy link
Contributor

@m-fila m-fila commented Mar 11, 2025

BEGINRELEASENOTES

  • Add definitions of ROOT::VecOps::RVec to podio dictionary and dictionary for better integration with RDataFrame.

ENDRELEASENOTES

Podio's dictionary and generated datamodels' dictionaries include definitions of std::vector of objects. For better interoperability with RDataFrame definitions with ROOT::VecOps::RVec should be added (see key4hep/EDM4hep#416).

Here I propose to add ROOT::VecOps::RVec for each std::vector in a selection.xml. I'm not sure if the same treatment should apply to definitions like <class name="std::vector<std::tuple<int, std::string, bool, unsigned int>>"/> and <class name="std::vector<std::vector<std::string>>"/> that are in podio's selection.xml.

-Wno-effc++ is added to silence the errors coming from ROOT/RVec.hxx.Weffc++ is currently in the CMAKE_CXX_FLAGS

Closes key4hep/EDM4hep#416

  • Add test

@Zehvogel

@tmadlener
Copy link
Collaborator

I think we should also try to understand what is actually necessary from the ROOT perspective, and which parts might get auto-generated there.

@m-fila
Copy link
Contributor Author

m-fila commented Mar 11, 2025

In case of key4hep/EDM4hep#416 it's in the error message

Error in <TTree::Branch>: The class requested (ROOT::VecOps::RVec<podio::ObjectID>)
for the branch "_BCalRecoParticle_clusters" is an instance of an stl collection 
and does not have a compiled CollectionProxy. 
Please generate the dictionary for this collection (ROOT::VecOps::RVec<podio::ObjectID>)
to avoid to write corrupted data.

The LinkDef for dictionary with RVecs that comes with ROOT is here:
https://github.com/root-project/root/blob/v6-32-04/math/vecops/inc/LinkDef.h

@Zehvogel
Copy link
Contributor

Thanks! I think we still would want to figure out why it currently works for e.g. RVec<edm4hep::ReconstructedParticleData> before closing key4hep/EDM4hep#416

@m-fila
Copy link
Contributor Author

m-fila commented Mar 11, 2025

Well, I guess someone is adding them to some dictionary somewhere for us. For instance here in FCCAnalyses:
https://github.com/HEP-FCC/FCCAnalyses/blob/eef66b3426237a80ea54902a5afa429c2afd5a72/analyzers/dataframe/FCCAnalyses/LinkDef.h#L20-L30

@m-fila
Copy link
Contributor Author

m-fila commented Mar 11, 2025

Well, I guess someone is adding them to some dictionary somewhere for us. For instance here in FCCAnalyses: https://github.com/HEP-FCC/FCCAnalyses/blob/eef66b3426237a80ea54902a5afa429c2afd5a72/analyzers/dataframe/FCCAnalyses/LinkDef.h#L20-L30

yup, after removing fccanalyses from the LD_LIBRARY_PATH snapshottingedm4hep::ReconstructedParticleData is broken again as expected:

>>> import ROOT
>>> df = ROOT.RDataFrame("events", "https://lreichen.webtest.cern.ch/minidst.edm4hep.root")
TClass::Init:0: RuntimeWarning: no dictionary for class edm4hep::Vector2i is available
>>> df.Range(10).Snapshot("events", "snapshot-test.root", ["BCalRecoParticle"])
Error in <TTree::Branch>: The class requested (ROOT::VecOps::RVec<edm4hep::ReconstructedParticleData>) for the branch "BCalRecoParticle" is an instance of an stl collection and does not have a compiled CollectionProxy. Please generate the dictionary for this collection (ROOT::VecOps::RVec<edm4hep::ReconstructedParticleData>) to avoid to write corrupted data.
RDataFrame::Run: event loop was interrupted
Error in <TTree::Branch>: The class requested (ROOT::VecOps::RVec<edm4hep::ReconstructedParticleData>) for the branch "BCalRecoParticle" is an instance of an stl collection and does not have a compiled CollectionProxy. Please generate the dictionary for this collection (ROOT::VecOps::RVec<edm4hep::ReconstructedParticleData>) to avoid to write corrupted data.
RDataFrame::Run: event loop was interrupted
Error in <TTree::Branch>: The class requested (ROOT::VecOps::RVec<edm4hep::ReconstructedParticleData>) for the branch "BCalRecoParticle" is an instance of an stl collection and does not have a compiled CollectionProxy. Please generate the dictionary for this collection (ROOT::VecOps::RVec<edm4hep::ReconstructedParticleData>) to avoid to write corrupted data.
RDataFrame::Run: event loop was interrupted
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Template method resolution failed:
  none of the 3 overloaded methods succeeded. Full details:
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RRange<ROOT::Detail::RDF::RLoopManager>,void>::Snapshot(string_view treename, string_view filename, initializer_list<string> columnList, const ROOT::RDF::RSnapshotOptions& options = RSnapshotOptions()) =>
    logic_error: Trying to insert a null branch address.
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RRange<ROOT::Detail::RDF::RLoopManager>,void>::Snapshot(string_view treename, string_view filename, const ROOT::RDF::ColumnNames_t& columnList, const ROOT::RDF::RSnapshotOptions& options = RSnapshotOptions()) =>
    logic_error: Trying to insert a null branch address.
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RRange<ROOT::Detail::RDF::RLoopManager>,void>::Snapshot(string_view treename, string_view filename, string_view columnNameRegexp = "", const ROOT::RDF::RSnapshotOptions& options = RSnapshotOptions()) =>
    TypeError: could not convert argument 3
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RRange<ROOT::Detail::RDF::RLoopManager>,void>::Snapshot(string_view treename, string_view filename, initializer_list<string> columnList, const ROOT::RDF::RSnapshotOptions& options = RSnapshotOptions()) =>
    logic_error: Trying to insert a null branch address.
>>> 

@tmadlener
Copy link
Collaborator

That would explain things indeed. In that case, we could consider putting RVec "overloads" into the dictionary. However, that would effectively mean that every change to an EDM generated by podio could be breaking even if our schema evolution could handle it, because technically we don't make guarantees about the file layout:

https://key4hep.web.cern.ch/podio/storage_details.html#storage-details-of-files-produced-by-podio

@Zehvogel
Copy link
Contributor

That would explain things indeed. In that case, we could consider putting RVec "overloads" into the dictionary. However, that would effectively mean that every change to an EDM generated by podio could be breaking even if our schema evolution could handle it, because technically we don't make guarantees about the file layout:

https://key4hep.web.cern.ch/podio/storage_details.html#storage-details-of-files-produced-by-podio

I do not understand where things would break. Can you elaborate?

However, I would say that adding additional support for RDataFrame to podio out of the blue is contradicting the rest of its philosophy a bit. There is not really any reason to directly support the RVecs here. Maybe the cleanest short-term solution is to just put the RVec support into the fccanalyses dict, but in a complete and less manual way? And then also document this and point to it from edm4hep?

@tmadlener
Copy link
Collaborator

I do not understand where things would break. Can you elaborate?

Currently nowhere. But if there is a change in, e.g. EDM4hep that requires schema evolution (that cannot be handled by ROOT), things will break for users, because they will no longer be able to read their data via RDataFrame (aka via RVec). For users going through a podio reader with schema evolution this can be made transparent (apart from API changes, but those can also be dealt with more gracefully with proper deprecations).

I tend to agree here, that we should discuss a bit more whether we want this option. That might be a different discussion for podio and EDM4hep, as simply putting support into podio for generating this might be OK, but then EDM4hep might need some sort of policy or at least a statement about the stability of going to the file directly.

@Zehvogel
Copy link
Contributor

things will break for users, because they will no longer be able to read their data via RDataFrame (aka via RVec). For users going through a podio reader with schema evolution this can be made transparent

Isn't this already the case?

@m-fila
Copy link
Contributor Author

m-fila commented Mar 11, 2025

Isn't this already the case?

As long as you can open files without podio readers, yes

I tend to agree here, that we should discuss a bit more whether we want this option. That might be a different discussion for podio and EDM4hep, as simply putting support into podio for generating this might be OK, but then EDM4hep might need some sort of policy or at least a statement about the stability of going to the file directly.

I agree we should definitely discuss this since it currently basically improve support for unsupported feature. So I would vote no 😇
But then I think the 3rd party packages enabling it by polluting global space with dictionaries for the types they don't own are even worse

I guess this should be a part of a bigger discussion on analysis. Because @Zehvogel why would you even try to abuse RDataFrame like that? 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Some columns are not storable using RDataFrame::Snapshot
3 participants