U/jrbogart/truth parquet reader #455
@yymao Thanks for being willing to take a look! I hope the summary below makes your job a little easier
Latest commit fixed the native filter query problem.
Eliminate debug output from truth parquet reader
Add catalog for star lc stats truth
Rename truth catalogs to include full run designation
@JoanneBogart Thank you for the draft PR and for your patience! I haven't done a detailed review but have two high-level comments: First, I see what you are saying about making Of course I am also sensitive to our limited time/resource. An alternative suggestion is to make the The I haven't looked into the native filter issue, but will do. More later.
@yymao Thanks for your comments! I know you're heavily over-subscribed and I really appreciate that you found some time for this.
Thank you @JoanneBogart! We do have a Glad to hear the native filter issue is resolved!
@yymao I've moved ParquetFileWrapper to parquet.py and renamed the class ParsePathInfo --> PathInfoParser
My feeling is that the ParquetFileWrapper items might be appropriate to do now - the changes involved are minimal - but PathInfoParser should wait until we've defined more or less what the new keyword structure for configs will be.
@yymao The ParquetDataset feature does look interesting. It probably is faster and less memory intensive to let pyarrow do the cuts when possible, but it's not as general as GCRQuery. Is it adequate for what people typically do? Could it be invoked for those cases it covers (which would mean recognizing them) and fall back to current code for the rest?
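The "recognize the covered cases, fall back for the rest" idea could look roughly like the sketch below. It is only an illustration under stated assumptions: `split_filters` and `apply_fallback` are hypothetical names, not part of GCRCatalogs or GCRQuery, and the `tract` value is made up. The one fact it relies on is that the `filters` argument of `pyarrow.parquet.read_table` accepts `(column, op, value)` tuples, so anything in that shape can be handed to pyarrow while everything else stays in Python.

```python
# Hypothetical sketch: none of these names come from GCRCatalogs or GCRQuery.
# Comparison operators that pyarrow's tuple-style `filters` argument understands.
SIMPLE_OPS = {"==", "!=", "<", "<=", ">", ">=", "in", "not in"}


def split_filters(filters):
    """Partition filters into pyarrow-style tuples and everything else."""
    pushdown, fallback = [], []
    for f in filters:
        is_simple = (
            isinstance(f, tuple)
            and len(f) == 3
            and isinstance(f[0], str)
            and f[1] in SIMPLE_OPS
        )
        (pushdown if is_simple else fallback).append(f)
    return pushdown, fallback


def apply_fallback(rows, fallback):
    """Apply the remaining filters (callables taking a row dict) in Python."""
    return [row for row in rows if all(f(row) for f in fallback)]


# Illustrative input: one simple cut and one arbitrary expression.
filters = [("tract", "==", 4850), lambda row: row["ra"] + row["dec"] > 0]
pushdown, fallback = split_filters(filters)
```

Here `pushdown` is what could be passed on to pyarrow, and `fallback` is what the existing, more general code path would still have to evaluate after the data are read.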
@JoanneBogart -- thanks for the update and patience! I am still a bit swamped now, but will get back to this later this week!
I added some comments on PathInfoParser
@JoanneBogart thanks for waiting. One high-level question: do you envision that a
I wonder if it makes sense to have Re: your potential to-do list, here are my thoughts:
Suggestions gratefully accepted. Thanks, @yymao
Co-authored-by: Yao-Yuan Mao <[email protected]>
@yymao Yes, I think it would be better to restructure Parquet..Wrapper as you suggest. And I will move PathInfoParser out of the reader - that way it's easily available to new readers - but I also think modifying existing readers to use it can wait for another PR.
Actually, I'm not sure it's necessary to do this. ParquetFileWrapper is intended to represent a file, not a row group. _iter_native_dataset in my reader has an inner loop over row groups
This seems to be doing the right thing without creating multiple instances of the class per file in case there is > 1 row group. However, it's not great that the reader has to include code that maybe more properly belongs in a parquet utility class. For that reason it might be better to reorganize.
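As a rough illustration of moving that inner loop out of the reader, a utility class could own the iteration over row groups while staying a single instance per file. This is a sketch under assumptions, not the code in parquet.py: `RowGroupReader` and `iter_row_groups` are hypothetical names, and `FakeParquetFile` is an in-memory stand-in so the example runs without a real file. The wrapped handle only needs `num_row_groups` and `read_row_group(index, columns=...)`, which `pyarrow.parquet.ParquetFile` provides.

```python
class RowGroupReader:
    """Hypothetical wrapper: one instance per file, serving every row group."""

    def __init__(self, handle):
        # `handle` must provide `num_row_groups` and
        # `read_row_group(index, columns=...)`, as pyarrow.parquet.ParquetFile does.
        self._handle = handle

    @property
    def num_row_groups(self):
        return self._handle.num_row_groups

    def iter_row_groups(self, columns):
        """Yield (index, chunk) for each row group, reusing this instance."""
        for index in range(self.num_row_groups):
            yield index, self._handle.read_row_group(index, columns=columns)


class FakeParquetFile:
    """In-memory stand-in (an assumption for the demo, not a pyarrow class)."""

    def __init__(self, row_groups):
        self._row_groups = row_groups
        self.num_row_groups = len(row_groups)

    def read_row_group(self, index, columns=None):
        group = self._row_groups[index]
        return {name: group[name] for name in (columns or group)}


# Two row groups in one "file", read through a single wrapper instance.
reader = RowGroupReader(
    FakeParquetFile(
        [
            {"ra": [1.0, 2.0], "dec": [3.0, 4.0]},
            {"ra": [5.0], "dec": [6.0]},
        ]
    )
)
chunks = list(reader.iter_row_groups(["ra"]))
```

With this shape, a reader's `_iter_native_dataset` would loop over files and delegate the per-row-group loop to the wrapper.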
@JoanneBogart Ah, I see. OK that makes sense but maybe we can be more explicit about this behavior, for example, should we use Also, should the method Or something along these lines to make it clear that one doesn't need a new instance for a different
I need to document usage in parquet.py. Is there a use case for calling read_columns_row_group other than from within _obtain_native_data_dict, where it could be useful to specify a particular row group?
@yymao should native filter quantities also be selectable columns (i.e., included in native quantities)? For object catalog I can ask for 'tract', but I don't see where this is coming from - whether a column 'tract' is added to the dataframe or if it's discovered some other way.
@JoanneBogart re: But my high-level comment is just that the attribute and method name should be clear that all row groups should be accessed with just one instance, and this is not so clear right now. In particular, setting
Ok; that's what I was beginning to suspect.
Change property name to current_row_group
Eliminate unneeded code
@JoanneBogart sorry that I didn't notice you have marked this pull request as ready for review! I'll review it by Monday.
Thanks @JoanneBogart. I've added some minor suggestions/comments, and a question about backward compatibility.
GCRCatalogs/catalog_configs/dc2_truth_run2.2i_galaxy_truth_summary.yaml
Commit reviewer's suggested changes
Co-authored-by: Yao-Yuan Mao <[email protected]>
@yymao thanks for the suggested changes, which I've committed. I changed DC2DMCatalog to use I don't have strong feelings about it. If someone has code which is depending on the data coming back a full tract (or healpix or whatever) at a time, there could indeed be a compatibility issue. Do you think there is a chance of this? If so, it certainly should be changed back to read_columns.
@JoanneBogart Thanks. I did suggest that we switch to the new parquet class you wrote in The chunks that the iterator returns should indeed stay the same, because many add-on catalogs depend on that (they are split into the same chunks so that the composite catalog can be efficient). That said, I am guessing that all the parquet files we currently use have a single row group, but I haven't verified that. I'll be comfortable using
I think I had better change it back to
…_columns_row_group
Thanks, @JoanneBogart! Approving now.
Add a reader for simple catalogs persisted as parquet files, such as those for truth tables
fix #445