Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

U/jrbogart/truth parquet reader #455

Merged
merged 15 commits into from
Jun 29, 2020
Merged

Conversation

JoanneBogart
Copy link
Contributor

Add a reader for simple catalogs persisted as parquet files, such as those for truth tables
fix #445

@JoanneBogart JoanneBogart added the reader Issues related to reader modules label Jun 3, 2020
@JoanneBogart
Copy link
Contributor Author

JoanneBogart commented Jun 3, 2020

@yymao Thanks for being willing to take a look! I hope the summary below makes your job a little easier

  • As far as a I know the reader works satisfactorily for simple catalogs with no native filters (truth summary and variability for each of sn and star).
  • I made galaxy truth summary catalogs for three healpixels so far (out of 131 total; it is straightforward to make the rest) so I could test a catalog with a native filter. So far I can load the catalog and get data, but queries with a condition on a native filter fail. I could be making the query wrong, but more likely there is a bug in the reader.
  • the class used for parsing the file pattern and finding native filters is general. The pattern as it appears in the config file may be a totally standard regular expressions, using named groups for native filter names, or it will accept a shorthand {healpix} or, more generally, {some_name} and translate to standard re notation with a named group. Do we want to use this in other places? And if so, does the translation feature seem helpful or just confusing?
  • I would like to (also) separate out some version of ParquetFileWrapper so it can be used by different readers.
  • The last two points likely have some bearing on issue Establish a standardized catalog config specification and corresponding validation tests #448

@JoanneBogart
Copy link
Contributor Author

Latest commit fixed the native filter query problem.

Eliminate debug output from truth parquet reader
Add catalog for star lc stats truth
Rename truth catalogs to include full run designation
@yymao
Copy link
Member

yymao commented Jun 7, 2020

@JoanneBogart Thank you for the draft PR and for your patience! I haven't done a detailed review but have two high-level comments:

First, I see what you are saying about making ParquetFileWrapper a class in some utility model. I agree with you, and wonder if we should just do it now. My gut feeling is that the incentive of doing so would decrease with time, and we will end up having several slightly different ParquetFileWrapper classes.

Of course I am also sensitive to our limited time/resource. An alternative suggestion is to make the ParquetFileWrapper a subclass of dc2_dm_catalog.ParquetFileWrapper. This should be straightforward and it will gives future devs a hint that, if they need another similar class, they should either make another subclass, or merge them into a utility class.

The ParsePathInfo class is a clever approach (but can we call it PathInfoParser so that it's a noun?). Yesterday I was actually just reading the ParquetDataset feature in pyarrow and wonder if there's something we can borrow from?

I haven't looked into the native filter issue, but will do. More later.

@JoanneBogart
Copy link
Contributor Author

@yymao Thanks for your comments! I know you're heavily over-subscribed and I really appreciate that you found some time for this.
I prefer the first approach (separate out ParquetFileWrapper into a utility module); I can start work on that. Any suggestions on naming, overall structure, etc.?
I'll change the name to PathInfoParser and look up your reference; thanks.
Since my last commit the native filter has been working properly. The problem was a missing conversion of a string representation of the value to int.

@yymao
Copy link
Member

yymao commented Jun 8, 2020

Thank you @JoanneBogart! We do have a utils.py module but it doesn't feel right to put ParquetFileWrapper there. Maybe it could just be parquet.py? If in the future we have more common classes, we probably won't put them all into the same module? Other ideas are welcome.

Glad to hear the native filter issue is resolved!

@JoanneBogart
Copy link
Contributor Author

JoanneBogart commented Jun 11, 2020

@yymao I've moved ParquetFileWrapper to parquet.py and renamed the class ParsePathInfo --> PathInfoParser
Should I go any further in this PR? Possible steps to consider are

  • modify dc2_dm_catalog to use the new version of ParquetFileWrapper and delete the internal one
  • Assuming previous step has been done, use read_columns_row_group rather than read_columns (only matters if individual parquet files are large enough to merit > 1 row group)
  • Extract PathInfoParser into a utility module (could be utils.py or something else)
  • Modify some existing readers to use it

My feeling is that the ParquetFileWrapper items might be appropriate to do now - the changes involved are minimal - but PathInfoParser should wait until we've defined more or less what the new keyword structure for configs will be.

@JoanneBogart
Copy link
Contributor Author

@yymao The ParquetDataset feature does look interesting. It probably is faster and less memory intensive to let pyarrow do the cuts when possible, but it's not as general as GCRQuery. Is it adequate for what people typically do? Could it be invoked for those cases it covers (which would mean recognizing them) and fall back to current code for the rest?

@yymao
Copy link
Member

yymao commented Jun 16, 2020

@JoanneBogart -- thanks for the update and patience! I am still a bit swamped now, but will get back to this later this week!

Copy link
Member

@yymao yymao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments on PathInfoParser

GCRCatalogs/dc2_truth_parquet.py Outdated Show resolved Hide resolved
GCRCatalogs/dc2_truth_parquet.py Outdated Show resolved Hide resolved
GCRCatalogs/dc2_truth_parquet.py Outdated Show resolved Hide resolved
@yymao
Copy link
Member

yymao commented Jun 23, 2020

@JoanneBogart thanks for waiting. One high-level question: do you envision that a ParquetFileWrapper instance represent only one row group? If so, I have two concerns:

  1. A minor concern is that the name ParquetFileWrapper would be a big misleading.
  2. A more important on is that this means that if a file has multiple row groups, it will then be open multiple times (each ParquetFileWrapper will has its only self._handle).

I wonder if it makes sense to have ParquetRowGroupWrapper and it takes a ParquetFileWrapper instance as input?

Re: your potential to-do list, here's my thoughts:

  • modify dc2_dm_catalog to use the new version of ParquetFileWrapper and delete the internal one ✔️ I think we should do this in this PR.
  • Assuming previous step has been done, use read_columns_row_group rather than read_columns (only matters if individual parquet files are large enough to merit > 1 row group) ❓ this is related to my high-level question.
  • Extract PathInfoParser into a utility module (could be utils.py or something else) ✅ that's fine. I don't have strong opinion.
  • Modify some existing readers to use it ⌛ fine by me, but it doesn't seem a high priority.

Suggestions gratefully accepted. Thanks, @yyao

Co-authored-by: Yao-Yuan Mao <[email protected]>
@JoanneBogart
Copy link
Contributor Author

@yymao Yes, I think it would be better to restructure Parquet..Wrapper as you suggest. And I will move PathInfoParser out of the reader - that way it's easily available to new readers - but I also think modifying existing readers to use it can wait for another PR.

@JoanneBogart
Copy link
Contributor Author

JoanneBogart commented Jun 23, 2020

One high-level question: do you envision that a ParquetFileWrapper instance represent only one row group? If so, I have two concerns:

A minor concern is that the name ParquetFileWrapper would be a big misleading.
A more important on is that this means that if a file has multiple row groups, it will then be open multiple times (each ParquetFileWrapper will has its only self._handle).

Actually, I'm not sure it's necessary to do this. ParquetFileWrapper is intended to represent a file, not a row group. _iter_native_dataset in my reader has an inner loop over row groups

               for i in range(dataset.num_row_groups):
                      dataset.row_group = i
                      yield dataset

This seems to be doing the right thing without creating multiple instances of the class per file in case there is > 1 row group. However, it's not great that the reader has to include code that maybe more properly belongs in a parquet utility class. For that reason it might be better to reorganize.

@yymao
Copy link
Member

yymao commented Jun 23, 2020

@JoanneBogart Ah, I see. OK that makes sense but maybe we can be more explicitly about this behavior, for example, should we use dataset.current_row_group instead of dataset.row_group?

Also, should the method read_columns_row_group has a row_group argument which is default to None (means reading current_row_group)?

Or something along these lines to make it clear that one doesn't need a new instance for a different row_group.

@JoanneBogart
Copy link
Contributor Author

I need to document usage in parquet.py.

Is there is a use case for calling read_columns_row_group other than from within _obtain_native_data_dict, where it could be useful to specify a particular row group?

@JoanneBogart
Copy link
Contributor Author

@yymao should native filter quantities also be selectable columns (i.e., included in native quantities)? For object catalog I can ask for 'tract', but I don't see where this is coming from - whether a column 'tract' is added to the dataframe or if it's discovered some other way.

@yymao
Copy link
Member

yymao commented Jun 24, 2020

@JoanneBogart re: read_columns_row_group -- it's not clear to me if this method will be used for user supplied row group. My main thinking there is that our wrapper api should be similar to the original api, in which case one can supply row_group argument.

But my high-level comment is just that the attribute and method name should be clear that all row groups should be accessed with just one instance, and this is not so clear right now. In particular, setting row_group attribute before each call of read_columns_row_group is not very intuitive, and on my first read I have thought that each instance can only represent a single row group.

@yymao
Copy link
Member

yymao commented Jun 24, 2020

tract is a column in the parquet file. But native filters do not need to filter on column values ("filters" take care of that). Native filters only need to filter on chunks.

@JoanneBogart
Copy link
Contributor Author

JoanneBogart commented Jun 24, 2020

tract is a column in the parquet file.

Ok; that's what I was beginning to suspect.
As things stand now with galaxy truth, healpix is a native filter, but one can't select it (for a query spanning several healpixels). Maybe that's ok. Otherwise I guess I have to add it as a column in the files or add it on the fly when the data are read in.

Change property name to courrent_row_group
Eliminate unneeded code
@JoanneBogart JoanneBogart marked this pull request as ready for review June 24, 2020 23:59
@yymao
Copy link
Member

yymao commented Jun 26, 2020

@JoanneBogart sorry that I didn't notice you have marked this pull request as ready for review! I'll review it by Monday.

@yymao yymao self-requested a review June 26, 2020 23:33
Copy link
Member

@yymao yymao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @JoanneBogart. I've added some minor suggestions/comments, and a question about backward compatibility.

GCRCatalogs/parquet.py Outdated Show resolved Hide resolved
GCRCatalogs/parquet.py Outdated Show resolved Hide resolved
GCRCatalogs/parquet.py Outdated Show resolved Hide resolved
GCRCatalogs/dc2_dm_catalog.py Outdated Show resolved Hide resolved
GCRCatalogs/dc2_dm_catalog.py Outdated Show resolved Hide resolved
Commit reviewer's suggested changes

Co-authored-by: Yao-Yuan Mao <[email protected]>
@JoanneBogart
Copy link
Contributor Author

@yymao thanks for the suggested changes, which I've committed.

I changed DC2DMCatalog to use read_columns_row_group because I thought you had suggested it earlier; I now see a missed an important part of the "suggestion": a question mark.

I don't have strong feelings about it. If someone has code which is depending on the data coming back a full tract (or healpix or whatever) at a time, there could indeed be a compatibility issue. Do you think there is a chance of this? If so, it certainly should be changed back to read_columns.

@yymao
Copy link
Member

yymao commented Jun 28, 2020

@JoanneBogart Thanks. I did suggest that we switch to the new parquet class you wrote in DC2DMCatalog, but was thinking that we'll the behavior the intact.

The chunks that the iterator returns should indeed stay the same, because many add-on catalogs depends on that (they are split into same chunks so that the composite catalog can be efficient).

That said, I am guessing all the current parquet files that we are using all have single row group, but I haven't verified that. I'll be comfortable using read_columns_row_group if that is indeed the case, otherwise we should use read_columns just to be safe (and add a comment to note this choice).

@JoanneBogart
Copy link
Contributor Author

I think I had better change it back to read_columns, even if all parquet files now used in composite catalogs only do have one row group.
It would be nice if, in addition to the data, the iterator could also return a little dict of metadata: native filter names and their values and another entry for row group. Or, rather than change the iterator interace (likely too disruptive), provide a separate function returning a list of these dicts which can be matched up with the chunks of data. But I'm certainly not proposing we do either of these now.

Copy link
Member

@yymao yymao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @JoanneBogart! Approving now.

@JoanneBogart JoanneBogart merged commit b58f99b into master Jun 29, 2020
@JoanneBogart JoanneBogart deleted the u/jrbogart/truth-parquet-reader branch June 29, 2020 01:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reader Issues related to reader modules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make reader for star truth parquet files
2 participants