U/jrbogart/truth parquet reader #455

JoanneBogart · 2020-06-03T00:04:25Z

Add a reader for simple catalogs persisted as parquet files, such as those for truth tables
fix #445

JoanneBogart · 2020-06-03T17:24:57Z

@yymao Thanks for being willing to take a look! I hope the summary below makes your job a little easier

As far as a I know the reader works satisfactorily for simple catalogs with no native filters (truth summary and variability for each of sn and star).
I made galaxy truth summary catalogs for three healpixels so far (out of 131 total; it is straightforward to make the rest) so I could test a catalog with a native filter. So far I can load the catalog and get data, but queries with a condition on a native filter fail. I could be making the query wrong, but more likely there is a bug in the reader.
the class used for parsing the file pattern and finding native filters is general. The pattern as it appears in the config file may be a totally standard regular expressions, using named groups for native filter names, or it will accept a shorthand {healpix} or, more generally, {some_name} and translate to standard re notation with a named group. Do we want to use this in other places? And if so, does the translation feature seem helpful or just confusing?
I would like to (also) separate out some version of ParquetFileWrapper so it can be used by different readers.
The last two points likely have some bearing on issue Establish a standardized catalog config specification and corresponding validation tests #448

JoanneBogart · 2020-06-04T01:00:29Z

Latest commit fixed the native filter query problem.

Eliminate debug output from truth parquet reader Add catalog for star lc stats truth Rename truth catalogs to include full run designation

yymao · 2020-06-07T19:25:06Z

@JoanneBogart Thank you for the draft PR and for your patience! I haven't done a detailed review but have two high-level comments:

First, I see what you are saying about making ParquetFileWrapper a class in some utility model. I agree with you, and wonder if we should just do it now. My gut feeling is that the incentive of doing so would decrease with time, and we will end up having several slightly different ParquetFileWrapper classes.

Of course I am also sensitive to our limited time/resource. An alternative suggestion is to make the ParquetFileWrapper a subclass of dc2_dm_catalog.ParquetFileWrapper. This should be straightforward and it will gives future devs a hint that, if they need another similar class, they should either make another subclass, or merge them into a utility class.

The ParsePathInfo class is a clever approach (but can we call it PathInfoParser so that it's a noun?). Yesterday I was actually just reading the ParquetDataset feature in pyarrow and wonder if there's something we can borrow from?

I haven't looked into the native filter issue, but will do. More later.

JoanneBogart · 2020-06-08T17:27:02Z

@yymao Thanks for your comments! I know you're heavily over-subscribed and I really appreciate that you found some time for this.
I prefer the first approach (separate out ParquetFileWrapper into a utility module); I can start work on that. Any suggestions on naming, overall structure, etc.?
I'll change the name to PathInfoParser and look up your reference; thanks.
Since my last commit the native filter has been working properly. The problem was a missing conversion of a string representation of the value to int.

yymao · 2020-06-08T17:48:21Z

Thank you @JoanneBogart! We do have a utils.py module but it doesn't feel right to put ParquetFileWrapper there. Maybe it could just be parquet.py? If in the future we have more common classes, we probably won't put them all into the same module? Other ideas are welcome.

Glad to hear the native filter issue is resolved!

JoanneBogart · 2020-06-11T22:17:23Z

@yymao I've moved ParquetFileWrapper to parquet.py and renamed the class ParsePathInfo --> PathInfoParser
Should I go any further in this PR? Possible steps to consider are

modify dc2_dm_catalog to use the new version of ParquetFileWrapper and delete the internal one
Assuming previous step has been done, use read_columns_row_group rather than read_columns (only matters if individual parquet files are large enough to merit > 1 row group)
Extract PathInfoParser into a utility module (could be utils.py or something else)
Modify some existing readers to use it

My feeling is that the ParquetFileWrapper items might be appropriate to do now - the changes involved are minimal - but PathInfoParser should wait until we've defined more or less what the new keyword structure for configs will be.

JoanneBogart · 2020-06-16T00:09:01Z

@yymao The ParquetDataset feature does look interesting. It probably is faster and less memory intensive to let pyarrow do the cuts when possible, but it's not as general as GCRQuery. Is it adequate for what people typically do? Could it be invoked for those cases it covers (which would mean recognizing them) and fall back to current code for the rest?

yymao · 2020-06-16T00:24:53Z

@JoanneBogart -- thanks for the update and patience! I am still a bit swamped now, but will get back to this later this week!

yymao

I added some comments on PathInfoParser

GCRCatalogs/dc2_truth_parquet.py

yymao · 2020-06-23T01:00:33Z

@JoanneBogart thanks for waiting. One high-level question: do you envision that a ParquetFileWrapper instance represent only one row group? If so, I have two concerns:

A minor concern is that the name ParquetFileWrapper would be a big misleading.
A more important on is that this means that if a file has multiple row groups, it will then be open multiple times (each ParquetFileWrapper will has its only self._handle).

I wonder if it makes sense to have ParquetRowGroupWrapper and it takes a ParquetFileWrapper instance as input?

Re: your potential to-do list, here's my thoughts:

modify dc2_dm_catalog to use the new version of ParquetFileWrapper and delete the internal one ✔️ I think we should do this in this PR.
Assuming previous step has been done, use read_columns_row_group rather than read_columns (only matters if individual parquet files are large enough to merit > 1 row group) ❓ this is related to my high-level question.
Extract PathInfoParser into a utility module (could be utils.py or something else) ✅ that's fine. I don't have strong opinion.
Modify some existing readers to use it ⌛ fine by me, but it doesn't seem a high priority.

@yyao

Suggestions gratefully accepted. Thanks, @yyao Co-authored-by: Yao-Yuan Mao <[email protected]>

JoanneBogart · 2020-06-23T17:40:26Z

@yymao Yes, I think it would be better to restructure Parquet..Wrapper as you suggest. And I will move PathInfoParser out of the reader - that way it's easily available to new readers - but I also think modifying existing readers to use it can wait for another PR.

JoanneBogart · 2020-06-23T18:54:52Z

One high-level question: do you envision that a ParquetFileWrapper instance represent only one row group? If so, I have two concerns:

A minor concern is that the name ParquetFileWrapper would be a big misleading.
A more important on is that this means that if a file has multiple row groups, it will then be open multiple times (each ParquetFileWrapper will has its only self._handle).

Actually, I'm not sure it's necessary to do this. ParquetFileWrapper is intended to represent a file, not a row group. _iter_native_dataset in my reader has an inner loop over row groups

               for i in range(dataset.num_row_groups):
                      dataset.row_group = i
                      yield dataset

This seems to be doing the right thing without creating multiple instances of the class per file in case there is > 1 row group. However, it's not great that the reader has to include code that maybe more properly belongs in a parquet utility class. For that reason it might be better to reorganize.

yymao · 2020-06-23T20:05:58Z

@JoanneBogart Ah, I see. OK that makes sense but maybe we can be more explicitly about this behavior, for example, should we use dataset.current_row_group instead of dataset.row_group?

Also, should the method read_columns_row_group has a row_group argument which is default to None (means reading current_row_group)?

Or something along these lines to make it clear that one doesn't need a new instance for a different row_group.

JoanneBogart · 2020-06-23T22:08:38Z

I need to document usage in parquet.py.

Is there is a use case for calling read_columns_row_group other than from within _obtain_native_data_dict, where it could be useful to specify a particular row group?

JoanneBogart · 2020-06-23T23:50:31Z

@yymao should native filter quantities also be selectable columns (i.e., included in native quantities)? For object catalog I can ask for 'tract', but I don't see where this is coming from - whether a column 'tract' is added to the dataframe or if it's discovered some other way.

yymao · 2020-06-24T00:22:34Z

@JoanneBogart re: read_columns_row_group -- it's not clear to me if this method will be used for user supplied row group. My main thinking there is that our wrapper api should be similar to the original api, in which case one can supply row_group argument.

But my high-level comment is just that the attribute and method name should be clear that all row groups should be accessed with just one instance, and this is not so clear right now. In particular, setting row_group attribute before each call of read_columns_row_group is not very intuitive, and on my first read I have thought that each instance can only represent a single row group.

yymao · 2020-06-24T00:23:56Z

tract is a column in the parquet file. But native filters do not need to filter on column values ("filters" take care of that). Native filters only need to filter on chunks.

JoanneBogart · 2020-06-24T01:12:17Z

tract is a column in the parquet file.

Ok; that's what I was beginning to suspect.
As things stand now with galaxy truth, healpix is a native filter, but one can't select it (for a query spanning several healpixels). Maybe that's ok. Otherwise I guess I have to add it as a column in the files or add it on the fly when the data are read in.

Change property name to courrent_row_group Eliminate unneeded code

yymao · 2020-06-26T23:33:46Z

@JoanneBogart sorry that I didn't notice you have marked this pull request as ready for review! I'll review it by Monday.

yymao

Thanks @JoanneBogart. I've added some minor suggestions/comments, and a question about backward compatibility.

GCRCatalogs/catalog_configs/dc2_truth_run2.2i_galaxy_truth_summary.yaml

GCRCatalogs/parquet.py

GCRCatalogs/dc2_dm_catalog.py

Commit reviewer's suggested changes Co-authored-by: Yao-Yuan Mao <[email protected]>

JoanneBogart · 2020-06-28T22:39:22Z

@yymao thanks for the suggested changes, which I've committed.

I changed DC2DMCatalog to use read_columns_row_group because I thought you had suggested it earlier; I now see a missed an important part of the "suggestion": a question mark.

I don't have strong feelings about it. If someone has code which is depending on the data coming back a full tract (or healpix or whatever) at a time, there could indeed be a compatibility issue. Do you think there is a chance of this? If so, it certainly should be changed back to read_columns.

yymao · 2020-06-28T22:50:02Z

@JoanneBogart Thanks. I did suggest that we switch to the new parquet class you wrote in DC2DMCatalog, but was thinking that we'll the behavior the intact.

The chunks that the iterator returns should indeed stay the same, because many add-on catalogs depends on that (they are split into same chunks so that the composite catalog can be efficient).

That said, I am guessing all the current parquet files that we are using all have single row group, but I haven't verified that. I'll be comfortable using read_columns_row_group if that is indeed the case, otherwise we should use read_columns just to be safe (and add a comment to note this choice).

JoanneBogart · 2020-06-28T23:41:41Z

I think I had better change it back to read_columns, even if all parquet files now used in composite catalogs only do have one row group.
It would be nice if, in addition to the data, the iterator could also return a little dict of metadata: native filter names and their values and another entry for row group. Or, rather than change the iterator interace (likely too disruptive), provide a separate function returning a list of these dicts which can be matched up with the chunks of data. But I'm certainly not proposing we do either of these now.

…_columns_row_group

yymao

Thanks, @JoanneBogart! Approving now.

JoanneBogart added 4 commits June 1, 2020 15:35

first draft of reader for parquet truth

58fa7ad

iterate over row groups

2e0d44e

make row_group a property; handle escapes properly in re strings

20d0097

try harder to pacify pylint

2fad4d9

JoanneBogart added the reader Issues related to reader modules label Jun 3, 2020

Add galaxy truth config; minor updates to reader

f96027a

fix native filters bug

f249417

Update galaxy truth base_dir to production location

db9fc8c

Eliminate debug output from truth parquet reader Add catalog for star lc stats truth Rename truth catalogs to include full run designation

break out ParquetFileWrapper into its own module

70501f8

yymao reviewed Jun 23, 2020

View reviewed changes

GCRCatalogs/dc2_truth_parquet.py Outdated Show resolved Hide resolved

GCRCatalogs/dc2_truth_parquet.py Outdated Show resolved Hide resolved

GCRCatalogs/dc2_truth_parquet.py Outdated Show resolved Hide resolved

Apply suggestions from code review

dd3761c

Suggestions gratefully accepted. Thanks, @yyao Co-authored-by: Yao-Yuan Mao <[email protected]>

JoanneBogart added 2 commits June 24, 2020 15:28

Document ParquetFileWrapper class

c08312b

Change property name to courrent_row_group Eliminate unneeded code

move PathInfoParser class into its own module

10f86eb

JoanneBogart added 2 commits June 24, 2020 16:40

Use parquet.ParquetFileWrapper

91c10dd

Add needed import

d1f57b5

JoanneBogart marked this pull request as ready for review June 24, 2020 23:59

yymao self-requested a review June 26, 2020 23:33

yymao requested changes Jun 27, 2020

View reviewed changes

Apply suggestions from code review

a8519b6

Commit reviewer's suggested changes Co-authored-by: Yao-Yuan Mao <[email protected]>

Ensure backwards compatibility by using read_columns rather than read…

d342b4c

…_columns_row_group

yymao approved these changes Jun 29, 2020

View reviewed changes

JoanneBogart merged commit b58f99b into master Jun 29, 2020

JoanneBogart deleted the u/jrbogart/truth-parquet-reader branch June 29, 2020 01:24

yymao mentioned this pull request Jul 18, 2020

Generate 2.1i truth catalogs (for objects not in cosmoDC2) LSSTDESC/DC2-production#346

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

U/jrbogart/truth parquet reader #455

U/jrbogart/truth parquet reader #455

JoanneBogart commented Jun 3, 2020

JoanneBogart commented Jun 3, 2020 •

edited

Loading

JoanneBogart commented Jun 4, 2020

yymao commented Jun 7, 2020

JoanneBogart commented Jun 8, 2020

yymao commented Jun 8, 2020

JoanneBogart commented Jun 11, 2020 •

edited

Loading

JoanneBogart commented Jun 16, 2020

yymao commented Jun 16, 2020

yymao left a comment

yymao commented Jun 23, 2020

JoanneBogart commented Jun 23, 2020

JoanneBogart commented Jun 23, 2020 •

edited

Loading

yymao commented Jun 23, 2020

JoanneBogart commented Jun 23, 2020

JoanneBogart commented Jun 23, 2020

yymao commented Jun 24, 2020

yymao commented Jun 24, 2020

JoanneBogart commented Jun 24, 2020 •

edited

Loading

yymao commented Jun 26, 2020

yymao left a comment

JoanneBogart commented Jun 28, 2020

yymao commented Jun 28, 2020

JoanneBogart commented Jun 28, 2020

yymao left a comment

U/jrbogart/truth parquet reader #455

U/jrbogart/truth parquet reader #455

Conversation

JoanneBogart commented Jun 3, 2020

JoanneBogart commented Jun 3, 2020 • edited Loading

JoanneBogart commented Jun 4, 2020

yymao commented Jun 7, 2020

JoanneBogart commented Jun 8, 2020

yymao commented Jun 8, 2020

JoanneBogart commented Jun 11, 2020 • edited Loading

JoanneBogart commented Jun 16, 2020

yymao commented Jun 16, 2020

yymao left a comment

Choose a reason for hiding this comment

yymao commented Jun 23, 2020

JoanneBogart commented Jun 23, 2020

JoanneBogart commented Jun 23, 2020 • edited Loading

yymao commented Jun 23, 2020

JoanneBogart commented Jun 23, 2020

JoanneBogart commented Jun 23, 2020

yymao commented Jun 24, 2020

yymao commented Jun 24, 2020

JoanneBogart commented Jun 24, 2020 • edited Loading

yymao commented Jun 26, 2020

yymao left a comment

Choose a reason for hiding this comment

JoanneBogart commented Jun 28, 2020

yymao commented Jun 28, 2020

JoanneBogart commented Jun 28, 2020

yymao left a comment

Choose a reason for hiding this comment

JoanneBogart commented Jun 3, 2020 •

edited

Loading

JoanneBogart commented Jun 11, 2020 •

edited

Loading

JoanneBogart commented Jun 23, 2020 •

edited

Loading

JoanneBogart commented Jun 24, 2020 •

edited

Loading