
Add ::availableChunks call to Record Component types #802

Merged
37 commits merged into openPMD:dev on Dec 4, 2020

Conversation

franzpoeschel
Contributor

@franzpoeschel franzpoeschel commented Oct 20, 2020

This allows users to query which chunks within a dataset are ready to be loaded. This helps in several situations:

  • If a dataset does not carry data at all indices, it reveals the positions where data is actually held.
  • It allows for more efficient loading of data by taking the write pattern into account when determining a read pattern.

Todo:

  • Implement for backends other than ADIOS2
  • Documentation
  • Cleanup

@ax3l ax3l self-requested a review October 21, 2020 05:46
@ax3l ax3l self-assigned this Oct 21, 2020
@ax3l ax3l added the api: new additions to the API label Oct 21, 2020
@franzpoeschel
Contributor Author

The only thing that I can really find on this for HDF5 is dataset chunks, which seem to be related only by name?
If HDF5 had something like this, I would expect to find it here, but there's nothing that looks useful…

@franzpoeschel franzpoeschel force-pushed the topic-available-chunks branch 2 times, most recently from 20ffd7f to 7a4ed99 on October 26, 2020 16:33
@ax3l
Member

ax3l commented Oct 27, 2020

You could return the information we have from particle patches (we don't currently store this explicitly for fields).

HDF5 chunks are relatively small but similar in thought, yes! We currently don't write in chunks but should do so later on #406

Logic one could do (top-to-bottom fall-through):

  • particle patches present: return chunks stored in particle patches
  • HDF5: return chunking or full extent if no chunking present; ADIOS return PGs/blocks
  • return one large chunk

The general fallback for now could be the last option; if we can already return HDF5 chunks that would be cool, too.
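The fall-through above could be sketched roughly as follows. This is a hypothetical sketch with stand-in types: `DatasetView`, `particlePatches`, and `backendChunks` are illustrative names, not openPMD-api API.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Stand-in for the chunk meta-data (offset + extent) discussed in this PR.
struct ChunkInfo
{
    std::vector< std::size_t > offset;
    std::vector< std::size_t > extent;
};

// Hypothetical view of a dataset with the two optional chunk-info sources.
struct DatasetView
{
    std::optional< std::vector< ChunkInfo > > particlePatches;
    std::optional< std::vector< ChunkInfo > > backendChunks; // HDF5 chunks / ADIOS PGs
    std::vector< std::size_t > totalExtent;
};

std::vector< ChunkInfo > availableChunks( DatasetView const & ds )
{
    if( ds.particlePatches )
        return *ds.particlePatches; // 1. particle patches present
    if( ds.backendChunks )
        return *ds.backendChunks;   // 2. backend-provided chunking/blocks
    // 3. general fallback: return the whole dataset as one large chunk
    return { ChunkInfo{
        std::vector< std::size_t >( ds.totalExtent.size(), 0 ),
        ds.totalExtent } };
}
```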

@franzpoeschel franzpoeschel mentioned this pull request Oct 30, 2020
@franzpoeschel franzpoeschel force-pushed the topic-available-chunks branch 2 times, most recently from 644b1c3 to d7240d2 on November 2, 2020 11:31
@franzpoeschel
Contributor Author

franzpoeschel commented Nov 9, 2020

Just putting this link here since it took me 15 minutes to find any documentation on any of the chunk querying functions: https://hdf5.io/develop/group___h5_d.html#gaccff213d3e0765b86f66d08dd9959807

@franzpoeschel
Contributor Author

You could return the information we have from particle patches (we don't currently store this explicitly for fields).

HDF5 chunks are relatively small but similar in thought, yes! We currently don't write in chunks but should do so later on #406

Logic one could do (top-to-bottom fall-through):

* particle patches present: return chunks stored in particle patches

RecordComponent::availableChunks is information carried by the backends independently of the openPMD structure whereas particle patches are explicitly stored by the user within terms defined by the openPMD standard. I think we should keep both concepts separate.

* HDF5: return chunking or full extent if no chunking present; ADIOS return PGs/blocks

I've added detection of whether a dataset is chunked or not. Since we don't actually write chunked datasets in HDF5 (the corresponding code in ::writeDataset is commented out), I've only implemented your last suggestion below and use the chunking information solely to print different warnings.

* return one large chunk

The general fallback for now could be the last option; if we can already return HDF5 chunks that would be cool, too.

@ax3l
Member

ax3l commented Nov 10, 2020

We discussed offline, so I am just quickly summarizing a selected point here:

RecordComponent::availableChunks is information carried by the backends independently of the openPMD structure whereas particle patches are explicitly stored by the user within terms defined by the openPMD standard. I think we should keep both concepts separate.

The whole idea of particle patches in the standard is that we can find the chunks to read again :-) That's how I used them in the past, and they are the motivating example for the standardization. The current implementation does not make use of that yet, but we can potentially put a high-level layer over the current API to make it so in the future.

  • return one large chunk

The general fallback for now could be the last option; if we can already return HDF5 chunks that would be cool, too.

@franzpoeschel will, as a follow-up, provide a function that receives (large) chunks and can split them into finer (or N>1) chunks for processing, with user-provided controls (e.g. a maximum individual chunk extent).
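Such a splitting function could look roughly like this. A minimal sketch under assumptions: `ChunkInfo`, `splitChunk`, and the single scalar `maxExtent` control are illustrative stand-ins, not the API the follow-up PR actually introduced.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in chunk meta-data: n-D offset + extent.
struct ChunkInfo
{
    std::vector< std::size_t > offset;
    std::vector< std::size_t > extent;
};

// Split one large n-D chunk into pieces whose extent along each dimension
// does not exceed maxExtent.
std::vector< ChunkInfo >
splitChunk( ChunkInfo const & big, std::size_t maxExtent )
{
    assert( maxExtent > 0 );
    std::vector< ChunkInfo > result{ big };
    // Split dimension by dimension, refining the pieces from the previous pass.
    for( std::size_t d = 0; d < big.extent.size(); ++d )
    {
        std::vector< ChunkInfo > next;
        for( auto const & c : result )
        {
            for( std::size_t pos = 0; pos < c.extent[ d ]; pos += maxExtent )
            {
                ChunkInfo piece = c;
                piece.offset[ d ] = c.offset[ d ] + pos;
                piece.extent[ d ] = std::min( maxExtent, c.extent[ d ] - pos );
                next.push_back( piece );
            }
        }
        result = std::move( next );
    }
    return result;
}
```

For example, a `{10, 4}` chunk with `maxExtent = 4` would split into three pieces along the first dimension (extents 4, 4, and 2).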

include/openPMD/Dataset.hpp (outdated)
Comment on lines 72 to 75
* openPMD-related information. Especially is this call explicitly unrelated
* to openPMD's concept of particle patches, which users may additionally
* wish to use to store user-defined, backend-independent chunking
* information on particle datasets.
Member

Let's change this, because the motivation of particlePatches is exactly to feed a function like availableChunks.
It just was not implemented in openPMD-api that way, yet.

Contributor Author

The description is fitting for the way that things are implemented right now. So, if changing this, should we also change the implementation? To what?

We could maybe think about doing a more high-level approach in a fallthrough manner:

  1. Query available chunks from backend
  2. If the backend doesn't know, query particle patches (ordering? this first?)
  3. If none of this works, return whole dataset as one chunk.

Member

I meant that the formulation Especially is this call explicitly unrelated to openPMD's concept of particle patches can be changed to Note that this call currently does not take into account the openPMD concept of particle patches....

Contributor Author

currently does not take into account the openPMD concept of particle patches

Alright, that's an easier fix for sure.

But the question still stands whether it would be wise to mix both notions one day. What if both particle patches and backend-specific chunks are available on a dataset? Then..

  1. … the two pieces of information may contradict each other. Especially given the clear semantic gap between chunks in HDF5 and particle patches, they're not even trying to achieve the same thing.
  2. … we would need to pick one of the two to return to the user. It's not really our place to make that decision, because depending on the use case, either may be the more interesting one:
    • If optimizing for efficiency, throughput, and data distribution, loading data in a way suggested by the backend (and with knowledge of the writing ranks) is to be preferred.
    • If interested in physical quantities, such as the hyperrectangle containing a particle path, loading the particle patches is to be preferred.

This is not saying that I am totally against it, but I think that those points should be considered. Rough draft:

  1. If a user explicitly defines particle patches, this should count as explicitly overwriting the chunk information given implicitly by the backend. So if we decide to prioritize one over the other, we should probably prioritize the particle patches over the backend-dependent info.
  2. If we prioritize particle patches, they should in general carry information at least as useful as the backend-dependent chunk info. So, standardize some kind of writerID attribute.
  3. Since both notions of chunks may and will differ from one another, allow users to explicitly prefer the backend-dependent information if they wish. Something like ::availableChunks( ChunksInfoSource = ChunksInfoSource::ParticlePatches ) could do.
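The rough draft above could be shaped into an API along these lines. Everything here is hypothetical: `ChunksInfoSource`, `DatasetView`, and the default argument are only the draft from this comment, not what the PR merged.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Stand-in chunk meta-data: offset + extent.
struct ChunkInfo
{
    std::vector< std::size_t > offset;
    std::vector< std::size_t > extent;
};

// Draft of point 3: let users explicitly pick the chunk-info source.
enum class ChunksInfoSource { ParticlePatches, Backend };

// Hypothetical dataset view holding both sources.
struct DatasetView
{
    std::optional< std::vector< ChunkInfo > > particlePatches; // user-defined
    std::vector< ChunkInfo > backendChunks;                    // backend-dependent
};

std::vector< ChunkInfo > availableChunks(
    DatasetView const & ds,
    ChunksInfoSource prefer = ChunksInfoSource::ParticlePatches )
{
    // Per point 1: explicitly defined particle patches override the
    // backend's implicit chunk info, unless the caller opts out (point 3).
    if( prefer == ChunksInfoSource::ParticlePatches && ds.particlePatches )
        return *ds.particlePatches;
    return ds.backendChunks;
}
```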

Member

@ax3l ax3l Nov 13, 2020

I concur with your analysis. Let's cross that bridge when we come to it; in practice, we need particle patches for checkpoint-restart workflows and for selection workflows, to avoid iterating the whole particle table in a massively parallel context. It also served as a good work-around for distributed HDF5 data on parallel filesystems, but that's not a primary concern if we have backend info, i.e. in ADIOS2.

Can we discuss this and the questions on writerID in a VC? I have some experience with how useful (or not) that would be in practice and need to fully understand how you want to use it.

Contributor Author

Linking PR #824 here for an example of how I intend to use this. TL;DR: writerID by itself is only an identifier that we need to make sense of by adding meta information, such as hostnames or similar information that we may wish to rely upon for chunk distribution strategies.

src/Dataset.cpp (outdated)
include/openPMD/Dataset.hpp (outdated)
src/IO/ADIOS/ADIOS2IOHandler.cpp (outdated)
Comment on lines 412 to 416
std::cerr << R"END(
[HDF5] Warning: HDF5 does not carry information on available chunks.
Returning instead the whole dataset as one chunk.
Some positions in the chunk may hold undefined data.
)END" << std::endl;
Member

I don't think this justifies a warning: no chunks in a file can be silently interpreted as one chunk.

Contributor Author

This one has more to do with what I wrote below: in other backends, the chunks returned by ::availableChunks are chunks that have actually been written to. In HDF5, on the other hand, calling this function will give you one large chunk, possibly containing data regions that the writer never filled.

Maybe instead make this more explicit in the documentation of ::availableChunks and drop the warning?

Member

@ax3l ax3l Nov 13, 2020

I think there might be a misunderstanding. HDF5's concept of chunks is not related to what ADIOS does with blocks/PGs.

HDF5 chunks are just low-level storage blocks, similar to FS block sizes. Their size is fixed across a dataset; they are mainly a performance-tuning / compression-enabling aspect of low-level HDF5, and they can also be used to give an HDF5 dataset a variable (extensible) size.

Maybe we should have called openPMD-api's "chunks" "boxes" or "blocks" to avoid the confusion.

Maybe instead make this more explicit in the documentation of ::availableChunks and drop the warning?

Yes, would say so. This is generally true for incompletely written data sets, even without the new function.

Contributor Author

I would then suggest dropping both warnings and putting a note in the doxy string of ::availableChunks that HDF5 chunks are currently completely ignored. Implementing HDF5 chunking for reading and writing is better done as a follow-up PR imo.

Member

@ax3l ax3l Nov 13, 2020

Perfect, yes. #406

Comment on lines 424 to 428
[HDF5] Warning: HDF5 dataset has chunked layout. Since the openPMD API does not
support writing chunked HDF5 datasets, the whole dataset will be
returned as one chunk.
Some positions in the chunk may hold undefined data.
)END" << std::endl;
Member

I don't understand this warning.
A user reads a file from elsewhere with openPMD-api and we say: "we found chunks, but we don't tell you, since we cannot write chunks." That makes no sense to me.
I am also not sure about this statement:

Some positions in the chunk may hold undefined data.

where did you find that?

Contributor Author

I don't understand this warning.
A user reads a file from elsewhere with openPMD-api and we say: "we found chunks, but we don't tell you, since we cannot write chunks." That makes no sense to me.

Simple reason: I can't really test an implementation of that easily at the moment, since I would need to create a chunked dataset for that, which the openPMD API doesn't currently do. If the openPMD API doesn't support writing chunks yet, it's fair that reading them is unsupported as well?

Some positions in the chunk may hold undefined data.

Undefined as in "not defined explicitly by the writer". In ADIOS, if the writer writes data only to a subset of the available positions in the dataset, ::availableChunks will only return positions in the dataset that have been written to. In HDF5, no such checks are done. Returning the whole dataset as one chunk means that some positions in the chunk may not hold sensible data if the dataset has only been written partially.

Member

I think there is a confusion on what HDF5 chunks actually are - please see explanation above. They are not as flexible/useful as ADIOS blocks/PGs, but rather a low-level storage detail, i.e. for compression and (in size) extensible data sets.

Contributor Author

I think there is a confusion on what HDF5 chunks actually are

They are certainly different in meaning from chunks in ADIOS, which should be noted somewhere a user will see it.

src/IO/JSON/JSONIOHandlerImpl.cpp (outdated)
Comment on lines +1637 to +1630
* Explicitly convert things to bool, so Catch doesn't get the splendid
* idea to print the Chunk struct.
Member

What does it need for that, an ostream operator<< implementation for Chunk? Could be generally useful for that type:
https://github.com/catchorg/Catch2/blob/devel/docs/tostring.md

Contributor Author

I can't find the failing builds any more; I think they got lost in rebasing. It looked more like a linker bug in Catch that only came up with MSVC. g++ and clang had no issues with that.

Member

Ok, let's try to remove this work-around then and let us compare the vars directly.

Contributor Author

You mean, comparing the members of Chunk?

Member

I mean this should probably read:

        REQUIRE(
            table[ 0 ] == WrittenChunkInfo( { 0, 0 }, { height, 4 } ) );

Contributor Author

That's what I had originally, and I've re-run it to trigger the same error again: MSVC doesn't like that.

SerialIOTest.obj : error LNK2019: unresolved external symbol "class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > const Catch::Detail::unprintableString" (?unprintableString@Detail@Catch@@3V?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@B) referenced in function "class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > __cdecl Catch::Detail::convertUnstreamable<struct openPMD::WrittenChunkInfo>(struct openPMD::WrittenChunkInfo const &)" (??$convertUnstreamable@UWrittenChunkInfo@openPMD@@@Detail@Catch@@YA?AV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@ABUWrittenChunkInfo@openPMD@@@Z) [C:\projects\openpmd-api\build\SerialIOTests.vcxproj]
C:\projects\openpmd-api\build\bin\Release\SerialIOTests.exe : fatal error LNK1120: 1 unresolved externals [C:\projects\openpmd-api\build\SerialIOTests.vcxproj]

For some reason, MSVC won't find this symbol, and I don't really know if it's worth hunting down the reason for this error within a 15k-line file of inlined code..?

Member

@ax3l ax3l Nov 25, 2020

Discussed today: seems this is indeed more challenging than implementing an operator<<, so we stay with the prior work-around.

@franzpoeschel can you please update? :)

@ax3l
Member

ax3l commented Nov 13, 2020

General "Chunk" question:
I think naming the new meta-data class "Chunk" could be confusing: users might ask whether a chunk is the "data chunk" (loadChunk/storeChunk) or the meta-data (box with start & size) that describes the chunk. What do you think? Box or Selection could be a simpler name for the new class, maybe.

@franzpoeschel
Contributor Author

I've renamed things as ChunkInfo and WrittenChunkInfo now. That fine? Still need to update the filenames.

src/Chunk.cpp (outdated)
test/name_tests.sh (outdated)
* and gather the data items into chunks
* Example dataset:
*
* 0123
Member

@ax3l ax3l Nov 13, 2020

I think I have not yet seen a JSON file with multiple storeChunk parts in it - can you please provide me with an example so I can follow a bit better? :)

Contributor Author

Uhm.. aren't you looking at an example? Or what would you like to see additionally?

Contributor Author

If it's the JSON dataset that you'd like to see:

          "x": {
            "attributes": {
              "position": {
                "datatype": "VEC_DOUBLE",
                "value": [ 0 ]
              },
              "unitSI": {
                "datatype": "DOUBLE",
                "value": 1
              }
            },
            "data": [
              [ null, null, null, null ],
              [ null, null, null, null ],
              [ 2, 4, 6, 8 ],
              [ 2, 4, 6, 8 ],
              [ 2, 4, 6, 8 ],
              [ 2, 4, 6, 8 ],
              [ 2, 4, 6, 8 ],
              [ 2, 4, null, null ],
              [ 2, 4, null, 2 ],
              [ null, null, null, 4 ]
            ],
            "datatype": "INT"
          }

So, you get nulls in unwritten positions. Nulls don't get "folded up" if a whole dimension is null, since the JSON backend takes care to fill every position in the n-d dataset at least with nulls.

Member

@ax3l ax3l Nov 25, 2020

Ok, that's what I am familiar with. I somehow don't get the json "chunks" implementation then. I guess there is always only one chunk? That's what I would expect at least and what seems reasonable.

Contributor Author

Well, that would be a possibility. But in JSON, it's possible to reconstruct chunks such that exactly those positions are covered that are non-null, so I figured I'd just do that.

Member

Ah, got it now - smart! 🎉

@ax3l
Member

ax3l commented Nov 13, 2020

I've renamed things as ChunkInfo and WrittenChunkInfo now. That fine? Still need to update the filenames.

Ok, I think that works, although Info is a bit of a generic qualifier. ChunkBox or ChunkSelection, maybe?

@franzpoeschel
Contributor Author

franzpoeschel commented Nov 16, 2020

I can't really shake the feeling that ChunkBox sounds a bit redundant? But I think we can go with it. I like it more than ChunkSelection, mainly because a derived class WrittenChunkSelection sounds like I'm back to writing Java.

Maybe ChunkMeta if that's not too generic, too? At least, that expresses directly that we're not talking about the actual chunk.

include/openPMD/Chunk.hpp (outdated)
include/openPMD/Chunk.hpp (outdated)
Implemented only for ADIOS2 so far
This PR is extracted from my streaming branch and did not need all of
the things I put in it.
Comment on lines 295 to 299
int __mpi_rank{ -1 }, __mpi_size{ -1 };
MPI_Comm_rank( MPI_COMM_WORLD, &__mpi_rank );
MPI_Comm_size( MPI_COMM_WORLD, &__mpi_size );
unsigned mpi_rank{ static_cast< unsigned >( __mpi_rank ) },
mpi_size{ static_cast< unsigned >( __mpi_size ) };
Member

Slipped through: reserved naming.

Idea (r for read or so) - did not check all the member vars and hope it does not shadow anything this way:

Suggested change
int __mpi_rank{ -1 }, __mpi_size{ -1 };
MPI_Comm_rank( MPI_COMM_WORLD, &__mpi_rank );
MPI_Comm_size( MPI_COMM_WORLD, &__mpi_size );
unsigned mpi_rank{ static_cast< unsigned >( __mpi_rank ) },
mpi_size{ static_cast< unsigned >( __mpi_size ) };
int r_mpi_rank{ -1 }, r_mpi_size{ -1 };
MPI_Comm_rank( MPI_COMM_WORLD, &r_mpi_rank );
MPI_Comm_size( MPI_COMM_WORLD, &r_mpi_size );
unsigned mpi_rank{ static_cast< unsigned >( r_mpi_rank ) },
mpi_size{ static_cast< unsigned >( r_mpi_size ) };

Contributor Author

Yeah, that works

Comment on lines 42 to 44
// .def_property_readonly("offset", [](Chunk & c){ return c.offset; })
// .def_property_readonly("extent", [](Chunk & c){ return c.extent; })
// .def_property_readonly("rank", [](Chunk & c){ return c.mpi_rank; })
Member

Dead code: property_ is only needed if one wants to wrap a getter/setter method into a direct property. Super useful, but we expose the whole field directly here since it is already public in C++.

Suggested change
// .def_property_readonly("offset", [](Chunk & c){ return c.offset; })
// .def_property_readonly("extent", [](Chunk & c){ return c.extent; })
// .def_property_readonly("rank", [](Chunk & c){ return c.mpi_rank; })

Contributor Author

Ah, thanks for the explanation. I left it commented out for now so we could check which one we would prefer :)

Comment on lines 59 to 61
// .def_property_readonly("offset", [](Chunk & c){ return c.offset; })
// .def_property_readonly("extent", [](Chunk & c){ return c.extent; })
// .def_property_readonly("rank", [](Chunk & c){ return c.mpi_rank; })
Member

Suggested change
// .def_property_readonly("offset", [](Chunk & c){ return c.offset; })
// .def_property_readonly("extent", [](Chunk & c){ return c.extent; })
// .def_property_readonly("rank", [](Chunk & c){ return c.mpi_rank; })

Comment on lines 104 to 105
* Since the chunks are reconstructed, they need not necessarily
* correspond with the way that the chunks are written.
Member

Suggested change
* Since the chunks are reconstructed, they need not necessarily
* correspond with the way that the chunks are written.
* Since the chunks are reconstructed, they won't necessarily
* correspond with the way that the chunks were written.

@ax3l
Member

ax3l commented Nov 25, 2020

I've renamed things as ChunkInfo and WrittenChunkInfo now. That fine? Still need to update the filenames.

Looks good to me, let's not forget to rename the file(s), too.

@franzpoeschel franzpoeschel mentioned this pull request Nov 26, 2020
@ax3l
Member

ax3l commented Dec 4, 2020

Thank you, great work! ✨

@ax3l ax3l merged commit 30a2616 into openPMD:dev Dec 4, 2020
@ax3l ax3l mentioned this pull request Dec 4, 2020
@franzpoeschel franzpoeschel deleted the topic-available-chunks branch January 28, 2021 13:26