Add ::availableChunks call to Record Component types #802
Conversation
The only thing I can really find on this for HDF5 is dataset chunks, which seem to be related only by name?
You could return the information we have from particle patches (we don't currently store this explicitly for fields). HDF5 chunks are relatively small but similar in thought, yes! We currently don't write in chunks but should do so later on (#406). Logic one could implement (top-to-bottom fall-through):
The general fallback for now could be the last option; if we can already return HDF5 chunks, that would be cool, too.
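The fall-through described above could be sketched roughly like this. This is an illustrative sketch only: `Chunk`, `availableChunksFallthrough`, and the parameters are made up for the example and are not openPMD-api code.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Toy stand-in for a chunk: an n-dimensional offset/extent pair.
struct Chunk
{
    std::vector<std::uint64_t> offset;
    std::vector<std::uint64_t> extent;
};

std::vector<Chunk> availableChunksFallthrough(
    std::optional<std::vector<Chunk>> const &backendChunks,
    std::optional<std::vector<Chunk>> const &particlePatchChunks,
    std::vector<std::uint64_t> const &datasetExtent)
{
    // 1. Prefer chunk information reported by the backend (e.g. ADIOS blocks).
    if (backendChunks)
        return *backendChunks;
    // 2. Otherwise, fall back to user-defined particle patches.
    if (particlePatchChunks)
        return *particlePatchChunks;
    // 3. Last resort: report the whole dataset as a single chunk.
    return {Chunk{
        std::vector<std::uint64_t>(datasetExtent.size(), 0), datasetExtent}};
}
```

The ordering of steps 1 and 2 is exactly the open question discussed below.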
Just putting this link here since it took me 15 minutes to find any documentation on any of the chunk querying functions: https://hdf5.io/develop/group___h5_d.html#gaccff213d3e0765b86f66d08dd9959807
I've added detection of whether a dataset is chunked or not. Since we don't actually write chunked datasets in HDF5 (the corresponding code in
We discussed offline, so I am just quickly summarizing a selected point here:
The whole idea about particle patches in the standard is that we find the chunks to read again :-) That's how I used them in the past, and they are the motivating example for the standardization. The current implementation does not make use of that yet, but we can potentially put a high-level layer over the current API to make it so in the future.
@franzpoeschel will as a follow-up provide a function that receives (large) chunks and can chunk them into finer (or N>1) chunks for processing, with user-provided controls (e.g. maximum individual chunk extent).
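As a rough illustration of what such a follow-up helper might do, here is a sketch that splits a chunk along its slowest-varying dimension under a user-provided maximum extent. The names (`Chunk`, `splitChunk`, `maxRows`) and the splitting strategy are made up for this example; the real function would likely offer more controls.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Toy stand-in for a chunk: an n-dimensional offset/extent pair.
struct Chunk
{
    std::vector<std::uint64_t> offset;
    std::vector<std::uint64_t> extent;
};

// Split `chunk` along dimension 0 so that no piece exceeds `maxRows` rows.
std::vector<Chunk> splitChunk(Chunk const &chunk, std::uint64_t maxRows)
{
    std::vector<Chunk> pieces;
    std::uint64_t done = 0;
    while (done < chunk.extent[0])
    {
        std::uint64_t take = std::min(maxRows, chunk.extent[0] - done);
        Chunk piece = chunk; // copy offset/extent in all other dimensions
        piece.offset[0] += done;
        piece.extent[0] = take;
        pieces.push_back(piece);
        done += take;
    }
    return pieces;
}
```

A chunk of extent {10, 4} with `maxRows = 4` would come out as three pieces of 4, 4, and 2 rows.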
* openPMD-related information. Especially is this call explicitly unrelated
* to openPMD's concept of particle patches, which users may additionally
* wish to use to store user-defined, backend-independent chunking
* information on particle datasets.
Let's change this, because the motivation of particlePatches is exactly to feed a function like availableChunks. It just was not implemented in openPMD-api that way yet.
The description is fitting for the way that things are implemented right now. So, if changing this, should we also change the implementation? To what?
We could maybe think about doing a more high-level approach in a fallthrough manner:
- Query available chunks from backend
- If the backend doesn't know, query particle patches (ordering? this first?)
- If none of this works, return whole dataset as one chunk.
I meant that the formulation "Especially is this call explicitly unrelated to openPMD's concept of particle patches" can be changed to "Note that this call currently does not take into account the openPMD concept of particle patches...".
currently does not take into account the openPMD concept of particle patches
Alright, that's an easier fix for sure.
But the question still stands whether it would be wise to mix both notions one day. What if both particle patches and backend-specific chunks are available on a dataset? Then..
- … both pieces of information may contradict each other. Especially given the clear semantic gap between chunks in HDF5 and particle patches, they're not even trying to achieve the same thing.
- … we would need to pick one of the two to return to the user. It's not really our place to make that decision, because depending on the use case, either of the two may be more interesting:
- If optimizing for efficiency, throughput and data distribution, loading data in a way suggested by the backend (and with knowledge on writing ranks) is to be preferred.
- If interested in physical quantities, such as the hyperrectangle containing a particle path, loading the particle patches is to be preferred.
This is not saying that I am totally against it, but I think that those points should be considered. Rough draft:
- If a user explicitly defines particle patches, this should count as explicitly overwriting the chunk information given implicitly by the backend. So if we decide to prioritize one over the other, we should probably prioritize the particle patches over the backend-dependent info.
- If we prioritize particle patches, they should in general carry information at least as useful as the backend-dependent chunk info. So, standardize some kind of writerID attribute.
- Since both notions of chunks may and will differ from one another, allow users to explicitly prefer the backend-dependent information if they wish. Something like ::availableChunks( ChunksInfoSource = ChunksInfoSource::ParticlePatches ) could do.
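A sketch of what that hypothetical interface could look like. Neither `ChunksInfoSource` nor this parameter exists in openPMD-api; a toy stand-in class is used here to make the intended dispatch concrete.

```cpp
// Which source of chunk information should ::availableChunks consult?
enum class ChunksInfoSource
{
    Auto,            // prefer particle patches, fall back to backend info
    ParticlePatches, // only user-defined particle patches
    Backend          // only backend-reported chunk information
};

// Hypothetical stand-in for a record component, reduced to the dispatch.
struct RecordComponentSketch
{
    bool hasParticlePatches = false;

    // Returns which source a call like
    //   availableChunks( ChunksInfoSource::... )
    // would actually consult.
    ChunksInfoSource availableChunksSource(
        ChunksInfoSource source = ChunksInfoSource::Auto) const
    {
        if (source == ChunksInfoSource::Auto)
        {
            // Explicitly defined particle patches override the implicit
            // backend information, as suggested above.
            return hasParticlePatches ? ChunksInfoSource::ParticlePatches
                                      : ChunksInfoSource::Backend;
        }
        return source;
    }
};
```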
I concur with your analysis. Let's cross that bridge when we come to it: in practice, we need particle patches for checkpoint-restart workflows and selection workflows, to avoid iterating the whole particle table in a massively parallel context. They also served as a good work-around for distributed HDF5 data on parallel filesystems, but that's not a primary concern if we have backend info, i.e. in ADIOS2.
Can we discuss this and questions on writerID in a VC? I have some experience with how useful/not useful that would be in practice and need to fully understand how you want to use it.
Linking PR #824 here for an example of how I intend to use this. Tl;dr: writerID by itself is only an identifier that we need to make sense of by adding meta information, such as hostnames or similar information that we may wish to rely upon for chunk distribution strategies.
src/IO/HDF5/HDF5IOHandler.cpp
Outdated
std::cerr << R"END(
[HDF5] Warning: HDF5 does not carry information on available chunks.
Returning instead the whole dataset as one chunk.
Some positions in the chunk may hold undefined data.
)END" << std::endl;
I don't think this justifies a warning: no chunks in a file can be silently interpreted as one chunk.
This one has more to do with what I wrote below: in other backends, the chunks returned by ::availableChunks are chunks that have actually been written to. In HDF5, on the other hand, calling this function will give you one large chunk, possibly containing data regions that the writer never filled.
Maybe instead make this more explicit in the documentation of ::availableChunks and drop the warning?
I think there might be a misunderstanding. HDF5's concept of chunks is not related to what ADIOS does with blocks/PGs.
HDF5 chunks are just low-level storage blocks, similar to FS block sizes. Their size does not vary over the data set; they are mainly a performance-tuning / compression-enabling aspect of low-level HDF5, and they can also be used to enable a variable size (extent) on an HDF5 data set.
Maybe we should have called openPMD-api's "chunks" "boxes" or "blocks" to avoid the confusion.
Maybe instead make this more explicit in the documentation of ::availableChunks and drop the warning?
Yes, would say so. This is generally true for incompletely written data sets, even without the new function.
I would then suggest dropping both warnings and putting a note in the doxy string of ::availableChunks that HDF5 chunks are currently completely ignored. Implementing HDF5 chunking for reading and writing is better as a follow-up PR imo.
Perfect, yes. #406
src/IO/HDF5/HDF5IOHandler.cpp
Outdated
[HDF5] Warning: HDF5 dataset has chunked layout. Since the openPMD API does not
support writing chunked HDF5 datasets, the whole dataset will be
returned as one chunk.
Some positions in the chunk may hold undefined data.
)END" << std::endl;
I don't understand this warning.
A user reads a file from elsewhere with openPMD-api and we say: "we found chunks, but we don't tell you, since we cannot write chunks." That makes no sense to me.
I am also not sure about this statement:
Some positions in the chunk may hold undefined data.
where did you find that?
I don't understand this warning.
A user reads a file from elsewhere with openPMD-api and we say: "we found chunks, but we don't tell you, since we cannot write chunks." That makes no sense to me.
Simple reason: I can't really test an implementation of that easily at the moment, since I would need to create a chunked dataset for that, which the openPMD API doesn't currently do. If the openPMD API doesn't support writing chunks yet, it's fair that reading them is unsupported as well?
Some positions in the chunk may hold undefined data.
Undefined as in "not defined explicitly by the writer". In ADIOS, if the writer writes data only to a subset of the available positions in the dataset, ::availableChunks will only return positions in the dataset that have been written to. In HDF5, no such checks are done. Returning the whole dataset as one chunk means that some positions in the chunk may not hold sensible data if the dataset has only been partially written.
I think there is a confusion on what HDF5 chunks actually are - please see explanation above. They are not as flexible/useful as ADIOS blocks/PGs, but rather a low-level storage detail, i.e. for compression and (in size) extensible data sets.
I think there is a confusion on what HDF5 chunks actually are
They are certainly different in meaning from chunks in ADIOS, which should be noted somewhere a user will see it.
* Explicitly convert things to bool, so Catch doesn't get the splendid
* idea to print the Chunk struct.
What does it need for that, an ostream operator<< implementation for Chunk? Could be generally useful for that type:
https://github.com/catchorg/Catch2/blob/devel/docs/tostring.md
I can't find the failing builds any more; I think they got lost in rebasing. It looked more like a linker bug in Catch that only came up in MSVC. g++ and clang had no issues with it.
Ok, let's try to remove this work-around then and let us compare the vars directly.
You mean, comparing the members of Chunk?
I mean this should probably read:
REQUIRE(
table[ 0 ] == WrittenChunkInfo( { 0, 0 }, { height, 4 } ) );
That's what I had originally, and I've re-run it to trigger the same error again: MSVC doesn't like that.
SerialIOTest.obj : error LNK2019: unresolved external symbol "class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > const Catch::Detail::unprintableString" (?unprintableString@Detail@Catch@@3V?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@B) referenced in function "class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > __cdecl Catch::Detail::convertUnstreamable<struct openPMD::WrittenChunkInfo>(struct openPMD::WrittenChunkInfo const &)" (??$convertUnstreamable@UWrittenChunkInfo@openPMD@@@Detail@Catch@@YA?AV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@ABUWrittenChunkInfo@openPMD@@@Z) [C:\projects\openpmd-api\build\SerialIOTests.vcxproj]
C:\projects\openpmd-api\build\bin\Release\SerialIOTests.exe : fatal error LNK1120: 1 unresolved externals [C:\projects\openpmd-api\build\SerialIOTests.vcxproj]
For some reason, MSVC won't find this symbol, and I don't really know if it's worth hunting down the reason for this error within a 15k-line file of inlined code?
Discussed today: it seems this is indeed more challenging than implementing an operator<<, so we stay with the prior work-around.
@franzpoeschel can you please update? :)
General "Chunk" question:
I've renamed things as
* and gather the data items into chunks
* Example dataset:
*
* 0123
I think I have not yet seen a JSON file with multiple storeChunk parts in it - can you please provide me with an example so I can follow a bit better? :)
Uhm.. aren't you looking at an example? Or what would you like to see additionally?
If it's the JSON dataset that you'd like to see:
"x": {
    "attributes": {
        "position": {
            "datatype": "VEC_DOUBLE",
            "value": [ 0 ]
        },
        "unitSI": {
            "datatype": "DOUBLE",
            "value": 1
        }
    },
    "data": [
        [ null, null, null, null ],
        [ null, null, null, null ],
        [ 2, 4, 6, 8 ],
        [ 2, 4, 6, 8 ],
        [ 2, 4, 6, 8 ],
        [ 2, 4, 6, 8 ],
        [ 2, 4, 6, 8 ],
        [ 2, 4, null, null ],
        [ 2, 4, null, 2 ],
        [ null, null, null, 4 ]
    ],
    "datatype": "INT"
}
So, you get nulls in unwritten positions. Nulls don't get "folded up" if a whole dimension is null, since the JSON backend takes care to fill every position in the n-d dataset at least with nulls.
Ok, that's what I am familiar with. I somehow don't get the JSON "chunks" implementation then. I guess there is always only one chunk? That's what I would expect at least, and what seems reasonable.
Well, that would be a possibility. But in JSON, it's possible to reconstruct chunks such that exactly those positions are covered that are non-null, so I figured I'd just do that.
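For illustration, that reconstruction could be sketched like this for a 2-D dataset. This is only a sketch of the idea, not the JSON backend's actual implementation: `written[row][col]` is true where the dataset holds a non-null value, and rectangles are grown downwards row by row wherever a row's written column range lines up with the chunk directly above.

```cpp
#include <cstdint>
#include <vector>

// Toy stand-in for a chunk: { row, column } offset and { rows, columns } extent.
struct Chunk
{
    std::vector<std::uint64_t> offset;
    std::vector<std::uint64_t> extent;
};

std::vector<Chunk> reconstructChunks(
    std::vector<std::vector<bool>> const &written)
{
    std::vector<Chunk> chunks;
    for (std::uint64_t row = 0; row < written.size(); ++row)
    {
        std::uint64_t col = 0;
        while (col < written[row].size())
        {
            if (!written[row][col])
            {
                ++col;
                continue;
            }
            // Found a contiguous run of written positions in this row.
            std::uint64_t start = col;
            while (col < written[row].size() && written[row][col])
                ++col;
            std::uint64_t width = col - start;
            // Merge with a chunk that ends directly above and covers exactly
            // the same columns; otherwise start a new one-row chunk.
            bool merged = false;
            for (auto &c : chunks)
            {
                if (c.offset[1] == start && c.extent[1] == width &&
                    c.offset[0] + c.extent[0] == row)
                {
                    ++c.extent[0];
                    merged = true;
                    break;
                }
            }
            if (!merged)
                chunks.push_back(Chunk{{row, start}, {1, width}});
        }
    }
    return chunks;
}
```

Applied to the example dataset above, this yields three chunks: the five fully written rows as one 5x4 chunk, plus two smaller chunks covering the partially written rows.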
Ah, got it now - smart! 🎉
Ok, I think that works, although
I can't really shake the feeling that ChunkBox sounds a bit.. double? But I think we can go with it. I like it more than ChunkSelection, mainly because a derived class WrittenChunkSelection sounds like I'm back at writing Java. Maybe ChunkMeta, if that's not too generic? At least that expresses directly that we're not talking about the actual chunk.
Implemented only for ADIOS2 so far
This PR is extracted from my streaming branch and did not need all of the things I put in it.
The openPMD API does not yet write HDF5 datasets in a chunked manner, so let's not jump through hoops to read them back in a special manner.
This reverts commit 2eb62c9.
test/ParallelIOTest.cpp
Outdated
int __mpi_rank{ -1 }, __mpi_size{ -1 };
MPI_Comm_rank( MPI_COMM_WORLD, &__mpi_rank );
MPI_Comm_size( MPI_COMM_WORLD, &__mpi_size );
unsigned mpi_rank{ static_cast< unsigned >( __mpi_rank ) },
    mpi_size{ static_cast< unsigned >( __mpi_size ) };
Slipped through: reserved naming.
Idea (r for read or so) - did not check all the member vars and hope it does not shadow anything this way:
int r_mpi_rank{ -1 }, r_mpi_size{ -1 };
MPI_Comm_rank( MPI_COMM_WORLD, &r_mpi_rank );
MPI_Comm_size( MPI_COMM_WORLD, &r_mpi_size );
unsigned mpi_rank{ static_cast< unsigned >( r_mpi_rank ) },
    mpi_size{ static_cast< unsigned >( r_mpi_size ) };
Yeah, that works
src/binding/python/Chunk.cpp
Outdated
// .def_property_readonly("offset", [](Chunk & c){ return c.offset; })
// .def_property_readonly("extent", [](Chunk & c){ return c.extent; })
// .def_property_readonly("rank", [](Chunk & c){ return c.mpi_rank; })
Dead code: property_ is only needed if one wants to wrap a getter/setter method into a direct property. Super useful, but we expose the whole field directly here since it is already public in C++.
Ah, thanks for the explanation. I left it commented out for now so we could check which one we would prefer :)
src/binding/python/Chunk.cpp
Outdated
// .def_property_readonly("offset", [](Chunk & c){ return c.offset; })
// .def_property_readonly("extent", [](Chunk & c){ return c.extent; })
// .def_property_readonly("rank", [](Chunk & c){ return c.mpi_rank; })
test/SerialIOTest.cpp
Outdated
* Since the chunks are reconstructed, they need not necessarily
* correspond with the way that the chunks are written.
* Since the chunks are reconstructed, they won't necessarily
* correspond with the way that the chunks were written.
Looks good to me, let's not forget to rename the file(s), too.
Thank you, great work! ✨
This allows users to inquire about chunks within datasets that are ready to be loaded. This helps in several situations:
Todo: