Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggest cleanup of geometry handling in the CPU version #198

Open
mondracek opened this issue Sep 11, 2023 · 26 comments
Open

Suggest cleanup of geometry handling in the CPU version #198

mondracek opened this issue Sep 11, 2023 · 26 comments
Assignees

Comments

@mondracek
Copy link
Collaborator

mondracek commented Sep 11, 2023

As I mentioned in my comment to #52, the way the CLI version of our code handles the geometry parameters (params['gridA'], ... params['gridN'], lvec, nDim etc.) could use some cleanup. For starters, I propose we discuss this issue:

  1. What do we expect to be the role of gridA, gridB, gridC, and gridN parameters in the code? Do we want them to be just an optional input (in case the geometry is not specified by other means, in particular in a *.xsf or other input file)? In such a case, they can be safely forgotten (I do not mean literally deleted, just ignored) after the initialization stage. The actual lattice vectors and grid size should be consistently carried in variables like lvec and nDim throughout the actual computation. Alternatively, do we want parms['gridA'], ... params'['gridN'] to serve as a global variable? Then, we should consistently work with those. In the most extreme version of this choice, we would not even need to introduce lvec and nDim variables at all. I can somehow understand that a compromise might be desirable: Perhaps we want some functions to be also usable as stand-alone entities, without being called from the main scripts. A moderately experienced user might want to do something in an interactive Python session. The loadGeometry function is one of those we may want to be able to call this way. Than, it makes sense for such a function to take lvec, 'nDim` etc. as arguments and store them as its local variables. On the other hand, some other functions may be intended as mere subprograms of the main executable scripts and need access to global variables, since passing everything they need through arguments to them would be inefficient and would generate too long argument lists. I suspect this compromise was what @ProkopHapala had in mind when originally writing the code. Prokop, is that so? I am actually in favor of the compromise solution, given that we decide very carefully which functions should be the independent entities, taking geometry parameters as their arguments and perhaps returning them as their values, which functions should in contrast access the global variables, and why the chosen option is needed in every case. Because if we are not clear about this, we inevitably end up with the mess we currently have.
  2. As a second point, I would like to make sure that the code reads the origin of the unit cell from the *.xsf and *.cube files every time, stores it and handles accordingly. That is, the code should not rely on the origin being zero. We might even introduce an optional parameter params['gridOrigin'] to params.ini, which would be used with *.xyz files. That should solve the problem with negative coordinates of atoms in non-periodic systems, which is currently effectively forbidden.
@mondracek
Copy link
Collaborator Author

I failed to reproduce what I thought was the most severe issue that motivated this proposal for a thorough cleanup with regards to how the geometry is being handled. I was under the impression that sometimes, the grid/lattice parameters read from params.ini are taken into account even though they should have been overwritten with a *.cube input. I do not know what happened then, probably some error in one of my *.cube files that disappeared after I had recalculated everything, but now it looks as though the problem was not with the ppafm code. Anyway, there are still several open issues related to geometry in one way or another, like the recent #201, #200, #199, the older #190 and #6, or even #52, though I have created a pull request associated with the last one recently. I would still prefer forcing some consistent logic onto the geometry handling along with solving those issues rather then just close those with some ad hoc patches that would somehow work until they'd bite us to our backs with any new modification of the code.

@mondracek
Copy link
Collaborator Author

I began reviewing the code and now it seems to me that what I will advocate with respect to the geometry will need not be any big revolution after all. I have looked into core.py and I think I gather the logic followed there. The primary source of information on lattice vectors is typically a argument called cell, an analogy of lvec from HighLevel.py, coommon.py or io.py, but the global parameters , params['gridA'], params['gridB'], params['gridC'] may be used as a backup source if cell is undefined. I believe this is a good way to do things, so I will try to keep this logic consistently.
I have a question about the function setFF, defined here:

def setFF( cell=None, gridF=None, gridE=None ):

Do we even need it? I mean, it seems to be used only in some obscure examples in examples/_bak/. Do we want to have these as a part of the ppafm package? If not, I suggest we remove those examples along with the function setFF. Otherwise, if we want to keep it, I suggest the following small cleanup: setFF seems to reach for the global grid parameters twice if cell is None; first directly in its body and second time (but this never happens because by then cell has been already defined by the first case) through setFF_shape which in turn calls setGridCell. I would remove the test of cell is None and the subsequent assignment of gridA, gridB, gridC from setFF , relying on this being done in setGridCell.
Moreover, if we use gridA, gridB, and gridC as the default for the lattice vectors in core.py, I believe we should also use gridN as the default for the grid size there (in particular, probably in the setGridN function). This enhancement should in fact not change the behavior of our code at this moment, because the flow of the code is now such that the core.py functions are always called for some pre-established FF array and learn the grid from the shape of this array. That is fine and should remain that way, but I still think the setGridN should check the n argument for being None and use gridN as the fallback in such a case, just to make the code more robust.

@mondracek
Copy link
Collaborator Author

Just to clarify how I expect the job related to this issue would be done: I will do it, of course, but only after hearing the opinion of others.

@mondracek mondracek self-assigned this Sep 13, 2023
@mondracek
Copy link
Collaborator Author

Another candidate for deletion: wrapAtomsCell in common.py. Seems to be used nowhere. The function it implements, as far as I can tell, is this: shift positions of atoms by [da + 10, db + 10], then fold them back to the first unit cell. Might be useful perhaps, but at the very least, if we are going to keep it and use it, the artificial and mysterious shift by 10 should be removed.

@ProkopHapala
Copy link
Collaborator

@mondracek

I suspect this compromise was what @ProkopHapala had in mind when originally writing the code.

yes, it is true I was aiming for such compromise. But it is also true some structures in the code are just for historical reasons (to minimize effor of rewritining parts of code which are already there)

  • In general I think it is reasonable to consider lvec and nDim to be main storage of information about the lattice. gridA, gridB, gridC are mostly just for user input in input.ini (so that each line in the input-file has some name)
  • I think some trubles with molecular geometry and lattice whould be reduced if we define class which pack them togheter (including the origin of the lattice). I know people were suggesting to use ASE but to me it seems overkill (introduce dependnece on huge 3rd party package into core of ppafm - that I don't like). I would rather make something like class AtomicSystem which I have in my other project (it encapsutlates reading and loading, keeps the lattice vectros, originn, atomic types, positions, parameters, etc.). Than we can pass instances of that class into fuctions (rather than passing arrays indepenendtly), which simplyfy the interface.
    • but I'm not sure if to do it now, since it is quite large reoganization of the project

I would still prefer forcing some consistent logic onto the geometry handling along with solving those issues rather then just close those with some ad hoc patches that would somehow work until they'd bite us to our backs with any new modification of the code.

I'm very much pro. If you have ideas pleas go ahead with proposing them. I'm very open to it. I'm just not enough systematic myself, and also currently it is not my priority.

ad setFF

what you meanit is not used? OK, maybe setFF() is not used, but setFF_shape() and setFF_pointer() are used a lot. Without them you cannot share forcefield between python and C++ part.

So yes, this is also perhaps more historical. Firts I wrote setFF(), then I realized that often I need more fine-controle (to set shape and data of the grid forcefield independnetly). So I factored out setFF_shape() and setFF_pointer(). So yes, perhaps now setFF() is almost not used.

ad wrapAtomsCell - I don't really remember what I used it for. Perhaps I used it just once for some debugging.

@mondracek
Copy link
Collaborator Author

@ProkopHapala, thanks for the reply. As for the setFF() function, yeah, I mean setFF_shape() and setFF_Fpointer() or setFF_Epointer() are called directly whenever one might want to call setFF(). So I suggest we either delete the definition of setFF() or we actually use it where appropriate. Perhaps we can unite setFF() and prepareArrays() into one function and use that one. Now, as I understand it, both functions are doing similar jobs but

  1. prepareArrays() does not call setFF_shape(). We always call setFF_shape() shortly after prepareArrays(), however
  2. setFF() calls setFF_shape() but, in contrast to prepareArrays(), it cannot create a new array if the corresponding aray argument is None

@mondracek
Copy link
Collaborator Author

Now I see that the C functions setGridN and setGridCell are defined in both ppafm/cpp/GridUtils.cpp and ppafm/cpp/ProbeParticle.cpp. The definitions are identical so the duplicity should not cause any problem until we happen to change one of them, but still, it does not feel correct.

@ProkopHapala
Copy link
Collaborator

Now I see that the C functions setGridN and setGridCell are defined in both ppafm/cpp/GridUtils.cpp and ppafm/cpp/ProbeParticle.cpp. The definitions are identical so the duplicity should not cause any problem until we happen to change one of them, but still, it does not feel correct.

I agree that it is not nice from the point of code structure (modularity etc.). But the point is that GridUtils and ProbeParticle both compiles into independnet binary (.so) and they does not depend on each other.

But yes, it would makes sense to factor out these duplicite things into singe header file (.h). But I'm not sure how it works if the function is in extern C{} (if it is not problem to have extern { in header file ?)

@ProkopHapala
Copy link
Collaborator

  1. prepareArrays() does not call setFF_shape(). We always call setFF_shape() shortly after prepareArrays(), however
  2. setFF() calls setFF_shape() but, in contrast to prepareArrays(), it cannot create a new array if the corresponding aray argument is None

OK, to be honest I don't remember exactly the code structure and the rational nor the historical path which lead to this. If you find it nicer that way, feel free to change it as you say

@mondracek
Copy link
Collaborator Author

Good. Concerning the duplicity of setGridN() and setGridCell() function, I'd rather leave it as it is for now because I am also not sure how exactly the C linking works and I might actually break it by trying to repair it. I guess moving the duplicate into a header should work in principle, although the header files might not be exactly intended for such purpose, because including a header file into a *.cpp file through #include should be (as long as I understand it) done by the preprocessor essentially by pasting the plain text of the *.h file into the *cpp file before the compilation. But I am not quite sure. If anyone of you other guys knows how to handle this in C properly, you are welcome to propose a solution.
Otherwise, I see now that the cleanup I wanted to make will not require as radical changes of the code structure as I thought when I started this issue. So I will rather pick up more specific tasks concerning the handling of geometry (and force field grids) and I will do the cleaning along the way.
One of those specific issues I am planing to address would be that we introduce the proper handling of the origin as recorded in *.xsf and *.cube files. See the second bulleted point in my first comment in this thread. Do you agree that this should be done? Other such issues are those referenced in my second comment here (#6, #190, #199, #200, #52) .

@NikoOinonen
Copy link
Collaborator

@ProkopHapala

I think some trubles with molecular geometry and lattice whould be reduced if we define class which pack them togheter (including the origin of the lattice). I know people were suggesting to use ASE but to me it seems overkill (introduce dependnece on huge 3rd party package into core of ppafm - that I don't like). I would rather make something like class AtomicSystem which I have in my other project (it encapsutlates reading and loading, keeps the lattice vectros, originn, atomic types, positions, parameters, etc.). Than we can pass instances of that class into fuctions (rather than passing arrays indepenendtly), which simplyfy the interface.

I'm also very much in support for packing things into a class object. And I already did something in this direction in the OCL API with the DataGrid (except without the atoms):

ppafm/ppafm/ocl/field.py

Lines 102 to 112 in 4e04d47

class DataGrid:
'''
Class for holding data on a grid. The data can be stored on the CPU host or an OpenCL device.
Arguments:
array: np.ndarray or pyopencl.Buffer. Array values on a 3D grid with possibly multiple components.
lvec: array-like of shape (4, 3). Unit cell boundaries. First (row) vector specifies the origin,
and the remaining three vectors specify the edge vectors of the unit cell.
shape: array-like of length 3 or 4. Grid shape when array is a pyopencl.Buffer.
ctx: pyopencl.Context. OpenCL context for device buffer. Defaults to oclu.ctx.
'''

It could be nice to somehow interface this with the CPU part, maybe make the DataGrid a subclass of the whatever class we create for the core code, to avoid repeated code and maybe help somehow unify the two parts.

As for ASE, I think it would be a pretty good option for handling things. I get that too many dependencies is not desirable, but ASE is a pretty well established and stable package, and probably many people who use ppafm also already have ASE in their environment. Also, I already used ASE as an optional dependency in the GUI for the geometry viewer.

Making our own solution is fine too, but it's a lot more work.

@ProkopHapala
Copy link
Collaborator

ProkopHapala commented Oct 5, 2023

I'm also very much in support for packing things into a class object. And I already did something in this direction in the OCL API with the DataGrid (except without the atoms), It could be nice to somehow interface this with the CPU part, maybe make the DataGrid a subclass of the whatever class we create for the core code

Yes, I like your DataGrid class. Maybe we should discuss more what ideas you have about class hierarchy or relations? You mention CPU vs GPU. But I would perhpas first speak about geometry (i.e. atomc coordinates and types) vs grid (3d array) vs lattice vectors.

  1. One option is to make one dynamic class, which have optional fields (e.g. initialized to None). e.g. it may or may not have data grid, it may or may not have atoms, or lattice vectors.
  2. Another option would be to make hierarchy. E.g. starting from Atoms class, then add lattice vectors AtomsPBC and then ad grid (array) ... or in other order?
  3. Another option is to make 2-3 independnet classes e.g. (i) Atoms which strores atomic positions and types (ii) Grid which stores lattice vectors and optional data array. And them pack them together in e.g. AtomGrid class

I'm not sure which is best approach. Because python in very dynamic unlike C++, It woudl maybe slightly prefer (1) i.e. to have optional dynamic fields. The disadvantage is that it would require to put everywhere checks if some field is initialized, like if( lvec is None ):

I have to say, that one thing I hate about python is that due to dynamic initialization of class properties, it is often not clear what properties the class is supposed to have when reading the code (at least comparing to C++). This was actually main reason why I considered classes in python poinless for quite long time. Do you know some good practicies how to make the declaration of class properties in python more explicit?

Also, I already used ASE as an optional dependency in the GUI for the geometry viewer.

Don't understand me wrong. I'm very much pro integration with ASE as optional dependency. I also use ASE and I like it. But I don't like the idea to depend on it in such core functionality as loading and manegind atomic coordinates (i.e. not being able to run ppafm at all without ASE)

@NikoOinonen
Copy link
Collaborator

  1. One option is to make one dynamic class, which have optional fields (e.g. initialized to None). e.g. it may or may not have data grid, it may or may not have atoms, or lattice vectors.
  2. Another option would be to make hierarchy. E.g. starting from Atoms class, then add lattice vectors AtomsPBC and then ad grid (array) ... or in other order?
  3. Another option is to make 2-3 independnet classes e.g. (i) Atoms which strores atomic positions and types (ii) Grid which stores lattice vectors and optional data array. And them pack them together in e.g. AtomGrid class

I would avoid deep class hierarchies (2). They become very rigid and difficult to change when new needs arise, and it's often not clear at which level of the hierarchy some piece of data should exist.

So I would opt for composition instead (1, 3). I see 1 and 3 being structurally the same. E.g. either we store the lvec and grid as np.arrays or some custom classes of our own, but in any case they exist as properties of the storing class, and that structure can be changed internally flexibly.

I have to say, that one thing I hate about python is that due to dynamic initialization of class properties, it is often not clear what properties the class is supposed to have when reading the code (at least comparing to C++). This was actually main reason why I considered classes in python poinless for quite long time. Do you know some good practicies how to make the declaration of class properties in python more explicit?

One way to do this is to make all of the internal variables private (prefix with underscore _) and then make an interface by using the @property decorator for class methods. For example, in the DataGrid I store the host numpy array like this:

self._array = array

but it can also sometimes be None because it exists on the GPU instead. And then I create a property method that returns the array after checking that it's fine:

ppafm/ppafm/ocl/field.py

Lines 152 to 158 in 4e04d47

@property
def array(self):
'''Host array as np.ndarray. If the grid currently only exists on the device, it is copied to the host memory.'''
if self._array is None:
self._array = np.empty(self.shape, dtype=np.float32)
cl.enqueue_copy(oclu.queue, self._array, self._cl_array)
return self._array

The @property decorator makes it so that you can access the method as if it's just an attribute, e.g.

array = datagrid_instance.array

With the doc string in the method, it also makes for nice documentation, so the API is clear: https://ppafm.readthedocs.io/en/latest/ppafm.ocl.html#ppafm.ocl.field.DataGrid.array

@NikoOinonen
Copy link
Collaborator

I'm not sure which is best approach. Because python in very dynamic unlike C++, It woudl maybe slightly prefer (1) i.e. to have optional dynamic fields. The disadvantage is that it would require to put everywhere checks if some field is initialized, like if( lvec is None ):

We can probably make this simpler for ourselves by insisting that some of the fields always exist. The lvec we can for example read from the default parameters if it's not set, as Martin suggested above.

I think this consideration also encourages splitting the functionality into multiple classes (option 3.), because then if you only need the grid, for example, then you can just use the Grid class instead of the AtomGrid.

@mondracek
Copy link
Collaborator Author

  • I noticed that the use of classes for storing geometry has been already suggested by @ProkopHapala before, as issue Make class to store Geometry and Grid data #97. Perhaps we can close that issue noting that the discussion has been transferred here.
  • I have been recently trying to enable reading data from XSF files created by the FHI-AIMS code. There were some issues specific to FHI-AIMS that prevented loading them with the loadXSFGeom and loadXSF functions from ppafm/io.py as these function stand now. I will explain more when I start a dedicated issue about that. What I want to say for now: Working with the XSF files, I noticed that the vectors that define the grid can be in general something completely different from the lattice vectors that define the periodicity. In our code, we essentially work with the grid vectors and when we happen to need the lattice vectors, we pretend they are the same. I am not saying we need to change this approach right now, but we should keep in mind that we might need to differentiate between the grid vectors and lattice vectors in the future. So when rewriting the code, introducing new classes for the geometry etc., we should do it in a way that at least does not make future splitting of lattice vectors from grid vectors more difficult than necessary.

@ProkopHapala
Copy link
Collaborator

we pretend they are the same. I am not saying we need to change this approach right now, but we should keep in mind that we might need to differentiate between the grid vectors and lattice vectors in the future.

Not sure I understand.

  • Do you mean that there are situations (systems) where it would be usefull to distinguies the two ?
  • Or that in the code there is no checking mechanism which makes sure grid vectros and lattice vectros are the same (while they should be) ?

@mondracek
Copy link
Collaborator Author

@ProkopHapala: Both, sort of.

  • There may be cases, admittedly not particularly common, in which the distinction might be useful. For example, @NikoOinen mentioned to me when reviewing my PR Generate automatic lattice vectors in loadGeometry #216 that the user may be interested in a relatively small scanning region that does not cover the whole system. Then, it is enough if the grid includes just the scanning region plus some padding. This will more likely happen for large non-periodic systems but such scenario is not entirely excluded for periodic systems either. If atom-centered potentials like LJ are going to be used to calculate the force field and if the requested scanning region is situated at or near the periodic cell boundary, the lattice vectors of the true periodicity cannot be ignored even though the grid size should be smaller than the unit cell.
  • More often, the grid should cover exactly one periodic unit cell. For example, we rely on this when the scanning region crosses the unit cell boundary and the tip position rolls from one unit cell to another. But we do not check whether the grid axes defined in XSF files really agree with the lattice vectors in the same file, even if we expect that they do.
  • In files like OutFz.xsf, the grid axis in fact should not be the same as the lattice vectors. The grid axes determine the scanning region there and that is usually different from the system's periodic unit cell. At this moment, we simply do not write the lattice vectors into OutFz.xsf at all. That does not matter as long as we consider these OutF?.xsf files just a temporary storage of intermediate data produced by ppafm-relaxed-scan before they are passed to ppafm-plot-results. The ppafm-plot-results script does not need the information on lattice vectors. However, I do not like this approach in which the files like Outfz.xsf are seen as something internal to the ppafm code only. Why bother splitting the code into several scripts unless the output from each of the script can stand on its own? Once the user wishes to view and analyse this force fields separately, it would be desirable that the XSF files include information on the position of atoms and on the lattice vectors. We have in fact done the same for the FF*.xsf files already.

By now, I have started working on the problem of properly defining the grid origin, as needed to solve #223. On that occasion, I noticed there are parameters called FFgrid0, FFgridA, FFgridB, and FFgridC, the purpose of which is exactly to define grid geometry different from periodic lattice vectors, but only for the OCL version (see the discussion under #139). I do not really understand how the OCL version works, but can we perhaps adapt the solution used there for the CLI version too?

@ProkopHapala
Copy link
Collaborator

@ProkopHapala

the user may be interested in a relatively small scanning region that does not cover the whole system.

I see, make sens

Why bother splitting the code into several scripts unless the output from each of the script can stand on its own? Once the user wishes to view and analyse this force fields separately, it would be desirable that the XSF files include information on the position of atoms and on the lattice vectors.

I agree, I was also considering .xsf files as intermediate result which allows easy debugging and eventually some way to analyze the results. ... Maybe I just don't understand the proper format of .xsf files. I did not saw any problem there. If there is such problem it make sense to correct it.

@mondracek
Copy link
Collaborator Author

mondracek commented Nov 3, 2023

@ProkopHapala: As for the problems with the XSF format:

  • The missing atoms and lattice vectors are more an inconvenience rather than an error. Well, actually, some XSF viewers do require the atoms to be present; we have a placeholder hydrogen atom in the XSF file for that purpose. Would be better to have the real atomic structure of the sample there.
  • More closer to being a real bug is the blind adding of extra points on the boundary when saving the XSF file in our code and removing them when reading these files. This is okay internally for our code as it always adds those points when saving and always removes them when reading, so everything compensates. When reading XSF files generated by an external code, those are potentials or electron densities, typically generated by VASP for a periodic system. In such cases, the extra points are really present too, so it is okay that they get removed. But in OutFz.xsf and the like, they are usually nonsense.

Let me explain here how (to my understanding) the logic with the extra points on the grid cell boundary is supposed to work in the XSF files. First, if the length of a grid vector is a=(ax, ay, az) and there are Na points along that vector, the elementary grid step along that vector is da=(ax, ay, az)/(Na-1). This is always true. It is part of the definition of the XSF format and it is non-obvious, you can imagine a different format that would have Na instead of Na-1 in the denominator of that definition. For a general grid the dimensions of which have nothing to do with the periodic unit cell, this is end of the story. However, there is a little problem when we want the grid to cover exactly the unit cell. Imagine that a=(ax, ay, az) is now the lattice vector. We want to sample the cell by Na points along this vector with the grid points being (of course) equally spaced and, for starters, they are supposed to be unique, they do not repeat themselves because of the periodicity. The spacing between the points should of course be da=(ax, ay, az)/Na under this condition. So what should the corresponding grid vector be? Right, considering the formula that defines the relation between the grid vector and the grid spacing, it has to be a' = a*(Na-1)/Na

But this is inelegant, we would prefer to have a' = a and this is what most codes (but not FHI-AIMS!) generating XSF files for periodic structures chose to enforce. The price to pay is adding a last point along every axis that is an identical copy of the first point by the lattice periodicity. So now, we have Na +1 points instead of Na and the last point is kind of superfluous in terms of information content, but the lattice vectors and grid vectors are the same, which makes us happy, because we can immediately see (looking at the XSF file header) that the grid indeed covers the whole unit cell. So, to sum up

  • When writing the XSF files, the extra points should be added only if the grid covers exactly the entire periodic unit cell, otherwise, they make no sense and are actually wrong (because why should the last point be identical to the first one when you extend a grid that does not fit the periodicity of the structure?).
  • Similarly, when reading the XSF files, the end points should be discarded as the extra points only if the grid dimensions coincide with the lattice vectors. Now, the lattice vectors should be read right from the XSF file and compared to the grid vectors read from the same file (the XSF file contains both, if properly written, unlike the CUBE file).

@mondracek
Copy link
Collaborator Author

@ProkopHapala Is there any good reason why the order of indices for the grid array is FF[iz,iy,ix] rather than FF[ix,iy,iz] in the CLI version? I used to think this is needed for the CPP parts of the code to be efficient but now as I look deeper into it, that does not seem to be the case.

@ProkopHapala
Copy link
Collaborator

Is there any good reason why the order of indices for the grid array is FF[iz,iy,ix] rather than FF[ix,iy,iz] in the CLI version? I used to think this is needed for the CPP parts of the code to be efficient but now as I look deeper into it, that does not seem to be the case.

I think efficiency-wise it is rather opposite (we compute each z-stroke independently => better if z is the fastest axes in the array )

I think the reason for FF[iz,iy,ix] ordering was perhaps plotting ? (z-slices), but it is not very good motivation (matplotlib can plot FF[:,:,iz] as well as FF[iz,:,:] ). Maybe it is somehow useful in Fz->df conversion (but perhaps not), or for saving the data to .xsf ? I don't know. Eventually we may change the order ... but it would require changes on many places in the code ... co unless there is good reason .... ?

But

@mondracek
Copy link
Collaborator Author

@ProkopHapala, thanks for the explanation with respect to the array ordering. As for the Fz->df conversion and saving *.xsf files, I happen to know something about that stuff and no, the reversed index order is not needed for it. In the Fz to df conversion, it does not matter, the conversion is in the end just a convolution of Fz with the appropriate kernel along the z axis and we would literally only have to change axis=0 to axis=2 as an argument to an np.apply_along_axis function at one place in the code in order to adapt the Fz2df function to the natural ordering. In reading and saving *.xsf files, the reversed ordering actually complicates stuff because there is a difference between the CLI and OCL version in the array ordering so the loadXSF and loadCUBE functions need a xyz_order argument which tells them whether they should transpose the array or not. Now I see that saveXSF does not distinguish between CLI and OCL versions but rather always assumes the reverse order. I did not notice this last fact before. I do not know how it works in the OCL version, perhaps the ordering is really not consistent, arrays which we need to read to XSF are ordered one way, arrays we need to save to XSF the other way and we hope there is nothing we want to read, modify without doing the transposition and then save. I was about to offer to volunteer in changing the ordering so that it would always be FF[ix,iy,iz] but I thought I would not require touching the OCL part. Now that I see the issue affects both CLI and OCL versions, I am not sure I want to do it.

@NikoOinonen
Copy link
Collaborator

@mondracek As far as I understood, the zyx ordering in the IO functions was mostly because that is how the XSF file format is defined. I specifically added those xyz_order arguments to the functions so that I could get the arrays in the xyz order to use in the OCL code.

It should not be a big change. I think the only places in the OCL code that do IO related things are these two methods:

ppafm/ppafm/ocl/field.py

Lines 192 to 195 in 9f009ae

def from_file(cls, file_path, scale=1.0):
"""
Load grid data and atoms from a .cube or .xsf file.

ppafm/ppafm/ocl/field.py

Lines 227 to 230 in 9f009ae

def to_file(self, file_path, clamp=None):
"""
Save data grid to file(s).

The to_file method in particular transposes the array before saving.

@mondracek
Copy link
Collaborator Author

@NikoOinonen, thanks for the explanation. So @ProkopHapala was right after all that the likely original motivation of the zyx order in the CLI code was reading and saving XSF files. The native ordering in the XSF file is the opposite of what we'd need, either in CLI or OCL. Seems that historically, the approach in CLI was to accept that ordering and interpret the array indices as ZYX while transpose was performed for OCL. Okay, so we would not have to touch the OCL version at all in order to introduce consistent usage of the XYZ order but thorough changes would be needed in CLI. When done, the I/O functions would not have to differentiate between CLI and OCL and the CLI code would be a bit easier to understand but there is a big danger of introducing errors into the CLI code while implementing the change. Though the errors should be easy to spot by testing, a wrong data order on the grid would destroy the resulting figures completely. What do you think, is it worth the effort? I will not push the change at this moment myself but if anyone else would like it to be done, I am still ready to implement it.

@mondracek
Copy link
Collaborator Author

mondracek commented Nov 10, 2023

As for the OutFz.xsf files I mentioned earlier in this thread, I see that the inclusion of the sample atomic structure in those files has been already implemented. I was probably looking at files produced by an old version of the code when writing my earlier comments. But the extra points at the scanning region boundary are probably still there.

@ProkopHapala
Copy link
Collaborator

What do you think, is it worth the effort?

I think it is woth it.

I will not push the change at this moment myself but if anyone else would like it to be done, I am still ready to implement it.

if you implement it, I'm pro :-)
I think it is worth it, but perhaps does not have hi priority.

Though the errors should be easy to spot by testing,

Copy that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants