Conversation

alberto-carta

No description provided.

@mbercx
Member

mbercx commented Jul 16, 2025

Thanks @alberto-carta! @bastonero as our resident Hubbard expert, can you have a look at this?

@alberto-carta I'd also like to add some tests for the new parser functionality. The setup is a bit tricky, so I will take care of it, but it would be very helpful to have a minimal example, i.e. a run for a small structure with few k-points etc, so we don't have to add a file with 100k lines. ^^

@bastonero
Collaborator

Thanks @alberto-carta for the contribution! Some time ago we decided with @sphuber NOT to store the occupation matrices, as these are already saved in the XML file. This avoids duplicating data.

A better solution, I think, would be to add an extra function to the PwCalculationTools, and later also to a future PwWorkChainTools (see aiidateam/aiida-core#6884). I already have a working function, which I used for the aiida-hubbard paper, that we can try to improve. I put the code after highlighting the advantages of this approach.

A dedicated function in PwCalculationTools (and in a future WorkChainTools) would allow us to:

  1. Avoid parsing the ever-changing pw.x stdout.
  2. Avoid adding a huge number of lines to the already complicated parser.
  3. Avoid storing unnecessary data that is already saved in the XML.
  4. Have a more robust implementation that accounts for any shape of the occupation matrices.
  5. Provide faster and more accurate occupation matrices with all the necessary digits (say, to do LR-cDFT).
def get_occupations(node) -> list:
    """Return the occupations for a PwCalculation/PwBaseWorkChain node."""
    import numpy as np
    from xml.etree import ElementTree
    from xmlschema import XMLSchema

    from aiida_quantumespresso.parsers.parse_xml.versions import get_schema_filepath

    with node.outputs.retrieved.base.repository.open('data-file-schema.xml') as xml_file:
        xml_parsed = ElementTree.parse(xml_file)

    schema_filepath = get_schema_filepath(xml_parsed)
    xsd = XMLSchema(schema_filepath)
    xml_dictionary = xsd.to_dict(xml_parsed, validation='skip')

    # --- Play around here and re-arrange data as you wish
    ns_matrices, species, indices, spins = [], [], [], []

    for hubbard_ns in xml_dictionary['output']['dft']['dftU']['Hubbard_ns']:
        shape = hubbard_ns['@dims']
        order = hubbard_ns['@order']  # 'F' (Fortran) or 'C' ordering of the flattened data
        ns = np.array(hubbard_ns['$']).reshape(shape, order=order)
        ns_matrices.append(ns)
        species.append(hubbard_ns['@specie'])
        indices.append(hubbard_ns['@index'] - 1)  # convert Fortran 1-based to Python 0-based
        spins.append(hubbard_ns['@spin'])

    return np.array(ns_matrices)  # we should agree here how to give it in output

So ideally we would like to implement this in the PwCalculationTools, and have an API that looks like the following to account for some use-cases:

pw_calc_node.tools.get_occupations()
pw_calc_node.tools.get_occupations(kind_name='Fe1')
pw_calc_node.tools.get_occupations(symbol='Fe')
pw_calc_node.tools.get_occupations(atom_index=0)
pw_calc_node.tools.get_occupations(atom_index=0, spin=1)
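A minimal sketch of how the selector filtering behind such an API could work. The data layout (parallel lists of matrices and metadata) follows the get_occupations snippet above; all names and the toy values are illustrative, not the actual PwCalculationTools API:

```python
import numpy as np

def filter_occupations(ns_matrices, species, indices, spins,
                       kind_name=None, atom_index=None, spin=None):
    """Return only the occupation matrices matching the given selectors."""
    selected = []
    for ns, kind, idx, s in zip(ns_matrices, species, indices, spins):
        if kind_name is not None and kind != kind_name:
            continue
        if atom_index is not None and idx != atom_index:
            continue
        if spin is not None and s != spin:
            continue
        selected.append(ns)
    return np.array(selected)

# Toy data: two atoms, two spin channels each, 2x2 matrices for brevity.
matrices = [np.eye(2) * v for v in (0.9, 0.1, 0.8, 0.2)]
species = ['Fe1', 'Fe1', 'O1', 'O1']
indices = [0, 0, 1, 1]
spins = [1, 2, 1, 2]

up_fe = filter_occupations(matrices, species, indices, spins,
                           kind_name='Fe1', spin=1)
```

With no selectors the function returns everything, mirroring the bare get_occupations() call in the proposal.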

Eigenvectors can be obtained simply by diagonalizing the small matrices on the fly, using e.g. numpy:

import numpy as np

occupations = pw_calc_node.tools.get_occupations(atom_index=0, spin=1)
eigenvalues, eigenvectors = np.linalg.eigh(occupations)

We should also agree on how to return the matrices. It would probably be great to devise a new class to facilitate usage and put it in common.hubbard or utils.hubbard.

I won't have time for several months to work on this, but I strongly recommend that we use this approach rather than parsing the stdout. Happy if somebody wants to go ahead with my recommendations, and happy to discuss!

@alberto-carta
Author

Hey @bastonero, thanks for the thorough review and the proposed alternative! I understand the design philosophy behind avoiding data duplication and the benefits of using a PwCalculationTools function for a more robust and flexible solution.

However, I'd like to highlight a couple of points regarding the current implementation and user expectations. My PR extends an existing feature (which was broken for me, perhaps because it was based on the previous DFT+U output format) that requires users to explicitly enable the parsing of these occupation matrices. Given this explicit user action, it's reasonable for them to expect these values to be readily available in the output node, just as they were before.

These occupation matrices are a crucial observable in Hubbard calculations. Directly exposing them in the output node, perhaps under output_atomic_occupations (as is done now) or nested within output_parameters (I'd personally lean towards the former for clarity), aligns, in my view, with the expected workflow for users who specifically request this data.

I totally agree about moving this parser to the XML reader instead of the raw file. I am basing a plugin of mine on my fork's modifications; as soon as I bring it out of the testing phase, I can rework it with that in mind.

Once we agree on the optimal format for these occupation matrices/dictionaries, I'd still advocate for them to be exposed directly in the output node (on request), making them easily accessible and discoverable.

pw_calc_node.tools.get_occupations()
pw_calc_node.tools.get_occupations(kind_name='Fe1')
pw_calc_node.tools.get_occupations(symbol='Fe')
pw_calc_node.tools.get_occupations(atom_index=0)
pw_calc_node.tools.get_occupations(atom_index=0, spin=1)

I appreciate the proposal for PwCalculationTools.get_occupations(), and I see its potential for flexibility. However, as a user, this approach for accessing fundamental outputs doesn't seem simple at all to me.

Will other common observables, like total energy, site magnetization, or site charge, also be retrieved exclusively through similar tools functions in the future? To me, key results, especially those explicitly requested by the user, should be readily accessible directly from the calculation's output nodes for simplicity and consistency.

We can avoid storing redundant data like eigenvectors and eigenvalues if you really want, but the latter, for instance, are crucial for seeing at a quick glance whether a system has orbital order. The trace of the matrix is also an important quantity.
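The trace mentioned above is trivial to compute once the matrix is available; a minimal illustration with made-up values:

```python
import numpy as np

# Illustrative values only: the trace of an occupation matrix gives the
# total occupation of that spin channel's correlated manifold.
occupation_matrix = np.array([
    [0.968, 0.0],
    [0.0, 0.667],
])
total_occupation = np.trace(occupation_matrix)
```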

I'm happy to discuss this further.

@alberto-carta
Author

alberto-carta commented Jul 16, 2025

So for instance, this now returns a big dictionary where each entry looks like this:

{
  "2": {
    "magnetic_moment": 0.58571,
    "occupations": {
      "down": 3.93578,
      "total": 8.45726,
      "up": 4.52148
    },
    "spin_data": {
      "down": {
        "eigenvalues": [0.035, 0.968, 0.968, 0.982, 0.982],
        "eigenvectors": [
          [-0.0, 0.976, -0.216, 0.004, -0.001],
          [-0.577, -0.002, -0.003, 0.492, 0.651],
          [-0.577, -0.001, 0.003, 0.318, -0.752],
          [-0.0, 0.216, 0.976, 0.001, 0.004],
          [0.577, -0.003, 0.001, 0.81, -0.101]
        ],
        "occupation_matrix": [
          [0.968, 0.0, 0.0, -0.0, 0.0],
          [0.0, 0.667, -0.316, 0.0, 0.316],
          [0.0, -0.316, 0.667, -0.0, 0.316],
          [-0.0, 0.0, -0.0, 0.968, 0.0],
          [0.0, 0.316, 0.316, 0.0, 0.667]
        ]
      },
      "up": {
        "eigenvalues": [0.78, 0.78, 0.985, 0.985, 0.991],
        "eigenvectors": [
          [0.963, -0.233, -0.133, 0.003, -0.0],
          [0.075, 0.079, 0.386, -0.711, 0.577],
          [0.031, -0.104, 0.423, 0.69, 0.577],
          [0.233, 0.963, 0.003, 0.133, 0.0],
          [0.105, -0.026, 0.809, -0.021, -0.577]
        ],
        "occupation_matrix": [
          [0.784, -0.011, -0.011, -0.0, -0.022],
          [-0.011, 0.985, 0.003, -0.019, -0.003],
          [-0.011, 0.003, 0.985, 0.019, -0.003],
          [-0.0, -0.019, 0.019, 0.784, -0.0],
          [-0.022, -0.003, -0.003, -0.0, 0.985]
        ]
      }
    }
  }
}

This mimics the structure of the QE stdout, where the key "2" refers to "ATOM 2".

Can you specify how you want it? For instance, if the atom is Nickel, should it be something like this?

{
  "Ni2": {
    "magnetic_moment": 0.58571,
    "occupations": {
      "down": 3.93578,
      "total": 8.45726,
      "up": 4.52148
    },
    "occupation_matrices": {
      "down": {
        "occupation_matrix": [
          [0.968, 0.0, 0.0, -0.0, 0.0],
          [0.0, 0.667, -0.316, 0.0, 0.316],
          [0.0, -0.316, 0.667, -0.0, 0.316],
          [-0.0, 0.0, -0.0, 0.968, 0.0],
          [0.0, 0.316, 0.316, 0.0, 0.667]
        ]
      },
      "up": {
        "occupation_matrix": [
          [0.784, -0.011, -0.011, -0.0, -0.022],
          [-0.011, 0.985, 0.003, -0.019, -0.003],
          [-0.011, 0.003, 0.985, 0.019, -0.003],
          [-0.0, -0.019, 0.019, 0.784, -0.0],
          [-0.022, -0.003, -0.003, -0.0, 0.985]
        ]
      }
    }
  }
}

In the case of SOC (Spin-Orbit Coupling), you will also have an off-diagonal down-up part.

@bastonero
Collaborator

Many thanks for the comments!

I agree that the functionality is broken and we should totally restore it. The problem with extending it is that the output format is a Dict object. This stores information directly in the database, and will pollute and slow down queries in AiiDA. Therefore, my suggestion would be to keep it as simple as it was before, avoiding adding extra information until we have thought it through. For sure, we should not use the stdout at all, since a major effort was also made on the QE side to store the occupations in the XML file.

In passing, indices should refer to Python indices (i.e. consistent with StructureData), not Fortran indices. I don't know whether the old implementation kept Fortran indices, but that would be bad, as we should stay consistent with Python and AiiDA, not with QE (see e.g. the entire Hubbard implementation).
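A tiny illustration of the convention being discussed (values are just an example):

```python
# QE labels atoms with 1-based (Fortran) indices, while AiiDA's
# StructureData sites are 0-based, so outputs should shift by one.
qe_atom_index = 2                         # "ATOM 2" in the pw.x output
structure_site_index = qe_atom_index - 1  # corresponding StructureData site
```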

However, as a user, this approach for accessing fundamental outputs doesn't seem simple at all to me.

Could you be more specific? Maybe you mean "intuitive"? Length-wise it is as long as accessing a usual output. Ideally, this will be more powerful and useful than returning a dictionary that the user then has to figure out how to use, compared to a simple function (or several) that already does the job.

Of course, this should be accompanied with suitable documentation.

On the one hand, I understand your point about having the outputs ready where one would expect them. On the other hand, much other information that QE spits out is not "properly" stored and shown in the .outputs. So I think we should weigh whether it is worth duplicating the data.
For instance, I understand why one should also (re)store the trajectory, comprising energy, forces, etc., as you may want to query for it. So probably the right question is: do we want to query for this piece of data?

Probably yes, but I would still avoid the way it is extended in this PR, and instead devise e.g. a new Data class that stores the data as a JSON file dumped in the repository. This would also give us much nicer access to the occupations, like:

pw_calc_node.outputs.occupations.asdict()
pw_calc_node.outputs.occupations.get_occupations(kind_name='Fe')
pw_calc_node.outputs.occupations.get_eigenvalues(kind_name='Fe')
pw_calc_node.outputs.occupations.get_eigenvectors(kind_name='Fe')
...
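For what it's worth, a plain-Python stand-in for such a class could look like the following. All names are hypothetical; a real implementation would subclass an AiiDA Data type and keep the JSON in the node repository, with a dict playing that role here:

```python
import json

import numpy as np

class OccupationsData:
    """Plain-Python stand-in for the proposed occupations Data class."""

    def __init__(self, payload: dict):
        # In a real AiiDA Data class the JSON would live in the node
        # repository; here it is simply held in memory.
        self._payload = payload

    def asdict(self) -> dict:
        return self._payload

    def get_occupations(self, kind_name: str) -> np.ndarray:
        return np.array(self._payload[kind_name]['occupation_matrix'])

    def get_eigenvalues(self, kind_name: str) -> np.ndarray:
        # Occupation matrices are Hermitian, so eigvalsh applies.
        return np.linalg.eigvalsh(self.get_occupations(kind_name))

    def get_eigenvectors(self, kind_name: str) -> np.ndarray:
        _, eigenvectors = np.linalg.eigh(self.get_occupations(kind_name))
        return eigenvectors

# Toy 2x2 payload, as if loaded from the JSON file in the repository.
payload = json.loads('{"Fe1": {"occupation_matrix": [[0.9, 0.0], [0.0, 0.1]]}}')
occ = OccupationsData(payload)
```

Computing eigenvalues and eigenvectors on demand like this keeps the stored JSON small while still making derived quantities one method call away.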

@mbercx
Member

mbercx commented Sep 15, 2025

@bastonero @alberto-carta what's the status on this? Any way I can help get this merged?
