Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Driver-portable walker logging #5019

Merged
merged 57 commits into from
Jun 13, 2024
Merged

Driver-portable walker logging #5019

merged 57 commits into from
Jun 13, 2024

Conversation

jtkrogel
Copy link
Contributor

@jtkrogel jtkrogel commented May 31, 2024

Proposed changes

This PR introduces a simplified version of the legacy TraceManager that focuses solely on writing per-walker data to HDF5 throughout the run (no streaming/listener functionality). The new implementation, WalkerLogManager, is compatible with and implemented in both the legacy and batched drivers.

This implementation extends the original to include the ability to write full data for the highest, lowest, and median energy walkers on each rank to disk at every step. The intention of writing this abbreviated data is to enable inexpensive access to data from problematic walkers (e.g. population explosions) in production runs.

Capabilities

  • Writing per-walker scalar data (local energy, etc)
  • (optional) writing per-walker particle data (positions, spins, gradient, laplacian)
  • (optional) writing full data for min/max/median energy walkers

Below is an example generated from a short DMC run of diamond on 6 MPI ranks showing the energies of all walkers (blue), highest energy walkers (green), lowest energy walkers (red), median energy walkers (black), vs the mean energy from dmc.dat (magenta line):

walker_traces

Design

  • Main classes: WalkerLogManager, WalkerLogCollector, WalkerLogBuffer
  • WalkerLogManager
    • Driver-level resource
    • Responsible for HDF file I/O
    • Finds min/max/median energy walker data
    • Writes walker data buffers from each crowd to HDF at the end of each block
  • WalkerLogCollector
    • Crowd-level resource
    • Collects per-walker data during MC steps and stores it in local data buffers
  • WalkerLogBuffer
    • Acts as a dynamic buffer for walker data collected during a block
    • Resource allocation is first-touch

Tests

  • Legacy and batched driver deterministic tests
  • 1-mpi/1-thread and 4-mpi/4-thread versions
  • VMC and DMC
  • The tests pass if the walker logging data can be used to reconstruct the information outputted to scalar.dat (VMC) or dmc.dat (DMC)

What type(s) of changes does this code introduce?

  • New feature

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

Laptop

Checklist

  • Yes. This PR adds tests to cover any new code, or to catch a bug that is being fixed

Path out of WIP

  • Manual entry
  • In-line documentation
  • Batched driver naming (Peter)
  • Input section ownership at driver-level (Peter)
  • WalkerLogCollector::collect() folded into Crowd (Peter)
  • Use of const in WalkerLogCollector::collect (Ye)
  • Naming conventions/rideTheCamel (Ye)
  • WalkerLogBuffer and WalkerLogCollector in separate h/cpp files (Ye)
  • Vacuous class for types, enum for const size and index names (Ye)
  • WalkerLogCollector unit test, collect method (Ye)

Tasks for review round 3

  • Make step counter seem like a global data member
  • Long names for crowd member functions
  • Make more getters()
  • Inline default values for WalkerLogState
  • Try higher level api for hdf (results in nan)

Elements for reviewers to review

  • WalkerLog* classes
  • WalkerLogCollector unit test
  • Driver integration

src/QMCDrivers/DMC/DMCBatched.h Outdated Show resolved Hide resolved
//add trace information
bool allow_walker_logs = das.walkerlogs_tag == "yes" ||
(das.walkerlogs_tag == "none" && (das.new_run_type == QMCRunType::VMC || das.new_run_type == QMCRunType::DMC || das.new_run_type == QMCRunType::VMC_BATCH || das.new_run_type == QMCRunType::DMC_BATCH));
new_driver->requestWalkerLogs(allow_walker_logs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Complicated control logic. Prefer not to have another switch. I would like to have the following.
if "walkerlog" appears globally, all the drivers honor it. if not, each driver honor the "walkerlog" within the driver input section.

Copy link
Contributor

@prckent prckent Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreeing with Ye here - we use similar logic for maxcpusecs

Copy link
Contributor Author

@jtkrogel jtkrogel Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The control logic protects drivers which either don't support walker logging, or shouldn't.

Traces were never enabled for optimizers, for instance. On another note, I really don't want to write code to support walker logging in all of the legacy drivers. CSVMC anyone?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further comments welcome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current walker logging support is limited to VMC/DMC and its functionality and implementation can be completely different for a non-VMC/DMC driver. So its input is more suited for a driver input section rather than globally. If only some VMC/DMC drivers need to turn on logging, the input should be in the driver input section. Right now, we provide convenience to have the input globally and I believe it is better not to another switch to turn it off locally. To me, that is a bad design logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the other protection, setting allow_walker_logs correctly to false for unsupported drivers is what makes the WalkerLogManager/WalkerLogCollector operations a no-op.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then default value of allow_walker_logs is already false to all the unsupported drivers. Why do we even need to set it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too much details. I would like to discuss the user-facing input logic first and then we can address the implementation issue, maybe at a later PR.

Are you OK with the following input logic

  1. if the global <walkerlog/> exists, turn on logging in all the supported drivers.
  2. when the global doesn't exist, legacy drivers do nothing. batched drivers will look for driver scope <walkerlog/> and act on it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will all be simpler once drivers actually parse their own input. Right now driver input parsing is handled in QMCDriverFactory. I suggest we focus on that problem and this one will sort itself out.

src/QMCDrivers/Crowd.h Outdated Show resolved Hide resolved
src/QMCDrivers/Crowd.h Outdated Show resolved Hide resolved
src/QMCDrivers/DMC/DMCBatched.cpp Show resolved Hide resolved
@ye-luo
Copy link
Contributor

ye-luo commented Jun 11, 2024

I have merged in develop and fixed cmake issues. Please pull before making further changes.

@jtkrogel
Copy link
Contributor Author

I've addressed Peter's comment from #5035 requesting simplification of WalkerLogState.

@jtkrogel
Copy link
Contributor Author

I've also tried Peter's suggestion to use f.writeEntry(buffer, "data") in WalkerLogBuffer::writeHDF. Making that replacement resulted in NaN's being written to the HDF files (picked up by integration test). Therefore I've kept the original implementation.

@jtkrogel
Copy link
Contributor Author

Comments addressed. Ready to proceed.

@ye-luo
Copy link
Contributor

ye-luo commented Jun 11, 2024

I've also tried Peter's suggestion to use f.writeEntry(buffer, "data") in WalkerLogBuffer::writeHDF. Making that replacement resulted in NaN's being written to the HDF files (picked up by integration test). Therefore I've kept the original implementation.

You need h5d_append and we don't have a high-level API in hdf_archive right now. added issue #5038. WalkerLogBuffer::appendHDF can be a better name to avoid confusion.

ye-luo
ye-luo previously approved these changes Jun 11, 2024
Copy link
Contributor

@ye-luo ye-luo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still quite a few fixes are needed but it is not necessary to hold the PR.

@ye-luo
Copy link
Contributor

ye-luo commented Jun 11, 2024

Test this please

Copy link
Contributor

@PDoakORNL PDoakORNL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to hold some sort of line to model good simulation object life cycle control and input handling for the batched code.

@@ -268,7 +268,6 @@ void OperatorBase::deleteTraceQuantities()
have_required_traces_ = false;
request_.reset();
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be in the change set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding an unecessary file to the PR file list for white space change has never been allowed. I have had to fix such changes many times. Finding these sort of typos takes effort by reviewers and they stack up over time.

src/Estimators/WalkerTraceBuffer.h Outdated Show resolved Hide resolved
src/Estimators/WalkerTraceBuffer.h Outdated Show resolved Hide resolved
src/QMCDrivers/Crowd.h Outdated Show resolved Hide resolved
@@ -75,6 +75,9 @@ class QMCMain : public MPIObjectBase, public QMCAppBase
///xml mcwalkerset read-in elements
std::vector<xmlNodePtr> walker_set_in_;

///walkerlogs xml
xmlNodePtr walker_logs_xml_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please do something like

std::optional<WalkerLogInput> walker_log_input_;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't like storing xml, parsing the xml in-place and store the result seems cleaner. We can make some improvements with a later PR.

@@ -304,6 +309,12 @@ bool VMCBatched::run()
IndexType num_blocks = qmcdriver_input_.get_max_blocks();
//start the main estimator
estimator_manager_->startDriverRun();
//start walker log manager
wlog_manager_ = std::make_unique<WalkerLogManager>(walker_logs_input, allow_walker_logs, get_root_name(), myComm);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please construct this in the QMCDriverNew constructor as EstimatorManagerNew is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not be constructed in the constructor. I have moved this line to process() in #5039 at the moment. When the global input doesn't exist, we may look it up locally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Process is also an API element that only exists for legacy compatibility, I'd be very suprised if construction actually needed to delayed until then, but lets discuss later.

/// whether to allow walker logs
bool allow_walker_logs;
/// walker logs input
WalkerLogInput walker_logs_input;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flag plus optional input object optional unique_pointer_ too much optional stuff. You can either check the presence of the object or the input, the allow is unecessary.

@@ -241,6 +249,10 @@ class QMCDriverNew : public QMCDriverInterface, public MPIObjectBase
void putTraces(xmlNodePtr txml) override {}
void requestTraces(bool allow_traces) override {}

void putWalkerLogs(xmlNodePtr wlxml) override;

void requestWalkerLogs(bool allow_walker_logs_) override { allow_walker_logs = allow_walker_logs_; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem to be API elements required by legacy, they shouldn't do anything for batched and run counter to clean construction. This should be documented here and and note to remove with the legacy drivers added.

}

void Crowd::stopBlock() { estimator_manager_crowd_.stopBlock(); }

void Crowd::collectStepWalkerLog(int current_step)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logWalkers(int current_step)
Seems to say everything necessary.

@@ -429,6 +432,14 @@ bool DMCBatched::run()
IndexType num_blocks = qmcdriver_input_.get_max_blocks();

estimator_manager_->startDriverRun();

//start walker log manager
wlog_manager_ = std::make_unique<WalkerLogManager>(walker_logs_input, allow_walker_logs, get_root_name(), myComm);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of the log manager being a run scope object is promising but then it should not be a member variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We may get a chance to make it a run scope object after restructuring certain bits. Will give it a try once #5039 got merged.

@jtkrogel
Copy link
Contributor Author

Hi guys. I had thought this was going in given Ye's approval.

If there really is a short list of "must have" items still remaining for this PR after all these passes, please make a short, final list below and I will address them.

I need to get past this point and get back to other work, some of which depends on this code being in place.

Copy link
Contributor

@ye-luo ye-luo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t feel there are blocking issues although many improvements are still needed.

@PDoakORNL
Copy link
Contributor

Reverting the whitespace OperatorBase.cpp change to reduce the number files is the only must have.

@ye-luo
Copy link
Contributor

ye-luo commented Jun 13, 2024

Test this please

@ye-luo ye-luo dismissed prckent’s stale review June 13, 2024 18:50

The requested renaming has been completed.

@ye-luo ye-luo enabled auto-merge June 13, 2024 18:51
@ye-luo ye-luo merged commit a1cdeb2 into QMCPACK:develop Jun 13, 2024
38 checks passed
@jtkrogel jtkrogel deleted the walker_traces branch November 19, 2024 22:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants