diff --git a/examples/neural_quantum_states/README.md b/examples/neural_quantum_states/README.md new file mode 100644 index 0000000..fb4f2c3 --- /dev/null +++ b/examples/neural_quantum_states/README.md @@ -0,0 +1,50 @@ +# AUTOREGRESSIVE NEURAL QUANTUM STATES FOR QUANTUM CHEMISTRY + +This repository contains code jointly developed by the University of Michigan and SandboxAQ to implement the retentive network (RetNet) neural quantum states ansatz outlined in the paper "Retentive Neural Quantum States: Efficient Ansatze for Ab Initio Quantum Chemistry," by Oliver Knitter, Dan Zhao, James Stokes, Martin Ganahl, Stefan Leichenauer, and Shravan Veerapaneni. + +Preprint available on the arXiv: https://arxiv.org/abs/2411.03900 + +Corresponding Author: Oliver Knitter, knitter@umich.edu + +This repository is based on the code released by Tianchen Zhao (https://github.com/Ericolony/made-qchem) alongside the paper "Scalable neural quantum states architecture for quantum chemistry" (https://arxiv.org/abs/2208.05637). This code uses neural quantum states (NQS), implemented in PyTorch, to calculate electronic ground-state energies for second-quantized molecular Hamiltonians, which are constructed using PySCF and Tangelo (https://github.com/sandbox-quantum/Tangelo/). The RetNet ansatz is implemented using the yet-another-retnet repository (https://github.com/fkodom/yet-another-retnet); MADE and Transformer ansatze, implemented natively in PyTorch, are also available. Hamiltonian expectation values are estimated following the procedure outlined in Zhao et al.'s paper, but the modular structure of the code allows relatively simple plug-and-play implementations of different models and Hamiltonian expectation estimators. + +## Usage + +### 1. Environment Setup + +- The environment requirements are available in the nqs-qchem.yml file. + +- Make sure your operating system meets the requirements for PyTorch 2.5.1 and CUDA 11.8. + +- If setting up the environment through the .yml file does not work, the primary dependencies can be installed as follows: + + ``` + conda install python==3.9.18 + conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia + conda install tensorboardX + conda install pytorch-model-summary + conda install yacs + pip install yet-another-retnet + pip install tangelo-gc + pip install --prefer-binary pyscf + ``` + +### 2. Running the Program + +To view a list of previously logged molecule names, without running the main program, run + +``` +python -m main --list_molecules +``` + +If a desired molecule name is not stored by the program, running the program with that name and the corresponding PubChem CID will link the name and ID in the program's internal name storage. The list of supported orbital basis sets is available in the PySCF documentation. The program itself is run by executing the following script: + +``` +./run.sh +``` + +The run script adapts to the number of GPUs specified for use, but it does so in a naive way: it presumes that the total number of GPUs used by the program equals the total number of GPUs available on the current machine, and it calls on all of those GPUs to run the program. It is the user's responsibility to ensure these parameters are adequate and to modify them if necessary (to run the program across multiple nodes, for instance); an example of overriding these settings is shown below. 
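+For example, to pin a run to exactly two GPUs rather than every GPU on the machine, the GPU-detection line in run.sh could be replaced with hard-coded values along the following lines. This is only a sketch: the device indices, molecule, and basis shown here are illustrative choices, not defaults, while the config keys (DDP.WORLD_SIZE, DDP.LOCAL_WORLD_SIZE, etc.) are the same ones run.sh already passes on the command line:
+
+```
+# Use exactly two GPUs instead of the nvidia-smi count
+LWS_VAR=2
+CUDA_VISIBLE_DEVICES=0,1 python -m main --config_file './exp_configs/nqs.yaml' \
+    DDP.WORLD_SIZE $LWS_VAR DDP.NODE_IDX 0 DDP.LOCAL_WORLD_SIZE $LWS_VAR \
+    MODEL.MODEL_NAME 'retnet' DATA.MOLECULE 'H2O' DATA.BASIS 'sto-3g'
+```
+
+Any option given on the command line in this way overrides the corresponding value in the .yaml configuration file.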
+Input configuration flags are generally recorded in .yaml files within the directory ./exp_configs/ and are referenced as needed by the run script. Some configurations are expected to be specified directly in the run script; in principle, every configuration could be set there, since command-line options supersede the .yaml file, but this is not the intended mode of use. + +### 3. Logging Outputs +All logged results and outputs produced by the program can be found under './logger/'. Each run of the program appends the best result obtained over the specified number of trials to './logger/results.txt' (which is not automatically erased). Each run also generates its own output directory within './logger/' that stores a copy of the .yaml file containing the input configurations that were used (configurations not specified by the file are included in the name of the directory). A dictionary containing basic logging information for all trials is saved as 'result.npy', and TensorBoard files containing the same information are saved as well. The model corresponding to the best score across all trials is also saved here. Different save locations for all these data can be specified using existing input configurations. diff --git a/examples/neural_quantum_states/__init__.py b/examples/neural_quantum_states/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/examples/neural_quantum_states/__pycache__/config.cpython-312.pyc b/examples/neural_quantum_states/__pycache__/config.cpython-312.pyc new file mode 100644 index 0000000..e9110ff Binary files /dev/null and b/examples/neural_quantum_states/__pycache__/config.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/__pycache__/config.cpython-38.pyc b/examples/neural_quantum_states/__pycache__/config.cpython-38.pyc new file mode 100644 index 0000000..f7964ab Binary files /dev/null and b/examples/neural_quantum_states/__pycache__/config.cpython-38.pyc differ diff --git a/examples/neural_quantum_states/__pycache__/config.cpython-39.pyc b/examples/neural_quantum_states/__pycache__/config.cpython-39.pyc new file mode 100644 index 0000000..08a9696 Binary files /dev/null and b/examples/neural_quantum_states/__pycache__/config.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/__pycache__/main.cpython-310.pyc b/examples/neural_quantum_states/__pycache__/main.cpython-310.pyc new file mode 100644 index 0000000..605d034 Binary files /dev/null and b/examples/neural_quantum_states/__pycache__/main.cpython-310.pyc differ diff --git a/examples/neural_quantum_states/__pycache__/main.cpython-312.pyc b/examples/neural_quantum_states/__pycache__/main.cpython-312.pyc new file mode 100644 index 0000000..478f410 Binary files /dev/null and b/examples/neural_quantum_states/__pycache__/main.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/__pycache__/main.cpython-38.pyc b/examples/neural_quantum_states/__pycache__/main.cpython-38.pyc new file mode 100644 index 0000000..1debd94 Binary files /dev/null and b/examples/neural_quantum_states/__pycache__/main.cpython-38.pyc differ diff --git a/examples/neural_quantum_states/__pycache__/main.cpython-39.pyc b/examples/neural_quantum_states/__pycache__/main.cpython-39.pyc new file mode 100644 index 0000000..e17dd24 Binary files /dev/null and b/examples/neural_quantum_states/__pycache__/main.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/config.py b/examples/neural_quantum_states/config.py new file mode 100644 index 0000000..384fa56 --- /dev/null +++ 
b/examples/neural_quantum_states/config.py @@ -0,0 +1,119 @@ +from yacs.config import CfgNode as CN +# yacs official github page https://github.com/rbgirshick/yacs + +_C = CN() +''' Miscellaneous ''' +_C.MISC = CN() +# Random seed +_C.MISC.RANDOM_SEED = 0 +# Logger path +_C.MISC.DIR = '' +# Saved models path +_C.MISC.SAVE_DIR = '' +# Number of trials +_C.MISC.NUM_TRIALS = 0 +# DDP seeds +_C.MISC.SAME_LOCAL_SEED = False + +''' Training hyper-parameters ''' +_C.TRAIN = CN() +# Learning rate +_C.TRAIN.LEARNING_RATE = 0.0 +# Optimizer weight decay +_C.TRAIN.WEIGHT_DECAY = 0.0 +# Training optimizer name +# Available choices: ['adam', 'sgd', 'adadelta', 'adamax', 'adagrad', 'nadam', 'radam', 'adamw'] +_C.TRAIN.OPTIMIZER_NAME = '' +# Learning rate scheduler name +# Available choices: ['decay', 'cyclic', 'trap', 'const', 'cosine', 'cosine_warm'] +_C.TRAIN.SCHEDULER_NAME = '' +# Training batch size +_C.TRAIN.BATCH_SIZE = 0 +# Number of training epochs +_C.TRAIN.NUM_EPOCHS = 0 +# How many iterations between recalculation of Hamiltonian energy +_C.TRAIN.RETEST_INTERVAL = 1 +# Whether to use entropy regularization +_C.TRAIN.ANNEALING_COEFFICIENT = 0.0 +# Entropy regularization constant schedule type +_C.TRAIN.ANNEALING_SCHEDULER = 'none' +# Largest number of samples that each GPU is expected to process at one time +_C.TRAIN.MINIBATCH_SIZE = 1 + +''' Model hyper-parameters ''' +_C.MODEL = CN() +# The name of the model +# Available choices: ['made', 'transformer', 'retnet'] +_C.MODEL.MODEL_NAME = '' +# The number of hidden layers in MADE/Phase MLP +_C.MODEL.MADE_DEPTH = 1 +# Hidden layer size in MADE/Phase MLP +_C.MODEL.MADE_WIDTH = 64 +# Embedding/Hidden State Dimension for Transformer/RetNet +_C.MODEL.EMBEDDING_DIM = 32 +# Number of Attention/Retention Heads per Transformer/RetNet layer +_C.MODEL.ATTENTION_HEADS = 4 +# Feedforward Dimension for Transformer/RetNet +_C.MODEL.FEEDFORWARD_DIM = 512 +# Number of Transformer/RetNet layers +_C.MODEL.TRANSFORMER_LAYERS = 1 +# Parameter std initialization +_C.MODEL.INIT_STD = 0.1 +# Model temperature parameter +_C.MODEL.TEMPERATURE = 1.0 + +''' Data hyper-parameters ''' +_C.DATA = CN() +# Minimum nonunique batch size for autoregressive sampling +_C.DATA.MIN_NUM_SAMPLES = 1e2 +# Maximum nonunique batch size for autoregressive sampling +_C.DATA.MAX_NUM_SAMPLES = 1e12 +# Molecule name +_C.DATA.MOLECULE = '' +# Pubchem molecular compound CID (only needed if name does not exist in shelf) +_C.DATA.MOLECULE_CID = 0 +# Basis ['STO-3G', '3-21G', '6-31G', '6-311G*', '6-311+G*', '6-311++G**', '6-311++G(2df,2pd)', '6-311++G(3df,3pd)', 'cc-pVDZ', 'cc-pVDZ-DK', 'cc-pVTZ', 'cc-pVQZ', 'aug-cc-pCVQZ'] +_C.DATA.BASIS = '' +# Prepare FCI result; not recommended for medium-to-large systems +_C.DATA.FCI = False +# Choice of Hamiltonian to use for training: ['adaptive_shadows', 'exact', 'sample'] +_C.DATA.HAMILTONIAN_CHOICE = 'exact' +# Sample batch size for Hamiltonian sampling +_C.DATA.HAMILTONIAN_BATCH_SIZE = 10 +# Number of unique samples in Hamiltonian sampling +_C.DATA.HAMILTONIAN_NUM_UNIQS = 500 +# Probability of resampling estimated Hamiltonian +_C.DATA.HAMILTONIAN_RESET_PROB = 0.01 +# Number of batches that the flipped input states are split into during local energy calculation +_C.DATA.HAMILTONIAN_FLIP_BATCHES = 1 + +''' Evaluation hyper-parameters ''' +_C.EVAL = CN() +# Loading path of the saved model +_C.EVAL.MODEL_LOAD_PATH = '' +# Name of the results logger +_C.EVAL.RESULT_LOGGER_NAME = './results/results.txt' + +''' DistributedDataParallel ''' +_C.DDP = CN() +# Node 
number globally +_C.DDP.NODE_IDX = 0 +# This needs to be explicitly passed in +_C.DDP.LOCAL_WORLD_SIZE = 1 +# Total number of GPUs +_C.DDP.WORLD_SIZE = 1 +# Master address for communication +_C.DDP.MASTER_ADDR = 'localhost' +# Master port for communication +_C.DDP.MASTER_PORT = 12355 + + +def get_cfg_defaults(): + """Get a yacs CfgNode object with default values for the project.""" + # Return a clone so that the defaults will not be altered + # This is for the "local variable" use pattern + return _C.clone() + +# Alternatively, provide a way to import the defaults as +# a global singleton: +# cfg = _C # users can `from config import cfg` diff --git a/examples/neural_quantum_states/exp_configs/nqs.yaml b/examples/neural_quantum_states/exp_configs/nqs.yaml new file mode 100644 index 0000000..d9af66f --- /dev/null +++ b/examples/neural_quantum_states/exp_configs/nqs.yaml @@ -0,0 +1,54 @@ + +MISC: + RANDOM_SEED: 667 + DIR: '' + SAVE_DIR: '' + NUM_TRIALS: 1 + SAME_LOCAL_SEED: 'True' + +DDP: + NODE_IDX: -1 + LOCAL_WORLD_SIZE: -1 + WORLD_SIZE: -1 + MASTER_ADDR: 'localhost' + MASTER_PORT: 12355 + +TRAIN: + OPTIMIZER_NAME: 'adamw' + SCHEDULER_NAME: 'cosine_warmup' + LEARNING_RATE: 0.0025 + WEIGHT_DECAY: 0.05 + NUM_EPOCHS: 100 + RETEST_INTERVAL: 10 + BATCH_SIZE: 2000 + MINIBATCH_SIZE: 2000 + ANNEALING_COEFFICIENT: 1.0 + ANNEALING_SCHEDULER: 'quartic' + +MODEL: + MODEL_NAME: '' + MADE_DEPTH: 2 + MADE_WIDTH: 64 + EMBEDDING_DIM: 16 + ATTENTION_HEADS: 2 + FEEDFORWARD_DIM: 64 + TRANSFORMER_LAYERS: 1 + INIT_STD: 0.1 + TEMPERATURE: 1.0 + +DATA: + MIN_NUM_SAMPLES: 1e2 + MAX_NUM_SAMPLES: 1e12 + MOLECULE: 'H2' + MOLECULE_CID: 0 + BASIS: 'sto-3g' + FCI: True + HAMILTONIAN_CHOICE: 'exact' + HAMILTONIAN_FLIP_BATCHES: 8 + HAMILTONIAN_BATCH_SIZE: 10000 + HAMILTONIAN_NUM_UNIQS: 1000 + HAMILTONIAN_RESET_PROB: 0.01 + +EVAL: + MODEL_LOAD_PATH: '' + RESULT_LOGGER_NAME: './logger/results.txt' diff --git a/examples/neural_quantum_states/main.py b/examples/neural_quantum_states/main.py new file mode 100644 index 0000000..45f2739 --- /dev/null +++ b/examples/neural_quantum_states/main.py @@ -0,0 +1,149 @@ +import os +import argparse +import shelve +import logging +import numpy as np +import torch.multiprocessing as mp + +from config import get_cfg_defaults # local variable usage pattern +from src.util import prepare_dirs, set_seed, write_file, folder_name_generator, setup, cleanup +from src.training.train import train + +def run_trials(rank: int, cfg: argparse.Namespace): + ''' + Runs the train function for cfg.MISC.NUM_TRIALS times with identical input configurations (except for the random seed, which is altered). 
+ Args: + rank: Identifying index of GPU running process + cfg: Input configurations + Returns: + None + ''' + local_rank = rank + # set up configurations + directory = cfg.MISC.DIR + molecule_name = cfg.DATA.MOLECULE + num_trials = cfg.MISC.NUM_TRIALS + random_seed = cfg.MISC.RANDOM_SEED + result_logger_name = cfg.EVAL.RESULT_LOGGER_NAME + node_idx = cfg.DDP.NODE_IDX + local_world_size = cfg.DDP.LOCAL_WORLD_SIZE + world_size = cfg.DDP.WORLD_SIZE + global_rank = node_idx * local_world_size + local_rank + master_addr = cfg.DDP.MASTER_ADDR + master_port = cfg.DDP.MASTER_PORT + use_same_local_seed = cfg.MISC.SAME_LOCAL_SEED + logging.info(f"Running DDP on rank {global_rank}.") + if world_size > 1: + setup(global_rank, world_size, master_addr, master_port) + if global_rank == 0: + prepare_dirs(cfg) + best_score = float('inf')*np.ones(3) + avg_time_elapsed = 0.0 + result_dic = {} + write_file(result_logger_name, f"=============== {directory.split('/')[-1]} ===============", global_rank) + current_model_path = os.path.join(cfg.MISC.SAVE_DIR, 'last_model.pth') + best_model_path = os.path.join(cfg.MISC.SAVE_DIR, 'best_model_pth') + for trial in range(num_trials): + seed = random_seed + trial * 1000 + # set random seeds + set_seed(seed + (0 if use_same_local_seed else global_rank)) + new_score, time_elapsed, dic = train(cfg, local_rank, global_rank) + new_score = np.array(new_score) + result_log = f"[{molecule_name}] Score {new_score}, Time elapsed {time_elapsed:.4f}" + write_file(result_logger_name, f"Trial - {trial+1}", global_rank) + write_file(result_logger_name, result_log, global_rank) + if new_score[0] < best_score[0]: + best_score = new_score + if global_rank == 0: + os.system(f'mv "{current_model_path}" "{best_model_path}"') + avg_time_elapsed += time_elapsed / num_trials + if dic is not None: + for key in dic: + if key in result_dic: + result_dic[key] = np.concatenate((result_dic[key], np.expand_dims(dic[key], axis=0)), axis=0) + else: + result_dic[key] = np.expand_dims(dic[key], axis=0) + result_log = f"[{directory.split('/')[-1]}][{molecule_name}] Best Score {best_score}, Time elapsed {avg_time_elapsed:.4f}, over {num_trials} trials" + write_file(result_logger_name, result_log, global_rank) + if global_rank == 0: + np.save(os.path.join(directory, 'result.npy'), result_dic) + if world_size > 1: + cleanup() + + +def main(cfg: argparse.Namespace): + ''' + Main function for NQS program + Args: + cfg: input configurations + Returns: + None + ''' + world_size = cfg.DDP.WORLD_SIZE + local_world_size = cfg.DDP.LOCAL_WORLD_SIZE + if world_size > 1: + mp.spawn(run_trials, args=(cfg,), nprocs=local_world_size, join=True) + else: + run_trials(0, cfg) + logging.info('--------------- Finished ---------------') + + +if __name__ == '__main__': + # command line arguments + parser = argparse.ArgumentParser(description="Command-Line Options") + parser.add_argument( + "--config_file", + default="", + metavar="FILE", + help="Path to the yaml config file", + type=str, + ) + parser.add_argument( + "opts", + help="Modify config options using the command-line", + default=None, + nargs=argparse.REMAINDER, + ) + parser.add_argument( + '--list_molecules', + action='store_true', + default=False, + help='List saved molecules (instead of running program)', + ) + args = parser.parse_args() + # Remainder of program does not run if program is asked to list molecules + if args.list_molecules: + molecule_list = 'PubChem IDs stored by program:' + with shelve.open('./src/data/pubchem_IDs/ID_data') as pubchem_ids: + for 
name, id in pubchem_ids.items(): + molecule_list+=('\n{}: {}'.format(name, id)) + print(molecule_list+'\nStored names can be used as input arguments to access corresponding IDs.') + else: + # configurations + cfg = get_cfg_defaults() + cfg.merge_from_file(args.config_file) + cfg.merge_from_list(args.opts) + # set up directories (cfg.MISC.DIR) + log_folder_name = folder_name_generator(cfg, args.opts) + if cfg.MISC.DIR == '': + cfg.MISC.DIR = './logger/{}'.format(log_folder_name) + if cfg.MISC.SAVE_DIR == '': + cfg.MISC.SAVE_DIR = './logger/{}'.format(log_folder_name) + os.makedirs(cfg.MISC.DIR, exist_ok=True) + os.makedirs(cfg.MISC.SAVE_DIR, exist_ok=True) + os.system(f'cp "{args.config_file}" "{cfg.MISC.DIR}"') + # freeze the configurations + cfg.freeze() + # set logger + logging.root.handlers = [] + logging.basicConfig( + level=logging.INFO, + format="%(asctime)s [%(levelname)s] %(message)s", + handlers=[ + logging.FileHandler(os.path.join(cfg.MISC.DIR, 'logging.log')), + logging.StreamHandler() + ] + ) + # run program + main(cfg) + diff --git a/examples/neural_quantum_states/nqs-qchem.yml b/examples/neural_quantum_states/nqs-qchem.yml new file mode 100644 index 0000000..c8cfba3 --- /dev/null +++ b/examples/neural_quantum_states/nqs-qchem.yml @@ -0,0 +1,190 @@ +name: nqs-qchem +channels: + - pytorch + - nvidia + - conda-forge + - defaults +dependencies: + - _libgcc_mutex=0.1=conda_forge + - _openmp_mutex=4.5=2_gnu + - aom=3.6.1=h59595ed_0 + - blas=1.0=mkl + - brotli-python=1.1.0=py39hf88036b_2 + - bzip2=1.0.8=h4bc722e_7 + - ca-certificates=2024.12.14=hbcca054_0 + - certifi=2024.12.14=pyhd8ed1ab_0 + - cffi=1.17.1=py39h15c3d72_0 + - charset-normalizer=3.4.0=pyhd8ed1ab_1 + - colorama=0.4.6=pyhd8ed1ab_1 + - cuda-cudart=11.8.89=0 + - cuda-cupti=11.8.87=0 + - cuda-libraries=11.8.0=0 + - cuda-nvrtc=11.8.89=0 + - cuda-nvtx=11.8.86=0 + - cuda-runtime=11.8.0=0 + - cuda-version=12.6=3 + - ffmpeg=4.4.2=gpl_hdf48244_113 + - filelock=3.16.1=pyhd8ed1ab_1 + - font-ttf-dejavu-sans-mono=2.37=hab24e00_0 + - font-ttf-inconsolata=3.000=h77eed37_0 + - font-ttf-source-code-pro=2.038=h77eed37_0 + - font-ttf-ubuntu=0.83=h77eed37_3 + - fontconfig=2.15.0=h7e30c49_1 + - fonts-conda-ecosystem=1=0 + - fonts-conda-forge=1=0 + - freetype=2.12.1=h267a509_2 + - gettext=0.22.5=he02047a_3 + - gettext-tools=0.22.5=he02047a_3 + - giflib=5.2.2=hd590300_0 + - gmp=6.3.0=hac33072_2 + - gmpy2=2.1.5=py39h7196dd7_3 + - gnutls=3.7.9=hb077bed_0 + - h2=4.1.0=pyhd8ed1ab_1 + - hpack=4.0.0=pyhd8ed1ab_1 + - hyperframe=6.0.1=pyhd8ed1ab_1 + - idna=3.10=pyhd8ed1ab_1 + - intel-openmp=2022.0.1=h06a4308_3633 + - jinja2=3.1.4=pyhd8ed1ab_1 + - lame=3.100=h166bdaf_1003 + - lcms2=2.16=hb7c19ff_0 + - ld_impl_linux-64=2.43=h712a8e2_2 + - lerc=4.0.0=h27087fc_0 + - libasprintf=0.22.5=he8f35ee_3 + - libasprintf-devel=0.22.5=he8f35ee_3 + - libblas=3.9.0=16_linux64_mkl + - libcblas=3.9.0=16_linux64_mkl + - libcublas=11.11.3.6=0 + - libcufft=10.9.0.58=0 + - libcufile=1.11.1.6=0 + - libcurand=10.3.7.77=0 + - libcusolver=11.4.1.48=0 + - libcusparse=11.7.5.86=0 + - libdeflate=1.23=h4ddbbb0_0 + - libdrm=2.4.124=hb9d3cd8_0 + - libegl=1.7.0=ha4b6fd6_2 + - libexpat=2.6.4=h5888daf_0 + - libffi=3.4.2=h7f98852_5 + - libgcc=14.2.0=h77fa898_1 + - libgcc-ng=14.2.0=h69a702a_1 + - libgettextpo=0.22.5=he02047a_3 + - libgettextpo-devel=0.22.5=he02047a_3 + - libgfortran=14.2.0=h69a702a_1 + - libgfortran5=14.2.0=hd5240d6_1 + - libgl=1.7.0=ha4b6fd6_2 + - libglvnd=1.7.0=ha4b6fd6_2 + - libglx=1.7.0=ha4b6fd6_2 + - libgomp=14.2.0=h77fa898_1 + - libiconv=1.17=hd590300_2 + - 
libidn2=2.3.7=hd590300_0 + - libjpeg-turbo=3.0.0=hd590300_1 + - liblapack=3.9.0=16_linux64_mkl + - liblzma=5.6.3=hb9d3cd8_1 + - liblzma-devel=5.6.3=hb9d3cd8_1 + - libnpp=11.8.0.86=0 + - libnsl=2.0.1=hd590300_0 + - libnvjpeg=11.9.0.86=0 + - libopenblas=0.3.28=pthreads_h94d23a6_1 + - libpciaccess=0.18=hd590300_0 + - libpng=1.6.44=hadc24fc_0 + - libsqlite=3.47.2=hee588c1_0 + - libstdcxx=14.2.0=hc0a3c3a_1 + - libstdcxx-ng=14.2.0=h4852527_1 + - libtasn1=4.19.0=h166bdaf_0 + - libtiff=4.7.0=hd9ff511_3 + - libunistring=0.9.10=h7f98852_0 + - libuuid=2.38.1=h0b41bf4_0 + - libva=2.22.0=h8a09558_1 + - libvpx=1.13.1=h59595ed_0 + - libwebp=1.4.0=h2c329e2_0 + - libwebp-base=1.4.0=hd590300_0 + - libxcb=1.17.0=h8a09558_0 + - libxcrypt=4.4.36=hd590300_1 + - libxml2=2.13.5=h0d44e9d_1 + - libzlib=1.3.1=hb9d3cd8_2 + - llvm-openmp=15.0.7=h0cdce71_0 + - markupsafe=3.0.2=py39h9399b63_1 + - mkl=2022.1.0=hc2b9512_224 + - mpc=1.3.1=h24ddda3_1 + - mpfr=4.2.1=h90cbb55_3 + - mpmath=1.3.0=pyhd8ed1ab_1 + - ncurses=6.5=he02047a_1 + - nettle=3.9.1=h7ab15ed_0 + - networkx=3.2.1=pyhd8ed1ab_0 + - openh264=2.3.1=hcb278e6_2 + - openjpeg=2.5.3=h5fbd93e_0 + - openssl=3.4.0=hb9d3cd8_0 + - p11-kit=0.24.1=hc5aa10d_0 + - packaging=24.2=pyhd8ed1ab_2 + - pillow=11.0.0=py39h538c539_0 + - pip=24.3.1=pyh8b19718_2 + - protobuf=5.28.3=py39hf88036b_0 + - pthread-stubs=0.4=hb9d3cd8_1002 + - pycparser=2.22=pyh29332c3_1 + - pysocks=1.7.1=pyha55dd90_7 + - python=3.9.18=h0755675_1_cpython + - python_abi=3.9=5_cp39 + - pytorch=2.5.1=py3.9_cuda11.8_cudnn9.1.0_0 + - pytorch-cuda=11.8=h7e8668a_6 + - pytorch-model-summary=0.1.1=py_0 + - pytorch-mutex=1.0=cuda + - pyyaml=6.0.2=py39h8cd3c5a_1 + - readline=8.2=h8228510_1 + - requests=2.32.3=pyhd8ed1ab_1 + - setuptools=75.6.0=pyhff2d567_1 + - svt-av1=1.4.1=hcb278e6_0 + - tensorboardx=2.6.2.2=pyhd8ed1ab_1 + - tk=8.6.13=noxft_h4845f30_101 + - torchaudio=2.5.1=py39_cu118 + - torchtriton=3.1.0=py39 + - torchvision=0.20.1=py39_cu118 + - tqdm=4.67.1=pyhd8ed1ab_0 + - typing_extensions=4.12.2=pyha770c72_1 + - urllib3=2.2.3=pyhd8ed1ab_1 + - wayland=1.23.1=h3e06ad9_0 + - wayland-protocols=1.37=hd8ed1ab_0 + - wheel=0.45.1=pyhd8ed1ab_1 + - x264=1!164.3095=h166bdaf_2 + - x265=3.5=h924138e_3 + - xorg-libx11=1.8.10=h4f16b4b_1 + - xorg-libxau=1.0.12=hb9d3cd8_0 + - xorg-libxdmcp=1.1.5=hb9d3cd8_0 + - xorg-libxext=1.3.6=hb9d3cd8_0 + - xorg-libxfixes=6.0.1=hb9d3cd8_0 + - xz=5.6.3=hbcc6ac9_1 + - xz-gpl-tools=5.6.3=hbcc6ac9_1 + - xz-tools=5.6.3=hb9d3cd8_1 + - yacs=0.1.8=pyhd8ed1ab_1 + - yaml=0.2.5=h7f98852_2 + - zstandard=0.23.0=py39h08a7858_1 + - zstd=1.5.6=ha6fb4c9_0 + - pip: + - bitarray==3.0.0 + - cirq-core==1.3.0 + - contourpy==1.3.0 + - cycler==0.12.1 + - deprecation==2.1.0 + - duet==0.2.9 + - einops==0.8.0 + - fonttools==4.55.3 + - fsspec==2024.12.0 + - h5py==3.12.1 + - importlib-resources==6.4.5 + - kiwisolver==1.4.7 + - matplotlib==3.9.4 + - numpy==1.26.4 + - openfermion==1.6.1 + - pandas==2.2.3 + - pubchempy==1.0.4 + - pyparsing==3.2.0 + - pyscf==2.7.0 + - python-dateutil==2.9.0.post0 + - pytz==2024.2 + - scipy==1.13.1 + - six==1.17.0 + - sortedcontainers==2.4.0 + - sympy==1.13.1 + - tangelo-gc==0.4.3 + - tzdata==2024.2 + - yet-another-retnet==0.5.1 + - zipp==3.21.0 +prefix: /opt/conda/envs/nqs-qchem diff --git a/examples/neural_quantum_states/run.sh b/examples/neural_quantum_states/run.sh new file mode 100755 index 0000000..1153497 --- /dev/null +++ b/examples/neural_quantum_states/run.sh @@ -0,0 +1,16 @@ +#!/bin/bash + +MOLECULE_LIST=('H2O' 'N2') +BASIS_LIST=('sto-3g') +for MOLECULE in "${MOLECULE_LIST[@]}" +do + 
for BASIS in "${BASIS_LIST[@]}" + do + LWS_VAR=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l) + args=( + --config_file './exp_configs/nqs.yaml' DDP.WORLD_SIZE $LWS_VAR DDP.NODE_IDX 0 DDP.LOCAL_WORLD_SIZE $LWS_VAR + MODEL.MODEL_NAME 'retnet' DATA.MOLECULE "${MOLECULE}" DATA.BASIS "${BASIS}" + ) + CUDA_VISIBLE_DEVICES=$(seq 0 $((LWS_VAR - 1)) | tr '\n' ',') python -m main "${args[@]}" + done +done diff --git a/examples/neural_quantum_states/src/__init__.py b/examples/neural_quantum_states/src/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/examples/neural_quantum_states/src/__pycache__/__init__.cpython-312.pyc b/examples/neural_quantum_states/src/__pycache__/__init__.cpython-312.pyc new file mode 100644 index 0000000..8fd0082 Binary files /dev/null and b/examples/neural_quantum_states/src/__pycache__/__init__.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/__pycache__/__init__.cpython-39.pyc b/examples/neural_quantum_states/src/__pycache__/__init__.cpython-39.pyc new file mode 100644 index 0000000..6b7de0f Binary files /dev/null and b/examples/neural_quantum_states/src/__pycache__/__init__.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/__pycache__/complex.cpython-312.pyc b/examples/neural_quantum_states/src/__pycache__/complex.cpython-312.pyc new file mode 100644 index 0000000..cf41a8a Binary files /dev/null and b/examples/neural_quantum_states/src/__pycache__/complex.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/__pycache__/complex.cpython-39.pyc b/examples/neural_quantum_states/src/__pycache__/complex.cpython-39.pyc new file mode 100644 index 0000000..01ff613 Binary files /dev/null and b/examples/neural_quantum_states/src/__pycache__/complex.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/__pycache__/data_loader.cpython-312.pyc b/examples/neural_quantum_states/src/__pycache__/data_loader.cpython-312.pyc new file mode 100644 index 0000000..273022e Binary files /dev/null and b/examples/neural_quantum_states/src/__pycache__/data_loader.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/__pycache__/data_loader.cpython-39.pyc b/examples/neural_quantum_states/src/__pycache__/data_loader.cpython-39.pyc new file mode 100644 index 0000000..5508fc6 Binary files /dev/null and b/examples/neural_quantum_states/src/__pycache__/data_loader.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/__pycache__/evaluate.cpython-39.pyc b/examples/neural_quantum_states/src/__pycache__/evaluate.cpython-39.pyc new file mode 100644 index 0000000..4cde55a Binary files /dev/null and b/examples/neural_quantum_states/src/__pycache__/evaluate.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/__pycache__/helper.cpython-312.pyc b/examples/neural_quantum_states/src/__pycache__/helper.cpython-312.pyc new file mode 100644 index 0000000..4ad21bb Binary files /dev/null and b/examples/neural_quantum_states/src/__pycache__/helper.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/__pycache__/helper.cpython-39.pyc b/examples/neural_quantum_states/src/__pycache__/helper.cpython-39.pyc new file mode 100644 index 0000000..ce19ce8 Binary files /dev/null and b/examples/neural_quantum_states/src/__pycache__/helper.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/__pycache__/optimizer.cpython-312.pyc b/examples/neural_quantum_states/src/__pycache__/optimizer.cpython-312.pyc new file mode 100644 index 0000000..125564f 
Binary files /dev/null and b/examples/neural_quantum_states/src/__pycache__/optimizer.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/__pycache__/optimizer.cpython-39.pyc b/examples/neural_quantum_states/src/__pycache__/optimizer.cpython-39.pyc new file mode 100644 index 0000000..3a078c0 Binary files /dev/null and b/examples/neural_quantum_states/src/__pycache__/optimizer.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/__pycache__/scheduler.cpython-312.pyc b/examples/neural_quantum_states/src/__pycache__/scheduler.cpython-312.pyc new file mode 100644 index 0000000..89d6aa1 Binary files /dev/null and b/examples/neural_quantum_states/src/__pycache__/scheduler.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/__pycache__/scheduler.cpython-39.pyc b/examples/neural_quantum_states/src/__pycache__/scheduler.cpython-39.pyc new file mode 100644 index 0000000..febb6fa Binary files /dev/null and b/examples/neural_quantum_states/src/__pycache__/scheduler.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/__pycache__/train.cpython-39.pyc b/examples/neural_quantum_states/src/__pycache__/train.cpython-39.pyc new file mode 100644 index 0000000..fdc0b6d Binary files /dev/null and b/examples/neural_quantum_states/src/__pycache__/train.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/__pycache__/util.cpython-312.pyc b/examples/neural_quantum_states/src/__pycache__/util.cpython-312.pyc new file mode 100644 index 0000000..90a9bb6 Binary files /dev/null and b/examples/neural_quantum_states/src/__pycache__/util.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/__pycache__/util.cpython-39.pyc b/examples/neural_quantum_states/src/__pycache__/util.cpython-39.pyc new file mode 100644 index 0000000..8016643 Binary files /dev/null and b/examples/neural_quantum_states/src/__pycache__/util.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/complex.py b/examples/neural_quantum_states/src/complex.py new file mode 100644 index 0000000..1d924a6 --- /dev/null +++ b/examples/neural_quantum_states/src/complex.py @@ -0,0 +1,35 @@ +import numpy as np +import torch +from typing import Union + +''' +A collection of simple functions for performing complex number arithmetic on batches of doubled-up real tensors +''' +# Simplify the functions by using torch's built-in indexing for the last dimension +def real(x: Union[torch.Tensor, np.ndarray]) -> Union[torch.Tensor, np.ndarray]: + return x[..., 0] + +def imag(x: Union[torch.Tensor, np.ndarray]) -> Union[torch.Tensor, np.ndarray]: + return x[..., 1] + +def conjugate(x: Union[torch.Tensor, np.ndarray]) -> Union[torch.Tensor, np.ndarray]: + return torch.stack([real(x), -imag(x)], dim=-1) + +def norm_square(x: Union[torch.Tensor, np.ndarray]) -> Union[torch.Tensor, np.ndarray]: + return real(x)**2 + imag(x)**2 + +def exp(x: Union[torch.Tensor, np.ndarray]) -> Union[torch.Tensor, np.ndarray]: + amp, phase = real(x).exp(), imag(x) + return torch.stack([amp * phase.cos(), amp * phase.sin()], dim=-1) + +# Simplify the code by making sure that y is a torch tensor before converting it to x's type +def scalar_mult(x: Union[torch.Tensor, np.ndarray], y: Union[torch.Tensor, np.ndarray]) -> torch.Tensor: + if isinstance(x, np.ndarray): + x = torch.from_numpy(x) + if isinstance(y, np.ndarray): + y = torch.from_numpy(y) + y = y.to(x.device) + re = real(x) * real(y) - imag(x) * imag(y) + im = real(x) * imag(y) + imag(x) * real(y) + return torch.stack([re, 
im], dim=-1) + #return torch.stack([re, im], dim=-1) if torch.is_tensor(x) else np.stack([re, im], axis=-1) diff --git a/examples/neural_quantum_states/src/data/__pycache__/data_loader.cpython-39.pyc b/examples/neural_quantum_states/src/data/__pycache__/data_loader.cpython-39.pyc new file mode 100644 index 0000000..cdc7064 Binary files /dev/null and b/examples/neural_quantum_states/src/data/__pycache__/data_loader.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/data/__pycache__/read_hamiltonian.cpython-312.pyc b/examples/neural_quantum_states/src/data/__pycache__/read_hamiltonian.cpython-312.pyc new file mode 100644 index 0000000..df68d5c Binary files /dev/null and b/examples/neural_quantum_states/src/data/__pycache__/read_hamiltonian.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/data/__pycache__/read_hamiltonian.cpython-39.pyc b/examples/neural_quantum_states/src/data/__pycache__/read_hamiltonian.cpython-39.pyc new file mode 100644 index 0000000..21b4d1e Binary files /dev/null and b/examples/neural_quantum_states/src/data/__pycache__/read_hamiltonian.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/data/data_loader.py b/examples/neural_quantum_states/src/data/data_loader.py new file mode 100644 index 0000000..ff98016 --- /dev/null +++ b/examples/neural_quantum_states/src/data/data_loader.py @@ -0,0 +1,43 @@ +import os +import numpy as np +import argparse +from src.data.read_hamiltonian import load_molecule +from tangelo.toolboxes.operators import count_qubits +from tangelo import Molecule + +def load_hamiltonian_string(molecule_name: str, molecule_cid: int, basis: str, global_rank: int, world_size: int) -> [Molecule, int, str]: + """ + Load the Hamiltonian string for a given molecule and basis set + Args: + molecule_name: name of molecule for Hamiltonian construction + molecule_cid: Pubchem CID for the molecule (only needed/used if name not recognized) + basis: name of basis set + global_rank: identifying index among all GPUs + world_size: number of GPUs + Returns: + molecule: a Tangelo Molecule object + num_sites: number of qubits in Hamiltonian + string: qubit Hamiltonian as a string + """ + molecule, qubit_hamiltonian = load_molecule(molecule_name, molecule_cid, basis, global_rank, world_size) + num_sites = count_qubits(qubit_hamiltonian) + string = f"{qubit_hamiltonian}" + return molecule, num_sites, string + + +def load_data(cfg: argparse.Namespace, global_rank: int, world_size: int) -> dict: + """ + Load the data according to the given configuration (cfg). 
+ Args: + cfg: config flags + global_rank: indentifying index among all GPUs + world_size: number of GPUs + Returns: + data: a dictionary containing the hamiltonian string, qubit/spin-orbital number, and molecule object + """ + molecule_name = cfg.DATA.MOLECULE + molecule_cid = cfg.DATA.MOLECULE_CID + basis = cfg.DATA.BASIS + molecule, num_sites, string = load_hamiltonian_string(molecule_name, molecule_cid, basis, global_rank, world_size) + data = {'hamiltonian_string': string, 'num_sites': num_sites, 'molecule': molecule} + return data diff --git a/examples/neural_quantum_states/src/data/pubchem_IDs/ID_data.bak b/examples/neural_quantum_states/src/data/pubchem_IDs/ID_data.bak new file mode 100644 index 0000000..6ed0ba0 --- /dev/null +++ b/examples/neural_quantum_states/src/data/pubchem_IDs/ID_data.bak @@ -0,0 +1,32 @@ +'H2', (0, 6) +'LiH', (512, 6) +'H2O', (1024, 6) +'CH2', (1536, 8) +'BeH2', (2048, 8) +'NH3', (2560, 5) +'CH4', (3072, 6) +'C2', (3584, 8) +'F2', (4096, 6) +'N2', (4608, 6) +'O2', (5120, 6) +'LiF', (5632, 8) +'HCl', (6144, 6) +'H2S', (6656, 6) +'CH2O', (7168, 6) +'PH3', (7680, 6) +'LiCl', (8192, 8) +'CH4O', (8704, 6) +'Li2O', (9216, 8) +'C2H4O', (9728, 6) +'C3H6', (10240, 8) +'C2H4O2', (10752, 5) +'H2SO4', (11264, 6) +'CNa2O3', (11776, 6) +'CO2', (12288, 6) +'Aspirin', (12800, 6) +'VitaminC', (13312, 8) +'Li2', (13824, 8) +'Na2', (14336, 8) +'C2HF3O2', (14848, 6) +'Cr2', (15360, 8) +'null', (15872, 5) diff --git a/examples/neural_quantum_states/src/data/pubchem_IDs/ID_data.dat b/examples/neural_quantum_states/src/data/pubchem_IDs/ID_data.dat new file mode 100644 index 0000000..5aecc99 Binary files /dev/null and b/examples/neural_quantum_states/src/data/pubchem_IDs/ID_data.dat differ diff --git a/examples/neural_quantum_states/src/data/pubchem_IDs/ID_data.dir b/examples/neural_quantum_states/src/data/pubchem_IDs/ID_data.dir new file mode 100644 index 0000000..6ed0ba0 --- /dev/null +++ b/examples/neural_quantum_states/src/data/pubchem_IDs/ID_data.dir @@ -0,0 +1,32 @@ +'H2', (0, 6) +'LiH', (512, 6) +'H2O', (1024, 6) +'CH2', (1536, 8) +'BeH2', (2048, 8) +'NH3', (2560, 5) +'CH4', (3072, 6) +'C2', (3584, 8) +'F2', (4096, 6) +'N2', (4608, 6) +'O2', (5120, 6) +'LiF', (5632, 8) +'HCl', (6144, 6) +'H2S', (6656, 6) +'CH2O', (7168, 6) +'PH3', (7680, 6) +'LiCl', (8192, 8) +'CH4O', (8704, 6) +'Li2O', (9216, 8) +'C2H4O', (9728, 6) +'C3H6', (10240, 8) +'C2H4O2', (10752, 5) +'H2SO4', (11264, 6) +'CNa2O3', (11776, 6) +'CO2', (12288, 6) +'Aspirin', (12800, 6) +'VitaminC', (13312, 8) +'Li2', (13824, 8) +'Na2', (14336, 8) +'C2HF3O2', (14848, 6) +'Cr2', (15360, 8) +'null', (15872, 5) diff --git a/examples/neural_quantum_states/src/data/read_hamiltonian.py b/examples/neural_quantum_states/src/data/read_hamiltonian.py new file mode 100644 index 0000000..3a23433 --- /dev/null +++ b/examples/neural_quantum_states/src/data/read_hamiltonian.py @@ -0,0 +1,93 @@ +import os +import time +import logging +import pickle +import pubchempy +import shelve +import torch +import torch.distributed as dist + +from tangelo import Molecule, SecondQuantizedMolecule +from tangelo.toolboxes.qubit_mappings.mapping_transform import fermion_to_qubit_mapping +from tangelo.toolboxes.operators import QubitOperator + +def get_molecule_geometry(name: str, molecule_cid: int) -> list: + ''' + Retrieves molecular geometry from PubChem + Args: + name: name of molecule + molecule_cid: Pubchem CID of molecule (only used if name not recognized) + Returns: + geometry: a list of tuples describing the molecule's atomic nuclei and their 
spatial coordinates + ''' + # PubChem IDs are stored in a shelf file - ID shelf is not comprehensive and may require updating (currently there is no automated process to add new molecule IDs) + with shelve.open('./src/data/pubchem_IDs/ID_data') as MOLECULE_CID: + if name in MOLECULE_CID.keys(): + id = MOLECULE_CID[name] + else: + # if name is not recognized, then program associates it with provided CID + id = molecule_cid + MOLECULE_CID[name] = id + geometry_flat = False + pubchempy_molecule = pubchempy.get_compounds(id, 'cid', record_type='3d') + if len(pubchempy_molecule) == 0: + pubchempy_molecule = pubchempy.get_compounds(id, 'cid', record_type='2d') + geometry_flat = True + pubchempy_geometry = pubchempy_molecule[0].to_dict(properties=['atoms'])['atoms'] + if name == 'Cr2': + pubchempy_geometry[0]['x'] = 0 + pubchempy_geometry[1]['x'] = 1.6788 + elif name == 'H2': + pubchempy_geometry[0]['x'] = 0 + pubchempy_geometry[1]['x'] = 0.74 + elif name == 'LiH': + pass + pubchempy_geometry[0]['x'] = 0 + pubchempy_geometry[1]['x'] = 1.5949 + if not geometry_flat: + geometry = [(atom['element'], (atom['x'], atom['y'], atom['z'])) for atom in pubchempy_geometry] + else: + geometry = [(atom['element'], (atom['x'], atom['y'], atom.get('z', 0))) for atom in pubchempy_geometry] + return geometry + +def load_molecule(molecule_name: str, molecule_cid: int, basis: str, global_rank: int, world_size: int) -> [Molecule, QubitOperator]: + ''' + Retrieves second quantized molecule Hamiltonian for a given molecule and basis set + Args: + molecule_name: name of molecule + molecule_cid: Pubchem CID of molecule (only used if name not recognized) + basis: Basis set of molecule orbitals + global_rank: identifying index among all GPUs + world_size: number of GPUs + Returns: + molecule: Tangelo SecondQuantizedMolecule object + qubit_hamiltonian: Second quantized molecular Hamiltonian as Tangelo QubitOperator object + ''' + mult = 1 if molecule_name not in ["O2", "CH2"] else 3 #spin mulitiplicity + if global_rank == 0: + geometry = get_molecule_geometry(molecule_name, molecule_cid) + num_atoms = [len(geometry)] + else: + num_atoms = [1] + if world_size > 1: + dist.broadcast_object_list(num_atoms,src=0) + if global_rank > 0: + geometry = [[] for _ in range(num_atoms[0])] + if world_size > 1: + dist.broadcast_object_list(geometry, src=0) + + #Solve molecule and print results + logging.info("Creating Tangelo SecondQuantizedMolecule Object...") + t_start=time.time() + molecule = SecondQuantizedMolecule(geometry, q=0, spin=mult-1, basis=basis, frozen_orbitals=None) #RHF/ROHF used by default + logging.info(f'{molecule_name} has:') + logging.info(f'\t\t\tbasis {basis}') + logging.info(f'\t\t\tgeometry {geometry},') + logging.info(f'\t\t\t{molecule.n_active_electrons} electrons in {molecule.n_active_mos} spin-orbitals,') + logging.info(f'\t\t\tHartree-Fock energy of {molecule.mf_energy:.6f} Hartree,') + logging.info("done in {:.2f} seconds".format(time.time()-t_start)) + # 3. Convert molecular Hamiltonian to qubit Hamiltonian. + logging.info("Obtain Qubit Hamiltonian... 
at {:.2f} seconds".format(time.time()-t_start)) + qubit_hamiltonian = fermion_to_qubit_mapping(molecule.fermionic_hamiltonian, "JW") + logging.info("done in {:.2f} seconds".format(time.time()-t_start)) + return molecule, qubit_hamiltonian diff --git a/examples/neural_quantum_states/src/models/__init__.py b/examples/neural_quantum_states/src/models/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/examples/neural_quantum_states/src/models/__pycache__/__init__.cpython-312.pyc b/examples/neural_quantum_states/src/models/__pycache__/__init__.cpython-312.pyc new file mode 100644 index 0000000..5701d1d Binary files /dev/null and b/examples/neural_quantum_states/src/models/__pycache__/__init__.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/models/__pycache__/__init__.cpython-39.pyc b/examples/neural_quantum_states/src/models/__pycache__/__init__.cpython-39.pyc new file mode 100644 index 0000000..df83775 Binary files /dev/null and b/examples/neural_quantum_states/src/models/__pycache__/__init__.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/models/__pycache__/base.cpython-312.pyc b/examples/neural_quantum_states/src/models/__pycache__/base.cpython-312.pyc new file mode 100644 index 0000000..c9dc434 Binary files /dev/null and b/examples/neural_quantum_states/src/models/__pycache__/base.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/models/__pycache__/base.cpython-39.pyc b/examples/neural_quantum_states/src/models/__pycache__/base.cpython-39.pyc new file mode 100644 index 0000000..367b6e0 Binary files /dev/null and b/examples/neural_quantum_states/src/models/__pycache__/base.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/models/__pycache__/made.cpython-312.pyc b/examples/neural_quantum_states/src/models/__pycache__/made.cpython-312.pyc new file mode 100644 index 0000000..9e13676 Binary files /dev/null and b/examples/neural_quantum_states/src/models/__pycache__/made.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/models/__pycache__/made.cpython-39.pyc b/examples/neural_quantum_states/src/models/__pycache__/made.cpython-39.pyc new file mode 100644 index 0000000..b85abe6 Binary files /dev/null and b/examples/neural_quantum_states/src/models/__pycache__/made.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/models/__pycache__/retnet.cpython-39.pyc b/examples/neural_quantum_states/src/models/__pycache__/retnet.cpython-39.pyc new file mode 100644 index 0000000..f05f953 Binary files /dev/null and b/examples/neural_quantum_states/src/models/__pycache__/retnet.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/models/__pycache__/retnet2.cpython-39.pyc b/examples/neural_quantum_states/src/models/__pycache__/retnet2.cpython-39.pyc new file mode 100644 index 0000000..8931fa4 Binary files /dev/null and b/examples/neural_quantum_states/src/models/__pycache__/retnet2.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/models/__pycache__/retnet_old.cpython-312.pyc b/examples/neural_quantum_states/src/models/__pycache__/retnet_old.cpython-312.pyc new file mode 100644 index 0000000..8da1fd4 Binary files /dev/null and b/examples/neural_quantum_states/src/models/__pycache__/retnet_old.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/models/__pycache__/retnet_old.cpython-39.pyc b/examples/neural_quantum_states/src/models/__pycache__/retnet_old.cpython-39.pyc new file mode 100644 index 0000000..91b5322 Binary 
files /dev/null and b/examples/neural_quantum_states/src/models/__pycache__/retnet_old.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/models/__pycache__/transformer.cpython-39.pyc b/examples/neural_quantum_states/src/models/__pycache__/transformer.cpython-39.pyc new file mode 100644 index 0000000..9e1a84a Binary files /dev/null and b/examples/neural_quantum_states/src/models/__pycache__/transformer.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/models/__pycache__/util.cpython-312.pyc b/examples/neural_quantum_states/src/models/__pycache__/util.cpython-312.pyc new file mode 100644 index 0000000..f91ac48 Binary files /dev/null and b/examples/neural_quantum_states/src/models/__pycache__/util.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/models/__pycache__/util.cpython-39.pyc b/examples/neural_quantum_states/src/models/__pycache__/util.cpython-39.pyc new file mode 100644 index 0000000..9a92af6 Binary files /dev/null and b/examples/neural_quantum_states/src/models/__pycache__/util.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/models/base.py b/examples/neural_quantum_states/src/models/base.py new file mode 100644 index 0000000..327b3bb --- /dev/null +++ b/examples/neural_quantum_states/src/models/base.py @@ -0,0 +1,164 @@ +import numpy as np +import torch +import torch.nn as nn +import re +import logging +from pytorch_model_summary import summary + +class Base(nn.Module): + ''' + Base template for all autoregressive NQS ansatze. + Args: + num_sites: qubit number + num_spin_up: number of spin up electrons + num_spin_down: number of spin down electrons + device: Device (CPU or Cuda) to store model + **kwargs: nonspecific kwargs + ''' + def __init__(self, num_sites: int, num_spin_up: int, num_spin_down: int, device: str, **kwargs): + super().__init__() + self.num_sites = num_sites + self.num_spin_up = num_spin_up + self.num_spin_down = num_spin_down + self.device = device + + self.sampling = False + self.inference = False + + def forward(self, configurations: torch.Tensor, **kwargs): + ''' + Performs a forward pass through the ansatz for either training, sampling, or inference + Args: + - configurations: a batch of qubit spin configurations for either the whole system or individual orbitals + - **kwargs: for sampling, some models require kwargs to perform forward passes + Returns: + - prob_cond: if self.sampling, returns conditional probabilities corresponding with configurations, possibly additional internal info + - log_psi: if not self.sampling, returns log of stace vector entries corresponding with configurations + ''' + pass + + def sample(self, num_uniqs: int, num_samples: int) -> [torch.Tensor, torch.Tensor]: + ''' + Samples from model state distribution + Args: + - num_uniqs: maximum number of unique samples to return after sampling completes + - num_samples: total number of nonunique samples obtained + Returns: + - uniq_samples: batch of unique samples + - uniq_count: number of times (out of num_samples) that each unique sample was obtained + ''' + pass + + def save(self, model_save_path: str): + ''' + Saves current model weights as .pth file + Args: + - model_save_path: path to which the model is saved + Returns: + - None + ''' + logging.info("[*] Save model to {}...".format(model_save_path)) + torch.save(self.state_dict(), model_save_path) + return model_save_path + + def load(self, model_load_path: str, strict: bool = True) -> torch.nn.Module: + """ + Load pretrained model parameters from a 
given path. + Args: + - model_load_path: The path of the saved model file. + - strict: Whether to strictly enforce that the keys in state_dict match the keys returned by model.state_dict(). + Returns: + - model: The updated PyTorch model. + """ + # Load the saved model from the file path + bad_state_dict = torch.load(model_load_path, map_location='cpu') + # Rename the keys in state_dict from 'module.' to '' (for loading into a non-DistributedDataParallel model) + correct_state_dict = {re.sub(r'^module\.', '', k): v for k, v in bad_state_dict.items()} + # If strict is False, only load the parameters that have the same shape as the corresponding parameters in the model + if not strict: + logging.info(f"Loading {len(correct_state_dict)} params") + own_state = self.state_dict() + final_state_dict = {} + for name, param in correct_state_dict.items(): + if name not in own_state: + continue + param = param.data + own_param = own_state[name].data + if own_param.shape == param.shape: + final_state_dict[name] = param + correct_state_dict = final_state_dict + logging.info(f"Loaded {len(correct_state_dict)} params") + # Load the state_dict into the model + self.load_state_dict(correct_state_dict, strict=strict) + self.eval() + self.zero_grad() + + def state2shell(self, states: torch.Tensor) -> torch.Tensor: + ''' + Useful function for converting qubit spin configurations to statevector indices: i.e. converts [|0,0>, |1,0>, |0,1>, |1,1>] to [0, 1, 2, 3] + Args: + states: qubit spin configurations + Returns: + shell: statevector indices + ''' + bs = states.shape[0] + shell_size = 2 + shell = (states.view(bs, -1, shell_size).clamp_min(0) * torch.Tensor([1.0, 2.0]).to(states.device)).sum(-1) + return shell.type(torch.int64) + + def multinomial_arr(self, count: np.ndarray, p: np.ndarray) -> np.ndarray: + ''' + Samples from binomial distribution of 'count' trials with success probability 'p' (as a batch) + Args: + - count: batch of count values + - p: batch of probabilities + Returns: + - out: batch of binomial distribution samples (as histograms) + ''' + # Copy the count array to avoid modifying it in place + count = np.copy(count) + out = np.zeros_like(p, dtype=int) + # Compute the cumulative sums of the probabilities + ps = np.cumsum(p, axis=-1) + # Avoid division by zero and NaNs by setting the probabilities to zero where the cumulative sum is zero + condp = np.divide(p, ps, out=np.zeros_like(p), where=ps != 0) + # Iterate over the columns of p in reverse order + for i in range(p.shape[-1] - 1, 0, -1): + # Sample from a binomial distribution using the conditional probabilities + binsample = np.random.binomial(count, condp[..., i]) + # Update the output array and the count array + out[..., i] = binsample + count -= binsample + # Assign the remaining count to the first column of the output array + out[..., 0] = count + return out + + +def get_model(model_name: str, device: str, print_model_info: bool, **kwargs) -> torch.nn.Module: + """ + Get a PyTorch model based on the given model name and arguments. + Args: + - model_name: Name of the model to be loaded. + - device: Device to be used for model inference and training. + - print_model_info: Whether to print the model summary. + - **kwargs: Other arguments specific to the model. + Returns: + - model: The loaded PyTorch model. 
+ """ + if model_name == 'made': + from .made import MADE + model = MADE(**kwargs) + elif model_name == 'transformer': + from .transformer import NNQSTransformer + model = NNQSTransformer(**kwargs) + elif model_name == 'retnet': + from .retnet import NNQSRetNet + model = NNQSRetNet(**kwargs) + else: + raise ValueError(f"Unknown model_name: {model_name}") + if print_model_info: + print(summary(model, torch.zeros(10, model.num_sites), show_input=False)) + model.eval() + model.device=device + return model.to(device) + diff --git a/examples/neural_quantum_states/src/models/made.py b/examples/neural_quantum_states/src/models/made.py new file mode 100644 index 0000000..5628533 --- /dev/null +++ b/examples/neural_quantum_states/src/models/made.py @@ -0,0 +1,171 @@ +import numpy as np +import torch +import torch.nn as nn +import torch.nn.functional as F + +from .base import Base + +class MADE(Base): + ''' + Class implements MADE-based NQS ansatz + Child class specific args: + made_width: width of modulus and phase network hidden layers + made_depth: number of hidden layers in modulus and phase networks + temperature: Temperature variable for modulus softmax + **kwargs: nonspecific kwargs + ''' + def __init__(self, num_sites: int, num_spin_up: int, num_spin_down: int, made_width: int=64, made_depth: int=2, temperature: float=1.0, device: str=None, **kwargs): + super(MADE, self).__init__(num_sites, num_spin_up, num_spin_down, device) + self.temperature = temperature + # construct model + self.net = [] + self.num_in, self.num_out = num_sites, num_sites*2 + self.hidden_sizes = [made_width] * made_depth + layer_sizes = [self.num_in] + self.hidden_sizes + [self.num_out] + for h0,h1 in zip(layer_sizes, layer_sizes[1:]): + self.net.extend([ + nn.Linear(in_features=h0, out_features=h1, bias=True), + nn.ReLU(), + ]) + self.net.pop() # pop the last ReLU for the output layer + self.net = nn.Sequential(*self.net) # Ansatz amplitudes are modeled by a masked MLP + self.log_softmax = nn.LogSoftmax(dim=-1) + self.net_phase = [nn.Linear(in_features=num_sites-2, out_features=made_width, bias=True)] + for i in range(made_depth): + self.net_phase += [nn.ReLU(), nn.Linear(in_features=made_width, out_features=made_width, bias=True)] + self.net_phase += [nn.ReLU(), nn.Linear(in_features=made_width, out_features=4, bias=True)] + self.net_phase = nn.Sequential(*self.net_phase) # Ansatz phases modleed by separate MLP + # create masks + self.get_masks() # builds the initial self.m connectivity + self.set_masks() + + def get_masks(self): + ''' + Builds masks to preserve autoregressive property of amplitude MLP + ''' + self.m = {} + self.seed = 0 # for cycling through num_masks orderings + L = len(self.hidden_sizes) + # fetch the next seed and construct a random stream + rng = np.random.RandomState(self.seed) + # sample the reverse order of the inputs and the connectivity of all neurons + self.input_order = np.stack([np.arange(self.num_sites-2,-1,-2), np.arange(self.num_sites-1,-1,-2)],1).reshape(-1) # [4,5,2,3,0,1] + self.shell_order = torch.arange(self.num_sites//2-1, -1, -1) # [2,1,0] + self.m[-1] = self.input_order // 2 # [2,2,1,1,0,0] + for l in range(L): + self.m[l] = rng.randint(self.m[l-1].min(), self.num_sites//2-1, size=self.hidden_sizes[l]) + # construct the mask matrices + masks = [self.m[l-1][:,None] <= self.m[l][None,:] for l in range(L)] + masks.append(self.m[L-1][:,None] < np.repeat(self.m[-1], 2)[None,:]) + for i,_ in enumerate(masks): + masks[i] = torch.tensor(np.array(masks[i])) + self.masks = masks + + def 
set_masks(self): + ''' + Applies masks to current network (must be done every time weights are updated) + ''' + self.device = self.net[0].weight.device + layers = [layer for layer in self.net.modules() if isinstance(layer, nn.Linear)] + assert len(layers) == len(self.masks) + for layer,mask in zip(layers, self.masks): + layer.weight.data = layer.weight.data.mul_(mask.to(self.device).T) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + ''' + Performs model forward pass, returning either conditional probabilities or statevector entries corresponding with x + ''' + # x: [bs, num_sites] + self.set_masks() + logits_cls = self.net(x).reshape(-1, self.num_sites//2, 4) # bs, num_sites/2, 4 + if (self.num_spin_up + self.num_spin_down) >= 0: + logits_cls = self.apply_constraint(x, logits_cls) # Apply particle/spin number symmetries + log_psi_cond = 0.5 * self.log_softmax(logits_cls/self.temperature) + if self.sampling: + prob_cond = (2 * log_psi_cond).exp() + return prob_cond + else: + idx = self.state2shell(x) + log_psi_real = log_psi_cond.gather(-1, idx.unsqueeze(-1)).sum(-1).sum(-1) + log_psi_imag = self.net_phase(x[:, :-2]).gather(-1, idx[:, -1].unsqueeze(-1)).squeeze() + if log_psi_real.shape[0] == 1: + log_psi_imag = log_psi_imag.reshape(log_psi_real.shape) + log_psi = torch.stack((log_psi_real, log_psi_imag), dim=-1) + return log_psi + + def apply_constraint(self, inp: torch.Tensor, log_psi_cond: torch.Tensor) -> torch.Tensor: + ''' + Applies particple number and spin symmetries to ansatz state + Args: + inp: qubit spin configurations + log_psi_cond: corresponding conditional log-probabilities + Returns: + log_psi_cond: masked conditional log-probabilities, preserving symmetries + ''' + device = inp.device + N = inp.shape[-1] // 2 + inp_up = inp[:, self.input_order][:, ::2].clone() + inp_down = inp[:, self.input_order][:, 1::2].clone() + inp_cumsum_up = torch.cat((torch.zeros((inp_up.shape[0],1)).to(device), ((1 + inp_up)/2).cumsum(-1)[:, :-1]), axis=-1) + inp_cumsum_down = torch.cat((torch.zeros((inp_down.shape[0],1)).to(device), ((1 + inp_down)/2).cumsum(-1)[:, :-1]), axis=-1) + upper_bound_up = self.num_spin_up + lower_bound_up = (self.num_spin_up - (N - torch.arange(1, N+1))) + condition1_up = (inp_cumsum_up < lower_bound_up.to(device)).float() + condition2_up = (inp_cumsum_up >= upper_bound_up).float() + upper_bound_down = self.num_spin_down + lower_bound_down = (self.num_spin_down - (N - torch.arange(1, N+1))) + condition1_down = (inp_cumsum_down < lower_bound_down.to(device)).float() + condition2_down = (inp_cumsum_down >= upper_bound_down).float() + idx = torch.sort(self.shell_order)[1] + # first entry must be down + log_psi_cond[:,:,0].masked_fill_(condition1_up[:,idx]==1, float('-inf')) + log_psi_cond[:,:,2].masked_fill_(condition1_up[:,idx]==1, float('-inf')) + # second entry must be down + log_psi_cond[:,:,0].masked_fill_(condition1_down[:,idx]==1, float('-inf')) + log_psi_cond[:,:,1].masked_fill_(condition1_down[:,idx]==1, float('-inf')) + # first entry must be up + log_psi_cond[:,:,1].masked_fill_(condition2_up[:,idx]==1, float('-inf')) + log_psi_cond[:,:,3].masked_fill_(condition2_up[:,idx]==1, float('-inf')) + # second entry must be up + log_psi_cond[:,:,2].masked_fill_(condition2_down[:,idx]==1, float('-inf')) + log_psi_cond[:,:,3].masked_fill_(condition2_down[:,idx]==1, float('-inf')) + return log_psi_cond + + @torch.no_grad() + def sample(self, bs: int, num_samples: int) -> [torch.Tensor, torch.Tensor]: + ''' + MADE-specific Monte Carlo sampler function + ''' + 
self.eval() + self.sampling = True + sample_multinomial = True + # random initialize a configuration of values +- 1 + uniq_samples = (torch.randn(1, self.num_sites).to(self.device) > 0.0).float() * 2 - 1 + uniq_count = torch.tensor([num_samples]).to(self.device) + for i in self.shell_order: + prob = self.forward(uniq_samples)[:, i] # num_uniq, 4 + num_uniq = uniq_samples.shape[0] + uniq_samples = uniq_samples.repeat(4,1) # 4*num_uniq, num_sites + # convert [|-1,-1>, |1,-1>, |-1,1>, |1,1>] to [0, 1, 2, 3] + uniq_samples[:num_uniq, 2*i] = -1 + uniq_samples[:num_uniq, 2*i+1] = -1 + uniq_samples[num_uniq:2*num_uniq, 2*i] = 1 + uniq_samples[num_uniq:2*num_uniq, 2*i+1] = -1 + uniq_samples[2*num_uniq:3*num_uniq, 2*i] = -1 + uniq_samples[2*num_uniq:3*num_uniq, 2*i+1] = 1 + uniq_samples[3*num_uniq:4*num_uniq, 2*i] = 1 + uniq_samples[3*num_uniq:4*num_uniq, 2*i+1] = 1 + if sample_multinomial: + uniq_count = torch.tensor(self.multinomial_arr(uniq_count.long().data.cpu().numpy(), prob.data.cpu().numpy())).T.flatten().to(prob.device) + else: + uniq_count = (uniq_count.unsqueeze(-1)*prob).T.flatten().round() + keep_idx = uniq_count > 0 + uniq_samples = uniq_samples[keep_idx] + uniq_count = uniq_count[keep_idx] + uniq_samples = uniq_samples[uniq_count.sort()[1][-2*bs:]] + uniq_count = uniq_count[uniq_count.sort()[1][-2*bs:]] + uniq_samples = uniq_samples[uniq_count.sort()[1][-bs:]] + uniq_count = uniq_count[uniq_count.sort()[1][-bs:]] + self.sampling = False + self.train() + return [uniq_samples, uniq_count] diff --git a/examples/neural_quantum_states/src/models/retnet.py b/examples/neural_quantum_states/src/models/retnet.py new file mode 100644 index 0000000..5009138 --- /dev/null +++ b/examples/neural_quantum_states/src/models/retnet.py @@ -0,0 +1,267 @@ +import numpy as np +import torch +import torch.nn as nn +import torch.nn.functional as F +from yet_another_retnet.retnet import RetNetDecoderLayer, RetNetDecoder + +from .base import Base + +class NNQSRetNet(Base): + def __init__(self, num_sites: int, num_spin_up: int, num_spin_down: int, made_width: int=64, made_depth: int=2, embedding_dim: int=16, nhead: int=2, dim_feedforward: int=64, num_layers: int=1, temperature: float=1.0, device: str=None, **kwargs): + ''' + Retentive network (RetNet) NQS ansatz + Child class specific args: + made_width: width of phase network hidden layers + made_depth: number of phase network hidden layers + embedding_dim: dimension of RetNet embedding + nhead: number of multi-scale retention heads + dim_feedforward: dimension of RetNet feedforward layers + num_layers: number of RetNet blocks + temperature: RetNet softmax temperature parameter + device: device on which the model is stored + ''' + super(NNQSRetNet, self).__init__(num_sites, num_spin_up, num_spin_down, device) + + # construct model + self.num_in, self.num_out = num_sites, num_sites*2 + self.temperature=temperature + self.input_order = np.stack([np.arange(self.num_sites-2,-1,-2), np.arange(self.num_sites-1,-1,-2)],1).reshape(-1) # [4,5,2,3,0,1] + self.shell_order = torch.arange(self.num_sites//2-1, -1, -1) # [2,1,0] + + decoder_layer = RetNetDecoderLayer(embedding_dim, nhead, dim_feedforward=dim_feedforward, dropout=0.0) + self.decoder = RetNetDecoder(decoder_layer, num_layers) + self.fc = nn.Linear(embedding_dim, 4) + self.tok_emb = nn.Embedding(5, embedding_dim) + self.pos_emb = nn.Embedding(len(self.shell_order), embedding_dim) + self.apply(self._init_weights) + self.log_softmax = nn.LogSoftmax(dim=-1) + self.softmax = nn.Softmax(dim=-1) + + self.net_phase = 
[nn.Linear(in_features=self.num_in-2, out_features=made_width, bias=True)] + for i in range(made_depth): + self.net_phase += [nn.ReLU(), nn.Linear(in_features=made_width, out_features=made_width, bias=True)] + self.net_phase += [nn.ReLU(), nn.Linear(in_features=made_width, out_features=4, bias=True)] + self.net_phase = nn.Sequential(*self.net_phase) + + def _init_weights(self, module: nn.Module): + if isinstance(module, nn.Linear): + torch.nn.init.normal_(module.weight, mean=0.0, std=0.02) + if module.bias is not None: + torch.nn.init.zeros_(module.bias) + elif isinstance(module, nn.Embedding): + torch.nn.init.normal_(module.weight, mean=0.0, std=0.02) + elif isinstance(module, nn.LayerNorm): + torch.nn.init.zeros_(module.bias) + torch.nn.init.ones_(module.weight) + + def forward_sample(self, x_shell: torch.Tensor, count_up: torch.Tensor, count_down: torch.Tensor, idx: int=0, prev_states: list = []) -> [torch.Tensor, list]: + ''' + Performs a forward pass on one input token for purposes of autoregressive sampling + Args: + x_shell: qubit spin configurations for most recently sampled molecular orbital (2 qubits) + count_up: spin-up occupancies for previously sampled shells + count_down: spin-down occupancies for previously sampled shells + idx: index for currently sampled orbital + prev_states: internal states for RetNets + Returns: + prob_cond: conditional probabilities for orbital 'idx' + prev_states: new internal RetNet states + ''' + # x: [bs, 1] + input = self.tok_emb(x_shell) + self.pos_emb(idx.to(self.device)) + # new x is of shape (batch_size, d_model) + output, prev_states = self.decoder.forward_recurrent(input, self.shell_order[idx], prev_states) + + logits_cls = self.fc(output) + if (self.num_spin_up + self.num_spin_down) >= 0: + logits_cls = self.apply_constraint_sample(logits_cls, idx, count_up, count_down) + prob_cond = self.softmax(logits_cls/self.temperature) + + return prob_cond, prev_states + + def forward_state(self, x: torch.Tensor) -> torch.Tensor: + ''' + Performs a forward pass of the ansatz model to produce statevector entries (recurrent RetNet if self.inference, parallel RetNet if not) + Args: + x: qubit spin configurations + Returns: + log_psi: logarithms of ansatz entries + ''' + # x: [bs, num_sites] + shells = self.state2shell(x)[:, self.shell_order] + input = 4*torch.ones(shells.shape, dtype=torch.int).to(self.device) + input[:,1:] = shells[:,:-1] + # x is of shape (batch_size, sequence_length) + pos = self.shell_order.to(self.device) + + input = self.tok_emb(input) + self.pos_emb(pos) + # new x is of shape (batch_size, sequence_length, d_model) + + if self.inference: + outputs = [] + prev_states = [] + for i in range(len(self.shell_order)): + out, prev_states = self.decoder.forward_recurrent(input[:,i].squeeze(), i, prev_states) + outputs.append(torch.unsqueeze(out, 1)) + output = torch.stack(outputs, dim=1) + else: + output = self.decoder.forward_parallel(input) + output = self.fc(output)[:, self.shell_order] + if (self.num_spin_up + self.num_spin_down) >= 0: + logits_cls = self.apply_constraint_state(x, output) + log_psi_cond = 0.5 * self.log_softmax(logits_cls/self.temperature) + idx = self.state2shell(x) + log_psi_real = log_psi_cond.gather(-1, idx.unsqueeze(-1)).sum(-1).sum(-1) + log_psi_imag = self.net_phase(x[:, :-2]).gather(-1, idx[:, -1].unsqueeze(-1)).squeeze() + if log_psi_real.shape[0] == 1: + log_psi_imag = log_psi_imag.reshape(log_psi_real.shape) + log_psi = torch.stack((log_psi_real, log_psi_imag), dim=-1) + return log_psi + + def forward(self, 
x: torch.Tensor, **kwargs): + ''' + Wrapper function that directs to either self.forward sample or self.forward_state, depending on self.sampling + ''' + if self.sampling: + return self.forward_sample(x, **kwargs) + else: + return self.forward_state(x) + + def apply_constraint_sample(self, log_psi_cond: torch.Tensor, shell_idx: torch.Tensor, sum_up: torch.Tensor, sum_down: torch.Tensor) -> torch.Tensor: + ''' + Applies particle + spin number constraints to a single molecular orbital occupancy, for use in recurrent RetNet + Args: + log_psi_cond: intermediate output of modulus network for one orbital + shell_idx: orbital index corresponding with 'log_psi_cond' + sum_up: spin-up occupancy count for previously sampled orbitals + sum_down: spin-down occupancies for previously sampled orbitals + Returns: + log_spi_cond: Properly masked RetNet outputs + ''' + condition1_up = (sum_up < self.lower_bound_up[self.shell_order[shell_idx]]).float() + condition2_up = (sum_up >= self.upper_bound_up).float() + + condition1_down = (sum_down < self.lower_bound_down[self.shell_order[shell_idx]]).float() + condition2_down = (sum_down >= self.upper_bound_down).float() + # first entry must be down + log_psi_cond[:,0].masked_fill_(condition1_up==1, float('-inf')) + log_psi_cond[:,2].masked_fill_(condition1_up==1, float('-inf')) + # second entry must be down + log_psi_cond[:,0].masked_fill_(condition1_down==1, float('-inf')) + log_psi_cond[:,1].masked_fill_(condition1_down==1, float('-inf')) + # first entry must be up + log_psi_cond[:,1].masked_fill_(condition2_up==1, float('-inf')) + log_psi_cond[:,3].masked_fill_(condition2_up==1, float('-inf')) + # second entry must be up + log_psi_cond[:,2].masked_fill_(condition2_down==1, float('-inf')) + log_psi_cond[:,3].masked_fill_(condition2_down==1, float('-inf')) + return log_psi_cond + + def apply_constraint_state(self, inp: torch.Tensor, log_psi_cond: torch.Tensor) -> torch.Tensor: + ''' + Applies particle + spin number symmetry constraints to parallel RetNet outputs + Args: + inp: qubit spin configurations + log_psi_cond: intermediate parallel RetNet outputs + Returns: + log_psi_cond: properly masked conditional log-probabilities + ''' + device = inp.device + N = inp.shape[-1] // 2 + inp_up = inp[:, self.input_order][:, ::2].clone() + inp_down = inp[:, self.input_order][:, 1::2].clone() + inp_cumsum_up = torch.cat((torch.zeros((inp_up.shape[0],1)).to(device), ((inp_up + 1)/2).cumsum(-1)[:, :-1]), axis=-1) + inp_cumsum_down = torch.cat((torch.zeros((inp_down.shape[0],1)).to(device), ((inp_down + 1)/2).cumsum(-1)[:, :-1]), axis=-1) + upper_bound_up = self.num_spin_up + lower_bound_up = (self.num_spin_up - (N - torch.arange(1, N+1))) + condition1_up = (inp_cumsum_up < lower_bound_up.to(device)).float() + condition2_up = (inp_cumsum_up >= upper_bound_up).float() + upper_bound_down = self.num_spin_down + lower_bound_down = (self.num_spin_down - (N - torch.arange(1, N+1))) + condition1_down = (inp_cumsum_down < lower_bound_down.to(device)).float() + condition2_down = (inp_cumsum_down >= upper_bound_down).float() + idx = torch.sort(self.shell_order)[1] + # first entry must be down + log_psi_cond[:,:,0].masked_fill_(condition1_up[:,idx]==1, float('-inf')) + log_psi_cond[:,:,2].masked_fill_(condition1_up[:,idx]==1, float('-inf')) + # second entry must be down + log_psi_cond[:,:,0].masked_fill_(condition1_down[:,idx]==1, float('-inf')) + log_psi_cond[:,:,1].masked_fill_(condition1_down[:,idx]==1, float('-inf')) + # first entry must be up + 
log_psi_cond[:,:,1].masked_fill_(condition2_up[:,idx]==1, float('-inf')) + log_psi_cond[:,:,3].masked_fill_(condition2_up[:,idx]==1, float('-inf')) + # second entry must be up + log_psi_cond[:,:,2].masked_fill_(condition2_down[:,idx]==1, float('-inf')) + log_psi_cond[:,:,3].masked_fill_(condition2_down[:,idx]==1, float('-inf')) + return log_psi_cond + + @torch.no_grad() + def sample(self, bs: int, num_samples: int) -> [torch.Tensor, torch.Tensor]: + ''' + RetNet-specific Monte Carlo sampling function + ''' + self.eval() + self.sampling = True + sample_multinomial = True + # Create arrays for conditions + N = self.num_sites//2 + self.upper_bound_up = self.num_spin_up + self.lower_bound_up = (self.num_spin_up - (N - torch.arange(1, N+1))) + self.upper_bound_down = self.num_spin_down + self.lower_bound_down = (self.num_spin_down - (N - torch.arange(1, N+1))) + + # randomly initialize a configuration of values +- 1 + uniq_samples = (torch.randn(1, self.num_sites).to(self.device) > 0.0).float() * 2 - 1 + uniq_count = torch.tensor([num_samples]).to(self.device) + count_up = torch.zeros(1).to(self.device) + count_down = torch.zeros(1).to(self.device) + prev_states = [] + for index in range(len(self.shell_order)): + i = self.shell_order[index] + if index == 0: + model_input = 4*torch.ones(uniq_samples.shape[0], dtype=torch.int64).to(self.device) + else: + model_input = self.state2shell(uniq_samples[:, 2*self.shell_order[index - 1]:2*(1 + self.shell_order[index - 1])]).squeeze() + if uniq_samples.shape[0] == 1: + model_input = model_input.unsqueeze(0) + prob, prev_states = self.forward(model_input, count_up=count_up, count_down=count_down, idx=i, prev_states=prev_states) # num_uniq, 4 + num_uniq = uniq_samples.shape[0] + uniq_samples = uniq_samples.repeat(4,1) # 4*num_uniq, num_sites + count_up = torch.cat(4*[count_up]) + count_down = torch.cat(4*[count_down]) + # convert [|-1,-1>, |1,-1>, |-1,1>, |1,1>] to [0, 1, 2, 3] + uniq_samples[:num_uniq, 2*i] = -1 + uniq_samples[:num_uniq, 2*i+1] = -1 + uniq_samples[num_uniq:2*num_uniq, 2*i] = 1 + uniq_samples[num_uniq:2*num_uniq, 2*i+1] = -1 + uniq_samples[2*num_uniq:3*num_uniq, 2*i] = -1 + uniq_samples[2*num_uniq:3*num_uniq, 2*i+1] = 1 + uniq_samples[3*num_uniq:4*num_uniq, 2*i] = 1 + uniq_samples[3*num_uniq:4*num_uniq, 2*i+1] = 1 + count_up[num_uniq:2*num_uniq] += 1 + count_up[3*num_uniq:4*num_uniq] += 1 + count_down[2*num_uniq:3*num_uniq] += 1 + count_down[3*num_uniq:4*num_uniq] += 1 + if sample_multinomial: + uniq_count = torch.tensor(self.multinomial_arr(uniq_count.long().data.cpu().numpy(), prob.data.cpu().numpy())).T.flatten().to(prob.device) + else: + uniq_count = (uniq_count.unsqueeze(-1)*prob).T.flatten().round() + keep_idx = uniq_count > 0 + uniq_samples = uniq_samples[keep_idx] + uniq_count = uniq_count[keep_idx] + prev_state_idx = torch.arange(num_uniq).repeat(4).to(self.device) + prev_state_idx = prev_state_idx[keep_idx] + count_up = count_up[keep_idx] + count_down = count_down[keep_idx] + uniq_samples = uniq_samples[uniq_count.sort()[1][-2*bs:]] + prev_state_idx = prev_state_idx[uniq_count.sort()[1][-2*bs:]] + for j in range(len(prev_states)): + prev_states[j] = prev_states[j][prev_state_idx] + count_up = count_up[uniq_count.sort()[1][-2*bs:]] + count_down = count_down[uniq_count.sort()[1][-2*bs:]] + uniq_count = uniq_count[uniq_count.sort()[1][-2*bs:]] + uniq_samples = uniq_samples[uniq_count.sort()[1][-bs:]] + uniq_count = uniq_count[uniq_count.sort()[1][-bs:]] + self.sampling = False + self.train() + return [uniq_samples, uniq_count] diff 
--git a/examples/neural_quantum_states/src/models/transformer.py b/examples/neural_quantum_states/src/models/transformer.py new file mode 100644 index 0000000..64a32af --- /dev/null +++ b/examples/neural_quantum_states/src/models/transformer.py @@ -0,0 +1,172 @@ +import numpy as np +import torch +import torch.nn as nn +import torch.nn.functional as F + +from .base import Base + +class NNQSTransformer(Base): + def __init__(self, num_sites: int, num_spin_up: int, num_spin_down: int, made_width: int=64, made_depth: int=2, embedding_dim: int=16, nhead: int=2, dim_feedforward: int=64, num_layers: int=1, temperature: float=1.0, device: str=None, **kwargs): + ''' + A Transformer-based autoregressive NQS Ansatz + Child class specific args: + made_width: width of phase network hidden layers + made_depth: number of phase network hidden layers + embedding_dim: dimension of transformer hidden states + nhead: number of attention heads + dim_feedforward: dimension of transformer feedforward layer + num_layers: number of transformer blocks + temperature: modulus network softmax temperature parameter + device: device to store model on + ''' + super(NNQSTransformer, self).__init__(num_sites, num_spin_up, num_spin_down, device) + + # construct model + self.num_in, self.num_out = num_sites, num_sites*2 + self.temperature = temperature + self.input_order = np.stack([np.arange(self.num_sites-2,-1,-2), np.arange(self.num_sites-1,-1,-2)],1).reshape(-1) # [4,5,2,3,0,1] + self.input_order = torch.Tensor(self.input_order).int().to(self.device) + self.shell_order = torch.arange(self.num_sites//2-1, -1, -1) # [2,1,0] + + transformer_layer = nn.TransformerEncoderLayer(embedding_dim, nhead, dim_feedforward=dim_feedforward, dropout=0.0, batch_first=True) + self.transformer = nn.TransformerEncoder(transformer_layer, num_layers) + self.fc = nn.Linear(embedding_dim, 4) + self.tok_emb = nn.Embedding(5, embedding_dim) + self.pos_emb = nn.Embedding(len(self.shell_order), embedding_dim) + self.apply(self._init_weights) + self.softmax = nn.Softmax(dim=-1) + self.log_softmax = nn.LogSoftmax(dim=-1) + + self.net_phase = [nn.Linear(in_features=self.num_in-2, out_features=made_width, bias=True)] + for i in range(made_depth): + self.net_phase += [nn.ReLU(), nn.Linear(in_features=made_width, out_features=made_width, bias=True)] + self.net_phase += [nn.ReLU(), nn.Linear(in_features=made_width, out_features=4, bias=True)] + self.net_phase = nn.Sequential(*self.net_phase) + + self.mask = torch.zeros((len(self.shell_order), len(self.shell_order))).to(self.device) + for i in range(len(self.mask)): + for j in range(len(self.mask)): + if i < j: + self.mask[i][j] = float('-inf') + + def _init_weights(self, module: nn.Module): + if isinstance(module, nn.Linear): + torch.nn.init.normal_(module.weight, mean=0.0, std=0.02) + if module.bias is not None: + torch.nn.init.zeros_(module.bias) + elif isinstance(module, nn.Embedding): + torch.nn.init.normal_(module.weight, mean=0.0, std=0.02) + elif isinstance(module, nn.LayerNorm): + torch.nn.init.zeros_(module.bias) + torch.nn.init.ones_(module.weight) + + def forward(self, x: torch.Tensor, sample_shell: int = 0) -> torch.Tensor: + ''' + Forward function for Transformer ansatz (used for both sampling and training) + Args: + x: qubit spin configuration + sample_shell: shell index provided during sampling to avoid extraneous forward passes + Returns: + prob_cond/log_psi: either conditional probabilities of logarithms of statevector entries, depending on if sampling + ''' + # x: [bs, num_sites] + shells 
= self.state2shell(x)[:, self.shell_order] + input = 4*torch.ones(shells.shape, dtype=torch.int64).to(self.device) + input[:,1:] = shells[:,:-1] + pos = self.shell_order.to(self.device) + + input = self.tok_emb(input) + self.pos_emb(pos) + # new x is of shape (batch_size, sequence_length, d_model) + + if self.mask.device != self.device: + self.mask = self.mask.to(self.device) + output = self.transformer(input[:,:(len(self.shell_order) - sample_shell + 1)], mask=self.mask[:(len(self.shell_order) - sample_shell + 1),:(len(self.shell_order) - sample_shell + 1)], is_causal=True) + output = self.fc(output) + if output.shape[1] < len(self.shell_order): + new_output = torch.zeros(output.shape[0], len(self.shell_order), output.shape[2]).to(self.device) + new_output[:,:output.shape[1],:] = output + output = new_output[:, self.shell_order] + else: + output = output[:, self.shell_order] + if (self.num_spin_up + self.num_spin_down) >= 0: + logits_cls = self.apply_constraint(x, output) + logits_cls /= self.temperature + if self.sampling: + prob_cond = self.softmax(logits_cls) + return prob_cond + else: + log_psi_cond = 0.5 * self.log_softmax(logits_cls) + idx = self.state2shell(x) + log_psi_real = log_psi_cond.gather(-1, idx.unsqueeze(-1)).sum(-1).sum(-1) + log_psi_imag = self.net_phase(x[:, :-2]).gather(-1, idx[:, -1].unsqueeze(-1)).squeeze() + if log_psi_real.shape[0] == 1: + log_psi_imag = log_psi_imag.reshape(log_psi_real.shape) + log_psi = torch.stack((log_psi_real, log_psi_imag), dim=-1) + return log_psi + + def apply_constraint(self, inp, log_psi_cond): + # convert [|-1,-1>, |1,-1>, |-1,1>, |1,1>] to [0, 1, 2, 3] + device = inp.device + N = inp.shape[-1] // 2 + inp_up = inp[:, self.input_order][:, ::2].clone() + inp_down = inp[:, self.input_order][:, 1::2].clone() + inp_cumsum_up = torch.cat((torch.zeros((inp_up.shape[0],1)).to(device), ((1 + inp_up)/2).cumsum(-1)[:, :-1]), axis=-1) + inp_cumsum_down = torch.cat((torch.zeros((inp_down.shape[0],1)).to(device), ((1 + inp_down)/2).cumsum(-1)[:, :-1]), axis=-1) + upper_bound_up = self.num_spin_up + lower_bound_up = (self.num_spin_up - (N - torch.arange(1, N+1))) + condition1_up = (inp_cumsum_up < lower_bound_up.to(device)).float() + condition2_up = (inp_cumsum_up >= upper_bound_up).float() + upper_bound_down = self.num_spin_down + lower_bound_down = (self.num_spin_down - (N - torch.arange(1, N+1))) + condition1_down = (inp_cumsum_down < lower_bound_down.to(device)).float() + condition2_down = (inp_cumsum_down >= upper_bound_down).float() + idx = torch.sort(self.shell_order)[1] + # first entry must be down + log_psi_cond[:,:,0].masked_fill_(condition1_up[:,idx]==1, float('-inf')) + log_psi_cond[:,:,2].masked_fill_(condition1_up[:,idx]==1, float('-inf')) + # second entry must be down + log_psi_cond[:,:,0].masked_fill_(condition1_down[:,idx]==1, float('-inf')) + log_psi_cond[:,:,1].masked_fill_(condition1_down[:,idx]==1, float('-inf')) + # first entry must be up + log_psi_cond[:,:,1].masked_fill_(condition2_up[:,idx]==1, float('-inf')) + log_psi_cond[:,:,3].masked_fill_(condition2_up[:,idx]==1, float('-inf')) + # second entry must be up + log_psi_cond[:,:,2].masked_fill_(condition2_down[:,idx]==1, float('-inf')) + log_psi_cond[:,:,3].masked_fill_(condition2_down[:,idx]==1, float('-inf')) + return log_psi_cond + + @torch.no_grad() + def sample(self, bs, num_samples): + self.eval() + self.sampling = True + sample_multinomial = True + # random initialize a configuration of values +- 1 + uniq_samples = (torch.randn(1, self.num_sites).to(self.device) > 
0.0).float() * 2 - 1 + uniq_count = torch.tensor([num_samples]).to(self.device) + for i in self.shell_order: + prob = self.forward(uniq_samples, i)[:, i] # num_uniq, 4 + num_uniq = uniq_samples.shape[0] + uniq_samples = uniq_samples.repeat(4,1) # 4*num_uniq, num_sites + # convert [|-1,-1>, |1,-1>, |-1,1>, |1,1>] to [0, 1, 2, 3] + uniq_samples[:num_uniq, 2*i] = -1 + uniq_samples[:num_uniq, 2*i+1] = -1 + uniq_samples[num_uniq:2*num_uniq, 2*i] = 1 + uniq_samples[num_uniq:2*num_uniq, 2*i+1] = -1 + uniq_samples[2*num_uniq:3*num_uniq, 2*i] = -1 + uniq_samples[2*num_uniq:3*num_uniq, 2*i+1] = 1 + uniq_samples[3*num_uniq:4*num_uniq, 2*i] = 1 + uniq_samples[3*num_uniq:4*num_uniq, 2*i+1] = 1 + if sample_multinomial: + uniq_count = torch.tensor(self.multinomial_arr(uniq_count.long().data.cpu().numpy(), prob.data.cpu().numpy())).T.flatten().to(prob.device) + else: + uniq_count = (uniq_count.unsqueeze(-1)*prob).T.flatten().round() + keep_idx = uniq_count > 0 + uniq_samples = uniq_samples[keep_idx] + uniq_count = uniq_count[keep_idx] + uniq_samples = uniq_samples[uniq_count.sort()[1][-2*bs:]] + uniq_count = uniq_count[uniq_count.sort()[1][-2*bs:]] + uniq_samples = uniq_samples[uniq_count.sort()[1][-bs:]] + uniq_count = uniq_count[uniq_count.sort()[1][-bs:]] + self.sampling = False + self.train() + return [uniq_samples, uniq_count] diff --git a/examples/neural_quantum_states/src/objective/__init__.py b/examples/neural_quantum_states/src/objective/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/examples/neural_quantum_states/src/objective/__pycache__/__init__.cpython-312.pyc b/examples/neural_quantum_states/src/objective/__pycache__/__init__.cpython-312.pyc new file mode 100644 index 0000000..3306e2a Binary files /dev/null and b/examples/neural_quantum_states/src/objective/__pycache__/__init__.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/objective/__pycache__/__init__.cpython-39.pyc b/examples/neural_quantum_states/src/objective/__pycache__/__init__.cpython-39.pyc new file mode 100644 index 0000000..5a555ce Binary files /dev/null and b/examples/neural_quantum_states/src/objective/__pycache__/__init__.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/objective/__pycache__/adaptive_shadows.cpython-39.pyc b/examples/neural_quantum_states/src/objective/__pycache__/adaptive_shadows.cpython-39.pyc new file mode 100644 index 0000000..5a69281 Binary files /dev/null and b/examples/neural_quantum_states/src/objective/__pycache__/adaptive_shadows.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/objective/__pycache__/automatic.cpython-312.pyc b/examples/neural_quantum_states/src/objective/__pycache__/automatic.cpython-312.pyc new file mode 100644 index 0000000..a79f8bd Binary files /dev/null and b/examples/neural_quantum_states/src/objective/__pycache__/automatic.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/objective/__pycache__/automatic.cpython-39.pyc b/examples/neural_quantum_states/src/objective/__pycache__/automatic.cpython-39.pyc new file mode 100644 index 0000000..070d00d Binary files /dev/null and b/examples/neural_quantum_states/src/objective/__pycache__/automatic.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/objective/__pycache__/hamiltonian.cpython-312.pyc b/examples/neural_quantum_states/src/objective/__pycache__/hamiltonian.cpython-312.pyc new file mode 100644 index 0000000..bd90136 Binary files /dev/null and 
b/examples/neural_quantum_states/src/objective/__pycache__/hamiltonian.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/objective/__pycache__/hamiltonian.cpython-39.pyc b/examples/neural_quantum_states/src/objective/__pycache__/hamiltonian.cpython-39.pyc new file mode 100644 index 0000000..af5bb45 Binary files /dev/null and b/examples/neural_quantum_states/src/objective/__pycache__/hamiltonian.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/objective/__pycache__/naive_sampler.cpython-39.pyc b/examples/neural_quantum_states/src/objective/__pycache__/naive_sampler.cpython-39.pyc new file mode 100644 index 0000000..7d7cad0 Binary files /dev/null and b/examples/neural_quantum_states/src/objective/__pycache__/naive_sampler.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/objective/__pycache__/util.cpython-312.pyc b/examples/neural_quantum_states/src/objective/__pycache__/util.cpython-312.pyc new file mode 100644 index 0000000..2b16199 Binary files /dev/null and b/examples/neural_quantum_states/src/objective/__pycache__/util.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/objective/__pycache__/util.cpython-39.pyc b/examples/neural_quantum_states/src/objective/__pycache__/util.cpython-39.pyc new file mode 100644 index 0000000..96e885b Binary files /dev/null and b/examples/neural_quantum_states/src/objective/__pycache__/util.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/objective/adaptive_shadows.py b/examples/neural_quantum_states/src/objective/adaptive_shadows.py new file mode 100644 index 0000000..1878d92 --- /dev/null +++ b/examples/neural_quantum_states/src/objective/adaptive_shadows.py @@ -0,0 +1,147 @@ +import numpy as np +import torch + +from collections import Counter +from src.complex import exp, scalar_mult, real, norm_square +from .hamiltonian import Hamiltonian + + +class AdaptiveShadows(Hamiltonian): + ''' + A variation of Automatic that stochastically estimates the Hamiltonian using Adaptive Pauli Shadows, a classical shadows--based Monte Carlo method that samples from Pauli strings: https://arxiv.org/abs/2105.12207 + This estimated Hamiltonian can then be used to generate local energy estimates during NQS training. 
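Roughly, each sampled Pauli string 'covers' the Hamiltonian terms that remain compatible with it during qubit-by-qubit sampling, and each covered term's coefficient is split evenly among all sampled strings that cover it (see generate_coefficients), so summing the per-string contributions reconstructs an estimate of the full Hamiltonian expectation.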
Code has not been adquately tested and its accuracy is not completely known + Args: + hamiltonian_string: Pauli string representation of Hamiltonian + num_sites: qubit number of system + ''' + def __init__(self, hamiltonian_string, num_sites, sample_count, total_unique_samples, reset_prob, flip_bs, **kwargs): + super(AdaptiveShadows, self).__init__(hamiltonian_string, num_sites) + # product of identity operators by default, encoded as 0 + self.coefficients = torch.stack((self.coefficients.real, self.coefficients.imag), dim=-1) + self.coefficients_square = norm_square(self.coefficients) + self.sample_count = sample_count + self.total_unique_samples = total_unique_samples + self.reset_prob = reset_prob + self.flip_bs = flip_bs + self.sample_X_idx, self.sample_Y_idx, self.sample_Z_idx, cover_list = self.generate_sample_paulis(self.sample_count) + self.covers = {} + self.sample_coeffs = torch.zeros(self.sample_count, 2) + self.sample_counts = Counter() + for i in range(len(cover_list)): + cover = cover_list[i] + self.covers[i] = cover + self.sample_counts.update(cover) + del(cover_list) + self.generate_coefficients() + self.generate_loss_idxs() + self.counter = 0 # Counter to keep track of which term to replace when updating sample list + + def generate_coefficients(self): + for i in range(len(self.sample_Z_idx)): + cover = self.covers[i] + keys = list(cover.keys()) + self.sample_coeffs[i] = (self.coefficients[keys]/torch.tensor([self.sample_counts[key] for key in keys]).to(self.coefficients.device).reshape(-1,1)).sum(0).to(self.sample_coeffs.device) + part1 = (-1j)**self.sample_Y_idx.sum(-1) + part1 = torch.stack((part1.real, part1.imag), dim=-1).float().to(self.sample_coeffs.device) + self.sample_coeffs = scalar_mult(self.sample_coeffs, part1) + + def generate_loss_idxs(self): + flip_idx = self.sample_X_idx + self.sample_Y_idx + self.select_idx = self.sample_Y_idx + self.sample_Z_idx + self.unique_flips, self.unique_indices = torch.unique(flip_idx, return_inverse=True, dim=0) + self.unique_flips = 1 - 2*self.unique_flips.unsqueeze(0) + self.unique_num_terms = self.unique_flips.shape[1] + + def update_sample_batch(self): + new_sample_x, new_sample_y, new_sample_z, new_cover = self.generate_sample_paulis(1) + if self.sample_X_idx.shape[0] < self.total_unique_samples: + self.sample_X_idx = torch.cat((self.sample_X_idx, new_sample_x.unsqueeze(0)), dim=0) + self.sample_Y_idx = torch.cat((self.sample_Y_idx, new_sample_y.unsqueeze(0)), dim=0) + self.sample_Z_idx = torch.cat((self.sample_Z_idx, new_sample_z.unsqueeze(0)), dim=0) + self.covers[self.sample_X_idx.shape[0] - 1] = new_cover[0] + self.sample_counts.update(new_cover[0]) + self.sample_coeffs = torch.cat((self.sample_coeffs, torch.zeros(1,2).to(self.sample_coeffs.device)), dim=0) + else: + self.sample_X_idx[self.counter] = new_sample_x + self.sample_Y_idx[self.counter] = new_sample_y + self.sample_Z_idx[self.counter] = new_sample_z + old_cover = self.covers[self.counter] + self.covers[self.counter] = new_cover[0] + self.sample_counts.subtract(old_cover) + self.sample_counts.update(new_cover[0]) + self.counter = (self.counter+1)%self.total_unique_samples + self.generate_coefficients() + self.generate_loss_idxs() + + + def generate_sample_paulis(self, num_samples): + samples = torch.zeros(num_samples, self.input_dim) # 'X' is 1, 'Y' is 2, 'Z' is 3, 'I' is 0 + covers = [Counter(dict(zip(np.arange(self.num_terms), np.ones(self.num_terms)))) for _ in range(num_samples)] # Dictionaries of potential covers for each Pauli sample + for i in 
range(num_samples): + orders = torch.randperm(self.input_dim) + for qubit in orders: + candidates = list(covers[i].keys()) + cover_partition = [Counter(), Counter(), Counter(), Counter()] # 'I' is 0, 'X' is 1, 'Y' is 2, 'Z' is 3 + for term in candidates: + cover_partition[self.operators[term, qubit]][term] = covers[i][term] + del covers[i][term] + covers[i].update(cover_partition[0]) + prob_vector = torch.zeros(4) + all_zero = True + for j in range(1,4): + if len(cover_partition[j]) == 0: + continue + else: + values = self.coefficients_square[list(cover_partition[j].keys())] + prob = torch.sqrt(sum(values)) + if prob > 0: + all_zero = False + prob_vector[j] = prob + if all_zero: + prob_vector[1:] = 1.0 + sample = torch.multinomial(prob_vector, 1) + samples[i, qubit] = sample + covers[i].update(cover_partition[sample]) + samples = samples.squeeze().to(self.coefficients.device) + return (samples==1).int(), (samples==2).int(), (samples==3).int(), covers + + def compute_local_energy(self, x, model): + # see appendix B of https://arxiv.org/pdf/1909.12852.pdf + # x [bs, input_dim] + bs = x.shape[0] + # Randomly regenerate sample Hamiltonian batch + if torch.rand(1) < self.reset_prob: + self.update_sample_batch() + process_bs = int(np.ceil(self.unique_num_terms*bs/self.flip_bs)) + # first obtain model output for input batch # [bs, 2] + log_psi = model(x) # [bs, 2] + with torch.no_grad(): + # log_psi_k comprises model outputs corresponding to unique flips of the Hamiltonian for each batch sample # [bs*unique_num_terms, 2] + log_psi_k = torch.zeros(bs*self.unique_num_terms, 2).to(x.device) + x_k = (x.unsqueeze(1) * self.unique_flips).reshape(-1, self.input_dim) # [bs*unique_num_terms, input_dim] + # further batching is done to conserve GPU memory footprint + for i in range(self.flip_bs): + log_psi_k[process_bs*i:process_bs*(i + 1)] = model(x_k[process_bs*i:process_bs*(i+1)]) + if len(log_psi.shape) == 1: # if not complex + log_psi_k = torch.stack([log_psi_k, torch.zeros_like(log_psi_k).to(log_psi_k.device)], dim=-1) + log_psi = torch.stack([log_psi, torch.zeros_like(log_psi).to(log_psi.device)], dim=-1) + log_psi_k = log_psi_k.reshape(bs, self.unique_num_terms, 2) # [bs, unique_num_terms, 2] + log_psi_k = log_psi_k[:, self.unique_indices] # [bs, num_samples, 2] + ratio = exp(log_psi_k-log_psi.unsqueeze(1)).detach() # [bs, num_samples, 2] + # compute matrix element + # Eq.
B3 + part2 = (x.unsqueeze(1).repeat(1, self.sample_coeffs.shape[0], 1) * self.select_idx.unsqueeze(0) + (1-self.select_idx).unsqueeze(0)).prod(-1) # [bs, num_terms] + mtx_k = torch.stack((part2, torch.zeros_like(part2)), dim=-1) + # total local energy + local_energy = scalar_mult(self.sample_coeffs.unsqueeze(0), scalar_mult(mtx_k, ratio)).sum(1) # [bs, 2] + return local_energy.detach(), log_psi + + def set_device(self, device): + self.coefficients = self.coefficients.to(device) + self.select_idx = self.select_idx.to(device) + self.unique_flips = self.unique_flips.to(device) + self.unique_indices = self.unique_indices.to(device) + self.sample_X_idx = self.sample_X_idx.to(device) + self.sample_Y_idx = self.sample_Y_idx.to(device) + self.sample_Z_idx = self.sample_Z_idx.to(device) + self.sample_coeffs = self.sample_coeffs.to(device) diff --git a/examples/neural_quantum_states/src/objective/automatic.py b/examples/neural_quantum_states/src/objective/automatic.py new file mode 100644 index 0000000..10c619c --- /dev/null +++ b/examples/neural_quantum_states/src/objective/automatic.py @@ -0,0 +1,93 @@ +import numpy as np +import torch + +from src.complex import exp, scalar_mult +from .hamiltonian import Hamiltonian + + +class Automatic(Hamiltonian): + def __init__(self, hamiltonian_string: str, num_sites: int, flip_bs: int, **kwargs): + ''' + A "standard" NQS Hamiltonian object. Object takes in Pauli string form of Hamiltonian and stores it as four tensors: one storing the unique bit flip indices corresponding with all terms in the Hamiltonian, a tensor mapping those flip indices to the original Hamiltonian terms, a tensor storing the phase flip indices of the Hamiltonian terms, and the scalar coefficients + Args: + hamiltonian_string: Pauli string representation of Hamiltonian + num_sites: qubit number of system + flip_bs: largest batch size of model input tensors that each GPU is expected to handle at once + ''' + super(Automatic, self).__init__(hamiltonian_string, num_sites) + # product of identity operators by default, encoded as 0 + self.coefficients = torch.stack((self.coefficients.real, self.coefficients.imag), dim=-1) + self.flip_bs = flip_bs + # find index of pauli X,Y,Z operators + pauli_x_idx = (self.operators==1).int() # [num_terms, input_dim] + pauli_y_idx = (self.operators==2).int() # [num_terms, input_dim] + pauli_z_idx = (self.operators==3).int() # [num_terms, input_dim] + del self.operators + # track the exponential of -i + self.num_pauli_y = pauli_y_idx.sum(-1) # [num_terms] + part1 = (-1j)**self.num_pauli_y.detach() + part1 = torch.stack((part1.real, part1.imag), dim=-1).float() + self.coefficients = scalar_mult(self.coefficients, part1) + del part1 + # the unique element has flipped value if the corresponding pauli is x or y. + flip_idx = pauli_x_idx + pauli_y_idx # [num_terms, input_dim] + # self.flip_idx = flip_idx + del pauli_x_idx + # only the entry value with y or z pauli is multiplied + self.select_idx = pauli_y_idx + pauli_z_idx + del pauli_y_idx + del pauli_z_idx + self.unique_flips, self.unique_indices = torch.unique(flip_idx, sorted=True, return_inverse=True, dim=0) + self.unique_flips = 1 - 2*(self.unique_flips.unsqueeze(0)) + self.unique_num_terms = self.unique_flips.shape[1] + print('Number of unique flips in Hamiltonian: {}'.format(self.unique_num_terms)) + + def compute_local_energy(self, x: torch.Tensor, model: torch.nn.Module) -> [torch.Tensor, torch.Tensor]: + ''' + Compute local energy values of Hamiltonian w.r.t. 
batch of qubit spin configurations and an ansatz model + Args: + x: qubit spin configurations + model: NQS ansatz + Returns: + local_energy: local energy values (detached from computational graph) + log_psi: logarithms of ansatz statevector entries (attached to computational graph) + ''' + # see appendix B of https://arxiv.org/pdf/1909.12852.pdf + # x [bs, input_dim] + bs = x.shape[0] + process_bs = int(np.ceil(self.unique_num_terms*bs/self.flip_bs)) + # first obtain model output for input batch # [bs, 2] + log_psi = model(x) # [bs, 2] + with torch.no_grad(): + # log_psi_k comprises model outputs corresponding to unique flips of the Hamiltonian for each batch sample # [bs*unique_num_terms, 2] + log_psi_k = torch.zeros(bs*self.unique_num_terms, 2).to(x.device) + x_k = (x.unsqueeze(1) * self.unique_flips).reshape(-1, self.input_dim) # [bs*unique_num_terms, input_dim] + # further batching is done to conserve GPU memory footprint + model.inference = True + flip_batches = np.ceil(x_k.shape[0]/process_bs).astype(np.int64) + for i in range(flip_batches): + log_psi_k[process_bs*i:process_bs*(i + 1)] = model(x_k[process_bs*i:process_bs*(i + 1)]) + model.inference = False + if len(log_psi.shape) == 1: # if not complex + log_psi_k = torch.stack([log_psi_k, torch.zeros_like(log_psi_k).to(log_psi_k.device)], dim=-1) + log_psi = torch.stack([log_psi, torch.zeros_like(log_psi).to(log_psi.device)], dim=-1) + log_psi_k = log_psi_k.reshape(bs, self.unique_num_terms, 2) # [bs, unique_num_terms, 2] + log_psi_k = log_psi_k[:, self.unique_indices] # [bs, num_terms, 2] + ratio = exp(log_psi_k-log_psi.unsqueeze(1)).detach() # [bs, num_terms, 2] + # compute matrix element + # Eq. B3 + part2 = (x.unsqueeze(1).repeat(1, self.num_terms, 1) * self.select_idx.unsqueeze(0) + (1-self.select_idx).unsqueeze(0)).prod(-1) # [bs, num_terms] + mtx_k = torch.stack((part2, torch.zeros_like(part2)), dim=-1) + # total local energy + local_energy = scalar_mult(self.coefficients.unsqueeze(0), scalar_mult(mtx_k, ratio)).sum(1) # [bs, 2] + return local_energy.detach(), log_psi + + def set_device(self, device: str): + ''' + Sets device of all relevant object tensors to 'device' + ''' + self.coefficients = self.coefficients.to(device) + self.num_pauli_y = self.num_pauli_y.to(device) + self.select_idx = self.select_idx.to(device) + self.unique_flips = self.unique_flips.to(device) + self.unique_indices = self.unique_indices.to(device) diff --git a/examples/neural_quantum_states/src/objective/hamiltonian.py b/examples/neural_quantum_states/src/objective/hamiltonian.py new file mode 100644 index 0000000..9d0bcfe --- /dev/null +++ b/examples/neural_quantum_states/src/objective/hamiltonian.py @@ -0,0 +1,88 @@ +import numpy as np +import torch +import torch.nn as nn +from src.complex import scalar_mult, real, imag + +class Hamiltonian(nn.Module): + def __init__(self, hamiltonian_string, num_sites): + super().__init__() + self.operators, self.coefficients = self.parse_hamiltonian_string(hamiltonian_string, num_sites) + self.num_terms, self.input_dim = self.operators.shape + print("Number of terms is {}.".format(self.num_terms)) + + def forward(self, config: torch.Tensor, model: nn.Module) -> [torch.Tensor, torch.Tensor]: + # Wrapper function for self.compute_local_energy + return self.compute_local_energy(config, model) + + def set_device(self, device: str): # Sets devices of all Hamiltonian attributes to 'device' + pass + + def compute_local_energy(self, config: torch.Tensor, model: nn.Module) -> [torch.Tensor, torch.Tensor]: + ''' + Template 
for computing the local energies of the Hamiltonian using a given set of spin configurations and a given model + Args: + - config: tensor batch of spin configurations + - model: ansatz model + Returns: + - local_energies: corresponding batch of local energies, detached from model computational graph + - log_psi: corresponding batch of model outputs, attached to model computational graph + ''' + pass + + def parse_hamiltonian_string(self, hamiltonian_string: str, num_sites: int) -> [np.ndarray, np.ndarray]: + ''' + Converts string encoding Hamiltonian (in Pauli string basis) to separate arrays containing Hamiltonian strings (encoded as integer vectors) and their corresponding scalar coefficients + Args: + hamiltonian_string: Pauli Hamiltonian represented as a string + num_sites: Number of qubits in Hamiltonian system + Returns: + hmtn_ops: integer array representing Pauli strings in Hamiltonian + params: scalar parameters + ''' + splitted_string = hamiltonian_string.split('+\n') + num_terms = len(splitted_string) + params = np.zeros([num_terms]).astype(np.complex128) + hmtn_ops = np.zeros([num_terms, num_sites]) + for i,term in enumerate(splitted_string): + params[i] = complex(term.split(' ')[0]) + ops = term[term.index('[')+1:term.index(']')] + ops_lst = ops.split(' ') + for op in ops_lst: + if op == '': + continue + pauli_type = op[0] + idx = int(op[1:]) + if pauli_type == 'X': + encoding = 1 + elif pauli_type == 'Y': + encoding = 2 + elif pauli_type == 'Z': + encoding = 3 + elif pauli_type == 'I': + encoding = 0 + else: + raise ValueError("Unknown pauli_type!") + hmtn_ops[i, idx] = encoding + return torch.tensor(hmtn_ops).int(), torch.tensor(params) + +def get_hamiltonian(hamiltonian_choice: str, hamiltonian_data: dict) -> nn.Module: + """ + Returns an instance of Hamiltonian based on the choice of Hamiltonian and additional parameters. + Args: + hamiltonian_choice: Choice of hamiltonian type, default is 'exact' a.k.a. 'Automatic'. + hamiltonian_data: Dictionary of parameters specific to each Hamiltonian type + Returns: + Hamiltonian: An instance of Hamiltonian. + """ + if hamiltonian_choice in ['sample']: + from .naive_sampler import NaiveSampler + return NaiveSampler(**hamiltonian_data) + elif hamiltonian_choice in ['adaptive_shadows']: + from .adaptive_shadows import AdaptiveShadows + return AdaptiveShadows(**hamiltonian_data) + elif hamiltonian_choice in ['exact']: + from .automatic import Automatic + return Automatic(**hamiltonian_data) + else: + raise Exception('Hamiltonian choice not recognized!') + diff --git a/examples/neural_quantum_states/src/objective/naive_sampler.py b/examples/neural_quantum_states/src/objective/naive_sampler.py new file mode 100644 index 0000000..0c0ed6a --- /dev/null +++ b/examples/neural_quantum_states/src/objective/naive_sampler.py @@ -0,0 +1,192 @@ +import numpy as np +import torch + +from src.complex import exp, scalar_mult +from .hamiltonian import Hamiltonian + + +class NaiveSampler(Hamiltonian): + def __init__(self, hamiltonian_string: str, num_sites: int, sample_count: int, total_unique_samples: int, reset_prob: float, flip_bs: int, **kwargs): + ''' + A variation of the Automatic class that stochastically estimates the input Hamiltonian with Pauli strings sampled from the distribution proportional to the absolute values of the scalar coefficients (simple to construct because the Pauli string coefficients are real for Hamiltonians).
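In effect, writing H = sum_j c_j P_j and drawing term indices with probability p_j = |c_j| / sum_k |c_k|, one has H = (sum_k |c_k|) * E_{j~p}[sign(c_j) P_j]; averaging sign(c_j) P_j over the drawn terms and rescaling by the stored norm constant, as compute_local_energy does, therefore gives an unbiased estimate of the full Hamiltonian.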
This estimated Hamiltonian can be used to create local energy estimates during NQS training for (ideally) lower computational cost. + ''' + super(NaiveSampler, self).__init__(hamiltonian_string, num_sites) + self.flip_bs = flip_bs + # product of identity operators by default, encoded as 0 + self.coefficients = torch.stack((self.coefficients.real, self.coefficients.imag), dim=-1) + # find index of pauli X,Y,Z operators + pauli_x_idx = (self.operators==1).int() # [num_terms, input_dim] + pauli_y_idx = (self.operators==2).int() # [num_terms, input_dim] + pauli_z_idx = (self.operators==3).int() # [num_terms, input_dim] + del self.operators + # create the probability tensor and change the coefficient tensor to a sign tensor + self.probabilities = torch.abs(self.coefficients[:,0]) + self.norm_constant = torch.sum(self.probabilities) + self.probabilities = self.probabilities/self.norm_constant + self.coefficients = torch.sign(self.coefficients) + # track the exponential of -i + self.num_pauli_y = pauli_y_idx.sum(-1) # [num_terms] + part1 = (-1j)**self.num_pauli_y.detach().cpu().numpy() + part1 = torch.stack((torch.tensor(part1.real), torch.tensor(part1.imag)), dim=-1).float() + self.coefficients = scalar_mult(self.coefficients, part1) + del part1 + # the unique element has flipped value if the corresponding pauli is x or y. + flip_idx = pauli_x_idx + pauli_y_idx # [num_terms, input_dim] + # self.flip_idx = flip_idx + del pauli_x_idx + # only the entry value with y or z pauli is multiplied + self.select_idx = pauli_y_idx + pauli_z_idx + del pauli_y_idx + del pauli_z_idx + unique_flips, unique_indices = np.unique(np.array(flip_idx), axis=0, return_inverse=True) + self.unique_flips = torch.tensor(unique_flips) + self.unique_indices = torch.tensor(unique_indices) + self.unique_num_terms = self.unique_flips.shape[0] + self.sampler = Alias_Sampler(self.probabilities, total_unique_samples) + self.sample_count = sample_count + self.generate_sample_hamiltonian() + self.reset_prob = reset_prob + + def generate_sample_hamiltonian(self, device='cpu'): + ''' + Generates a new stochastic estimate of the Hamiltonian from coefficient distribution + Args: + device: device on which to place relevant attribute tensors + ''' + samples, counts = self.sampler(self.sample_count, device) + counts = (counts/torch.sum(counts)).to(self.coefficients.device) # convert counts to weights + self.coefficients_batch = self.coefficients[samples]*(counts.reshape(-1,1)) + self.select_idx_batch = self.select_idx[samples] + flips = self.unique_flips[self.unique_indices[samples]] + self.unique_flips_batch, self.unique_indices_batch = torch.unique(flips, return_inverse=True, dim=0) + self.unique_flips_batch = 1 - 2*self.unique_flips_batch.unsqueeze(0) + self.unique_num_terms_batch = self.unique_flips_batch.shape[1] + self.num_terms_batch = self.coefficients_batch.shape[0] + + def compute_local_energy(self, x, model): + ''' + Compute local energy values of estimated Hamiltonian w.r.t. batch of qubit spin configurations and an ansatz model. Randomly resamples the Hamiltonian estimate. 
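+ The local energy of a configuration x is E_loc(x) = sum_{x'} <x|H|x'> psi(x') / psi(x); the code below evaluates the bit-flipped configurations x' in minibatches and combines the resulting amplitude ratios with the Pauli matrix elements of Eq. B3 in arXiv:1909.12852, restricted here to the currently sampled subset of terms.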
+ Args: + x: qubit spin configurations + model: NQS ansatz + Returns: + local_energy: local energy values (detached from computational graph) + log_psi: logarithms of ansatz statevector entries (attached to computational graph) + ''' + # see appendix B of https://arxiv.org/pdf/1909.12852.pdf + # x [bs, input_dim] + bs = x.shape[0] + # Randomly regenerate sample Hamiltonian batch + if torch.rand(1) < self.reset_prob: + self.generate_sample_hamiltonian(model.device) + process_bs = int(np.ceil(self.unique_num_terms_batch*bs/self.flip_bs)) + # first obtain model output for input batch # [bs, 2] + log_psi = model(x) + with torch.no_grad(): + # log_psi_k comprises model outputs corresponding to unique flips of Hamiltonian batch for eachn sample [bs*unique_num_terms_batch, 2] + log_psi_k = torch.zeros(bs*self.unique_num_terms_batch, 2).to(x.device) + x_k = (x.unsqueeze(1) * self.unique_flips_batch).reshape(-1, self.input_dim) # [bs*unique_num_terms_batch, input_dim] + # further batching is done to conserve GPU memory footprint + for i in range(self.flip_bs): + log_psi_k[process_bs*i:process_bs*(i + 1)] = model(x_k[process_bs*i:process_bs*(i + 1)]) + if len(log_psi.shape) == 1: # if not complex + log_psi_k = torch.stack([log_psi_k, torch.zeros_like(log_psi_k).to(log_psi_k.device)], dim=-1) + log_psi = torch.stack([log_psi, torch.zeros_like(log_psi).to(log_psi.device)], dim=-1) + log_psi_k = log_psi_k.reshape(bs, self.unique_num_terms_batch, 2) # [bs, unique_num_terms_batch, 2] + log_psi_k = log_psi_k[:, self.unique_indices_batch] # [bs, num_terms_batch, 2] + ratio = exp(log_psi_k-log_psi.unsqueeze(1)).detach() # [bs, num_terms_batch, 2] + # compute matrix element + # Eq. B3 + part2 = (x.unsqueeze(1).repeat(1, self.num_terms_batch, 1) * self.select_idx_batch.unsqueeze(0) + (1-self.select_idx_batch).unsqueeze(0)).prod(-1) # [bs, num_terms] + mtx_k = torch.stack((part2, torch.zeros_like(part2)), dim=-1) + # total local energy + local_energy = scalar_mult(self.coefficients_batch.unsqueeze(0), scalar_mult(mtx_k, ratio)).sum(1) # [bs, 2] + return self.norm_constant*local_energy.detach(), log_psi + + def set_device(self, device): + ''' + Sets device of all relevant object tensors to 'device' + ''' + self.coefficients = self.coefficients.to(device) + self.num_pauli_y = self.num_pauli_y.to(device) + self.select_idx = self.select_idx.to(device) + self.unique_flips = self.unique_flips.to(device) + self.unique_indices = self.unique_indices.to(device) + self.coefficients_batch = self.coefficients_batch.to(device) + self.select_idx_batch = self.select_idx_batch.to(device) + self.unique_flips_batch = self.unique_flips_batch.to(device) + self.unique_indices_batch = self.unique_indices_batch.to(device) + +class Alias_Sampler: + ''' + A generic sampler module based on Walker's Alias Method: https://en.wikipedia.org/wiki/Alias_method + Generates lookup tables for a given probability vector, allowing for constant time sampling + ''' + def __init__(self, probabilities: torch.Tensor, num_unique: int): + self.U_table, self.V_table, self.num_outcomes = self.generate_alias_tables(probabilities) + self.num_unique = num_unique + + def __call__(self, num_samples: int, device: str='cpu') -> [torch.Tensor, torch.Tensor]: + ''' + Use internal lookup tables to generate 'num_samples' nonunique samples + Args: + num_samples: number of nonunique samples to generate + Returns: + unique_samples: unique sampled Pauli strings + counts: number of times each unique sample occurred in sampling + ''' + uniform_samples = 
torch.rand(num_samples) + intermediate_product = self.num_outcomes*uniform_samples + samples = torch.floor(intermediate_product).int() + tail = intermediate_product - samples + for i in range(num_samples): + if tail[i] < self.U_table[samples[i]]: + continue + else: + samples[i] = self.V_table[samples[i]] + unique_samples, counts = torch.unique(samples, sorted=False, return_counts=True) + counts, indices = torch.sort(counts, descending = True) + unique_samples = unique_samples[indices] + if unique_samples.shape[0] > self.num_unique: + return unique_samples[:self.num_unique], counts[:self.num_unique] + else: + return unique_samples.to(device), counts.to(device) + + def generate_alias_tables(self, probabilities: torch.Tensor) -> [torch.Tensor, torch.Tensor, int]: + ''' + Generate necessary Alias method lookup tables from input probability vector + Args: + probabilities: input probability vector + Returns: + U_table: probability tables utilized by method + V_table: alias tables utilized by method + probabilities.shape[0]: length of probability vector + ''' + U_table = probabilities.shape[0]*probabilities + V_table = -1*torch.ones(probabilities.shape[0], dtype=torch.int) + overfull = [] + underfull = [] + for i in range(len(U_table)): + if U_table[i] > 1: + overfull.append(i) + elif U_table[i] < 1 and V_table[i] == -1: + underfull.append(i) + else: + continue + while len(overfull) > 0 and len(underfull) > 0: + i = overfull.pop(0) + j = underfull.pop(0) + V_table[j] = i + U_table[i] = U_table[i] + U_table[j] - 1 + if U_table[i] > 1: + overfull.append(i) + elif U_table[i] < 1 and V_table[i] == -1: + underfull.append(i) + else: + continue + if len(overfull) > 0 or len(underfull) > 0: + extras = overfull + underfull + while len(extras) > 0: + i = extras.pop(0) + U_table[i] = 1 + return U_table, V_table, probabilities.shape[0] diff --git a/examples/neural_quantum_states/src/training/__init__.py b/examples/neural_quantum_states/src/training/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/examples/neural_quantum_states/src/training/__pycache__/__init__.cpython-312.pyc b/examples/neural_quantum_states/src/training/__pycache__/__init__.cpython-312.pyc new file mode 100644 index 0000000..4ceca19 Binary files /dev/null and b/examples/neural_quantum_states/src/training/__pycache__/__init__.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/training/__pycache__/__init__.cpython-39.pyc b/examples/neural_quantum_states/src/training/__pycache__/__init__.cpython-39.pyc new file mode 100644 index 0000000..09f43f2 Binary files /dev/null and b/examples/neural_quantum_states/src/training/__pycache__/__init__.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/training/__pycache__/evaluate.cpython-312.pyc b/examples/neural_quantum_states/src/training/__pycache__/evaluate.cpython-312.pyc new file mode 100644 index 0000000..0a48f24 Binary files /dev/null and b/examples/neural_quantum_states/src/training/__pycache__/evaluate.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/training/__pycache__/evaluate.cpython-39.pyc b/examples/neural_quantum_states/src/training/__pycache__/evaluate.cpython-39.pyc new file mode 100644 index 0000000..dbcbb39 Binary files /dev/null and b/examples/neural_quantum_states/src/training/__pycache__/evaluate.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/training/__pycache__/scheduler.cpython-39.pyc b/examples/neural_quantum_states/src/training/__pycache__/scheduler.cpython-39.pyc new file 
mode 100644 index 0000000..f6b4cd9 Binary files /dev/null and b/examples/neural_quantum_states/src/training/__pycache__/scheduler.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/training/__pycache__/train.cpython-312.pyc b/examples/neural_quantum_states/src/training/__pycache__/train.cpython-312.pyc new file mode 100644 index 0000000..8a8ca04 Binary files /dev/null and b/examples/neural_quantum_states/src/training/__pycache__/train.cpython-312.pyc differ diff --git a/examples/neural_quantum_states/src/training/__pycache__/train.cpython-39.pyc b/examples/neural_quantum_states/src/training/__pycache__/train.cpython-39.pyc new file mode 100644 index 0000000..99858d9 Binary files /dev/null and b/examples/neural_quantum_states/src/training/__pycache__/train.cpython-39.pyc differ diff --git a/examples/neural_quantum_states/src/training/evaluate.py b/examples/neural_quantum_states/src/training/evaluate.py new file mode 100644 index 0000000..2d3752d --- /dev/null +++ b/examples/neural_quantum_states/src/training/evaluate.py @@ -0,0 +1,92 @@ +import numpy as np +import torch +import torch.distributed as dist + +from src.complex import real, imag, exp, scalar_mult, conjugate + +def test(model: torch.nn.Module, hamiltonian: torch.nn.Module, batch_size: int, num_samples: int, mini_bs: int, global_rank: int, world_size: int) -> [torch.Tensor, float, float, int, torch.Tensor]: + ''' + Estimates the expected value of the problem Hamiltonian in the state encoded by the NQS ansatz through parallelized autoregressive sampling + Args: + model: NQS ansatz + hamiltonian: Problem Hamiltonian + batch_size: maximum number of unique samples per GPU for sampling + num_samples: number of non-unique samples to use for expectation estimation + mini_bs: largest number of unique samples each GPU is expected to process at one time + global_rank: identifying index of each GPU across all GPUs + world_size: total number of GPUs + Returns: + mean: Sample mean estimating Hamiltonian expectation value + score: Real part of sample mean (Imaginary part should average to zero; fluctuations ignored for ease of calculation + std: Score root of sample variance + num_uniq: number of unique samples obtained + entropy: sample mean estimating entropy of ansatz distribution + ''' + device = list(model.parameters())[0].device + model.eval() + with torch.no_grad(): + # Generate samples + if world_size > 1: + samples = model.module.sample(batch_size*world_size, num_samples) + else: + samples = model.sample(batch_size*world_size, num_samples) + if samples[0].shape[0] < world_size: + repeat_count = np.ceil(world_size/samples[0].shape[0]).astype(np.int64) + samples[0] = samples[0].repeat(repeat_count,1) + samples[1] = samples[1].repeat(repeat_count) + if samples[0].shape[0] < batch_size*world_size: + batch_size = np.ceil(samples[0].shape[0] / world_size).astype(np.int64) + if batch_size * (world_size - 1) >= samples[0].shape[0]: + batch_size -= 1 + if isinstance(samples, list): + samples, count = samples + weight = count / count.sum() + else: + weight = 1 / samples.shape[0] * torch.ones([samples.shape[0]]) + if isinstance(samples, (np.ndarray, np.generic)): + samples = torch.tensor(samples).float().to(device) + num_uniq = samples.shape[0] + partition = world_size - global_rank - 1 + if global_rank == 0: + samples = samples[partition*batch_size:] + weight = weight[partition*batch_size:] + else: + samples = samples[partition*batch_size:(partition+1)*batch_size] + weight = 
weight[partition*batch_size:(partition+1)*batch_size] + # Calculate number of inner iterations to process all samples + inner_iter = torch.tensor(np.ceil(samples.shape[0]/mini_bs).astype(np.int64)).to(global_rank) + if world_size > 1: + dist.all_reduce(inner_iter, op=dist.ReduceOp.MAX) + sbs = torch.ceil(samples.shape[0]/inner_iter).int() + scores = [] + log_psi_batch = [] + # Calculate local energy values in minibatches + for i in range(inner_iter): + if i*sbs < samples.shape[0]: + score, log_psi = hamiltonian.compute_local_energy(samples[i*sbs:(i+1)*sbs], model) + scores.append(score.float()) + log_psi_batch.append(log_psi.float()) + else: + continue + scores = torch.cat(scores, dim=0) + weight = weight.to(scores.device) + log_psi_batch = torch.cat(log_psi_batch, dim=0) + if world_size > 1: # Need to collect calculated values in case of GPU parallelization + mean = (scores * weight.unsqueeze(-1)).sum(0) + dist.all_reduce(mean, op=dist.ReduceOp.SUM) + score = real(mean) + std = ((real(scores) - score)**2 * weight).sum() + dist.all_reduce(std, op=dist.ReduceOp.SUM) + std = std.sqrt() + averaged_log_psi = (log_psi_batch * weight.unsqueeze(-1)).sum(0) + dist.all_reduce(averaged_log_psi, op=dist.ReduceOp.SUM) + averaged_log_psi[1]=0.0 + entropy = -2*averaged_log_psi + else: + mean = (scores * weight.unsqueeze(-1)).sum(0) + score = real(mean) + std = ((real(scores) - score)**2 * weight).sum().sqrt() + averaged_log_psi = (log_psi_batch * weight.unsqueeze(-1)).sum(0) + averaged_log_psi[1]=0.0 + entropy = -2*averaged_log_psi + return mean, score.item(), std.item(), num_uniq, entropy diff --git a/examples/neural_quantum_states/src/training/scheduler.py b/examples/neural_quantum_states/src/training/scheduler.py new file mode 100644 index 0000000..0dd75a4 --- /dev/null +++ b/examples/neural_quantum_states/src/training/scheduler.py @@ -0,0 +1,263 @@ +import warnings +from collections import Counter +from torch.optim.lr_scheduler import _LRScheduler +from torch.optim import lr_scheduler +import torch +import math +from typing import Callable + +class VNAScheduler: # A custom scheduler class for scheduling the regularization parameter for variational neural annealing (VNA) + def __init__(self, num_epochs: int, initial_constant: float=1.0, schedule_choice: str='none'): + self.num_epochs = num_epochs + self.initial_constant = initial_constant + self.get_constant = self.get_scheduler(schedule_choice) + + def __call__(self, epoch: int) -> float: # Wrapper function of getting the regularization constant at 'epoch' iterations + return self.get_constant(epoch) + + def polynomial_scheduler(self, epoch: int) -> float: # Annealing is polynomial with respect to training fraction + train_frac = epoch/self.num_epochs + if train_frac <=self.decay_start: + return self.initial_constant + elif train_frac > self.decay_end: + return 0.0 + else: + return ((train_frac - self.decay_end)/(self.decay_start - self.decay_end))**self.power + + def exponential_scheduler(self, epoch: int) -> float: # Annealing is exponential with respect to training fraction + return self.initial_constant*self.decay_rate**epoch + + def cosine_scheduler(self, epoch: int) -> float: # Annealing decreases as a cosine wave with respect to training fraction + eps = 0.0 + train_frac = epoch/self.num_epochs + if train_frac <= self.decay_start: + return self.initial_constant + eps + elif train_frac >= self.decay_end: + return eps + else: + return self.initial_constant*0.5*(1.0 + math.cos((math.pi/(self.decay_end - self.decay_start))*(train_frac - 
self.decay_start))) + eps + + def cosine_tent_scheduler(self, epoch: int) -> float: # Annealing increases and then decreases as a symmetric cosine curve + train_frac = epoch/self.num_epochs + if train_frac <= self.tent_start or train_frac >= self.tent_end: + return 0.0 + else: + return self.initial_constant*0.5*(1.0 - math.cos((2*math.pi/(self.tent_end - self.tent_start))*(train_frac - self.tent_start))) + + def trapezoid_scheduler(self, epoch: int) -> float: # Annealing increases linearly, holds constant, and then decreases linearly + train_frac = epoch/self.num_epochs + rate = min(1.0,(train_frac - self.up_start)/(self.up_end - self.up_start), (self.down_end - train_frac)/(self.down_end - self.down_start)) + return max(0.0,self.initial_constant*rate) + + def cutoff_scheduler(self, epoch: int) -> float: # Annealing holds constant and then vanishes instantly + if epoch/self.num_epochs <= self.cutoff: + return self.initial_constant + else: + return 0.0 + + def get_scheduler(self, choice: str) -> Callable: # Sets scheduler function and assigns relevant attributes + if choice in ['none']: + self.cutoff = -1.0 + return self.cutoff_scheduler + elif choice in ['const']: + self.cutoff = 2.0 + return self.cutoff_scheduler + elif choice in ['cutoff']: + self.cutoff = 0.6 + return self.cutoff_scheduler + elif choice in ['exponential']: + self.decay_rate = 0.00001**(1/self.num_epochs) + return self.exponential_scheduler + elif choice in ['cosine']: + self.decay_start = 0.05 + self.decay_end = 0.90 + return self.cosine_scheduler + elif choice in ['cos_tent']: + self.tent_start = 0.1 + self.tent_end = 0.9 + return self.cosine_tent_scheduler + elif choice in ['trapezoid']: + self.up_start = 0.1 + self.up_end = 0.4 + self.down_start = 0.6 + self.down_end = 0.8 + return self.trapezoid_scheduler + elif choice in ['linear', 'quadratic', 'cubic', 'quartic', 'quintic']: + self.decay_start = 0.05 + self.decay_end = 1.0 + if choice == 'linear': + self.power = 1 + elif choice == 'quadratic': + self.power = 2 + elif choice == 'cubic': + self.power = 3 + elif choice == 'quartic': + self.power = 4 + elif choice == 'quintic': + self.power = 5 + return self.polynomial_scheduler + +class TrapezoidLR(_LRScheduler): + # Warm up before 1/4 epochs and cool down after 3/4 epochs + def __init__(self, optimizer, milestones, last_epoch=-1): + self.milestones = Counter(milestones) + super(TrapezoidLR, self).__init__(optimizer, last_epoch) + + def get_lr(self): + if not self._get_lr_called_within_step: + warnings.warn("To get the last learning rate computed by the scheduler, " + "please use `get_last_lr()`.", UserWarning) + return [self.piecewise(self.last_epoch, base_lr) for base_lr in self.base_lrs] + + def piecewise(self, x, lr): + milestones = list(sorted(self.milestones.elements())) + # start with 1 + x = x + 1 + if x <= milestones[0]: + return lr/milestones[0]*x + elif (x <= milestones[1]) and (x > milestones[0]): + return lr + elif (x <= milestones[2]) and (x > milestones[1]): + return lr*(milestones[2]-x)/(milestones[2]-milestones[1]) + else: + return 0 + +# https://github.com/katsura-jp/pytorch-cosine-annealing-with-warmup/tree/master +class CosineAnnealingWarmupRestarts(_LRScheduler): + """ + optimizer (Optimizer): Wrapped optimizer. + first_cycle_steps (int): First cycle step size. + cycle_mult(float): Cycle steps magnification. Default: 1. + max_lr(float): First cycle's max learning rate. Default: 0.1. + min_lr(float): Min learning rate. Default: 0.001. + warmup_steps(int): Linear warmup step size. Default: 0. 
+ gamma(float): Decrease rate of max learning rate by cycle. Default: 1. + last_epoch (int): The index of last epoch. Default: -1. + """ + + def __init__(self, + optimizer : torch.optim.Optimizer, + first_cycle_steps : int, + cycle_mult : float = 1., + max_lr : float = 0.1, + min_lr : float = 0.001, + warmup_steps : int = 0, + gamma : float = 1., + last_epoch : int = -1 + ): + assert warmup_steps < first_cycle_steps + + self.first_cycle_steps = first_cycle_steps # first cycle step size + self.cycle_mult = cycle_mult # cycle steps magnification + self.base_max_lr = max_lr # first max learning rate + self.max_lr = max_lr # max learning rate in the current cycle + self.min_lr = min_lr # min learning rate + self.warmup_steps = warmup_steps # warmup step size + self.gamma = gamma # decrease rate of max learning rate by cycle + + self.cur_cycle_steps = first_cycle_steps # first cycle step size + self.cycle = 0 # cycle count + self.step_in_cycle = last_epoch # step size of the current cycle + + super(CosineAnnealingWarmupRestarts, self).__init__(optimizer, last_epoch) + + # set learning rate min_lr + self.init_lr() + + def init_lr(self): + self.base_lrs = [] + for param_group in self.optimizer.param_groups: + param_group['lr'] = self.min_lr + self.base_lrs.append(self.min_lr) + + def get_lr(self): + if self.step_in_cycle == -1: + return self.base_lrs + elif self.step_in_cycle < self.warmup_steps: + return [(self.max_lr - base_lr)*self.step_in_cycle / self.warmup_steps + base_lr for base_lr in self.base_lrs] + else: + return [base_lr + (self.max_lr - base_lr) \ + * (1 + math.cos(math.pi * (self.step_in_cycle-self.warmup_steps) \ + / (self.cur_cycle_steps - self.warmup_steps))) / 2 + for base_lr in self.base_lrs] + + def step(self, epoch=None): + if epoch is None: + epoch = self.last_epoch + 1 + self.step_in_cycle = self.step_in_cycle + 1 + if self.step_in_cycle >= self.cur_cycle_steps: + self.cycle += 1 + self.step_in_cycle = self.step_in_cycle - self.cur_cycle_steps + self.cur_cycle_steps = int((self.cur_cycle_steps - self.warmup_steps) * self.cycle_mult) + self.warmup_steps + else: + if epoch >= self.first_cycle_steps: + if self.cycle_mult == 1.: + self.step_in_cycle = epoch % self.first_cycle_steps + self.cycle = epoch // self.first_cycle_steps + else: + n = int(math.log((epoch / self.first_cycle_steps * (self.cycle_mult - 1) + 1), self.cycle_mult)) + self.cycle = n + self.step_in_cycle = epoch - int(self.first_cycle_steps * (self.cycle_mult ** n - 1) / (self.cycle_mult - 1)) + self.cur_cycle_steps = self.first_cycle_steps * self.cycle_mult ** (n) + else: + self.cur_cycle_steps = self.first_cycle_steps + self.step_in_cycle = epoch + + self.max_lr = self.base_max_lr * (self.gamma**self.cycle) + self.last_epoch = math.floor(epoch) + for param_group, lr in zip(self.optimizer.param_groups, self.get_lr()): + param_group['lr'] = lr + +# https://github.com/ildoonet/pytorch-gradual-warmup-lr/blob/6b5e8953a80aef5b324104dc0c2e9b8c34d622bd/warmup_scheduler/scheduler.py#L5 +class GradualWarmupScheduler(_LRScheduler): + def __init__(self, optimizer, total_epoch, after_scheduler): + self.total_epoch = total_epoch + self.after_scheduler = after_scheduler + self.finished = False + super(GradualWarmupScheduler, self).__init__(optimizer) + + def get_lr(self): + if self.last_epoch > self.total_epoch: + if self.after_scheduler: + if not self.finished: + self.after_scheduler.base_lrs = [base_lr for base_lr in self.base_lrs] + self.finished = True + return self.after_scheduler.get_last_lr() + return [base_lr for 
base_lr in self.base_lrs] + return [base_lr * (float(self.last_epoch) / self.total_epoch) for base_lr in self.base_lrs] + + def step(self, epoch=None): + if self.finished and self.after_scheduler: + if epoch is None: + self.after_scheduler.step(None) + else: + self.after_scheduler.step(epoch - self.total_epoch) + self._last_lr = self.after_scheduler.get_last_lr() + else: + return super(GradualWarmupScheduler, self).step(epoch) + + +def get_scheduler(sche_name: str, optimizer: torch.optim.Optimizer, learning_rate: float, epochs: int, min_rate: float=5e-6) -> _LRScheduler: + if sche_name == 'decay': + scheduler = lr_scheduler.MultiStepLR(optimizer, milestones=[int(epochs/7*3), int(epochs/7*5)], gamma=0.1) + # warm up for the first 1/10 of total epochs + scheduler = GradualWarmupScheduler(optimizer, int(epochs/10), scheduler) + elif sche_name == 'cosine': + scheduler = CosineAnnealingWarmupRestarts(optimizer, first_cycle_steps=int(epochs), max_lr=learning_rate, min_lr=5e-5, warmup_steps=0, gamma=1.0) + elif sche_name == 'cosine_warmup': + scheduler = CosineAnnealingWarmupRestarts(optimizer, first_cycle_steps=int(0.96*epochs), max_lr=learning_rate, min_lr=5e-8, warmup_steps=int(epochs*0.04), gamma=1.0) + elif sche_name == 'cosine_warmrestart': + scheduler = CosineAnnealingWarmupRestarts(optimizer, first_cycle_steps=int(epochs*0.1), cycle_mult=1.3310577386373, max_lr=learning_rate, min_lr=5e-5, warmup_steps=int(epochs*0.04), gamma=0.8) + elif sche_name == 'cyclic': + scheduler = lr_scheduler.CyclicLR(optimizer, base_lr=(learning_rate/100), + max_lr=learning_rate, step_size_up=int(epochs*2/5), + step_size_down=int(epochs*3/5), cycle_momentum=False) + elif sche_name == 'trap': + scheduler = TrapezoidLR(optimizer, milestones=[int(epochs/4), int(epochs*3/4), epochs]) + elif sche_name == 'const': + scheduler = lr_scheduler.MultiStepLR(optimizer, milestones=[int(epochs/2)], gamma=1.0) + # warm up for first 1% of total epochs + scheduler = GradualWarmupScheduler(optimizer, int(0.01*epochs), scheduler) + else: + raise ValueError("sche_name not recognized.") 
+ return scheduler diff --git a/examples/neural_quantum_states/src/training/train.py b/examples/neural_quantum_states/src/training/train.py new file mode 100644 index 0000000..65a0237 --- /dev/null +++ b/examples/neural_quantum_states/src/training/train.py @@ -0,0 +1,268 @@ +import os +import time +import logging +from tqdm import trange +import numpy as np +import torch +import torch.nn as nn +import torch.distributed as dist +import argparse +from torch.nn.parallel import DistributedDataParallel +from tensorboardX import SummaryWriter +from tangelo.algorithms.classical import CCSDSolver, FCISolver + +from src.data.data_loader import load_data +from src.training.scheduler import get_scheduler, VNAScheduler +from src.util import get_optimizer +from src.models.base import get_model +from src.complex import real, imag, conjugate, scalar_mult + +from .evaluate import test +from ..objective.hamiltonian import get_hamiltonian + +def train_one_batch(model: nn.Module, hamiltonian: nn.Module, optimizer: torch.optim.Optimizer, scheduler: torch.optim.lr_scheduler._LRScheduler, batch_size: int, num_samples: int, mini_bs: int, global_rank: int, world_size: int, nmlzr: torch.Tensor, epoch: int, num_epochs: int, reg_const: float, entropy: torch.Tensor) -> dict: + ''' + Calculates local energies, estimates model gradients, and performs one gradient update + Args: + model: NQS Ansatz + hamiltonian: Problem Hamiltonian + optimizer: choice of optimizer for training + scheduler: choice of learning rate scheduler for the optimizer + batch_size: maximum number of unique samples per GPU + num_samples: number of non-unique samples to generate in training + mini_bs: maximum number of unique samples to process on each GPU at one time + global_rank: global index among all GPUs + world_size: total number of GPUs + nmlzr: stochastic estimate of Hamiltonian expectation value + epoch: current training iteration + num_epochs: total number of training epochs + reg_const: regularization constant for neural annealing + entropy: stochastic estimate of ansatz entropy + Returns: + losses: dictionary containing loss value information for logging purposes + ''' + model.train() + losses = {} + device = list(model.parameters())[0].device + # collect samples + if world_size > 1: + samples = model.module.sample(batch_size*world_size, num_samples) + else: + samples = model.sample(batch_size*world_size, num_samples) + if samples[0].shape[0] < world_size: + repeat_count = np.ceil(world_size/samples[0].shape[0]).astype(np.int64) + samples[0] = samples[0].repeat(repeat_count,1) + samples[1] = samples[1].repeat(repeat_count) + if samples[0].shape[0] < batch_size*world_size: + batch_size = np.ceil(samples[0].shape[0] / world_size).astype(np.int64) + if batch_size * (world_size - 1) >= samples[0].shape[0]: + batch_size -= 1 + if isinstance(samples, list): + samples, count = samples + weight = count / count.sum() + else: + bs = samples.shape[0] + samples = torch.tensor(samples).float().to(device) + weight = torch.ones([bs]).to(samples.device) / bs + # get the corresponding batch for each GPU device + partition = world_size - global_rank - 1 + if global_rank == 0: + samples = samples[partition*batch_size:] + weight = weight[partition*batch_size:] + else: + samples = samples[partition*batch_size:(partition+1)*batch_size] + weight = weight[partition*batch_size:(partition+1)*batch_size] + total_local_energies = [] + total_losses = [] + inner_iter = torch.tensor(np.ceil(samples.shape[0]/mini_bs).astype(np.int64)).to(global_rank) + if world_size 
> 1: + dist.all_reduce(inner_iter, op=dist.ReduceOp.MAX) + sbs = torch.ceil(samples.shape[0]/inner_iter).int() + # accumulate the gradient of each small batch so the contributions sum into a single parameter update + for i in range(inner_iter): + if i*sbs >= samples.shape[0]: + continue + else: + sbs_samples = samples[i*sbs:(i+1)*sbs] + sbs_weight = weight[i*sbs:(i+1)*sbs] + # train + local_energies, log_psi = hamiltonian.compute_local_energy(sbs_samples, model) + log_psi_conj = conjugate(log_psi) + wt = sbs_weight.unsqueeze(-1) + nmlzr[1] = 0.0 + if reg_const > 0.0: + loss_grad_diff = (local_energies - nmlzr).detach() + entropy_grad_sum = torch.zeros_like(log_psi).to(device) + entropy_grad_sum[:,0] = 1.0 + 2.0*log_psi[:,0] + loss = 2 * (real(scalar_mult(log_psi_conj, (loss_grad_diff + reg_const * (entropy_grad_sum + entropy)).detach()) * wt)).sum() + else: + loss = 2 * (real(scalar_mult(log_psi_conj, (local_energies - nmlzr).detach()) * wt)).sum() + loss.backward() + total_local_energies.append(local_energies) + total_losses.append(loss.item()) + + torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) + optimizer.step() + scheduler.step() + # print("Learning rate:", optimizer.param_groups[0]['lr']) + optimizer.zero_grad() + losses['loss'] = np.sum(total_losses) + local_energies = torch.cat(total_local_energies, axis=0) + mean_energy = (local_energies * weight.unsqueeze(-1)).sum(0) + if global_rank == 0: + score = real(mean_energy.detach()) + scores = real(local_energies.detach()) + std = ((scores - score)**2 * weight).sum().sqrt() + losses['score'] = score.item() + losses['std'] = std.item() + losses['num_uniq'] = samples.shape[0] + return losses + +def train(cfg: argparse.Namespace, local_rank: int, global_rank: int) -> [[float, float, int], float, dict]: + ''' + Retrieves the model, hamiltonian, and all relevant submodules for training, then performs training according to user-specified settings + Args: + cfg: config flags specifying training settings + local_rank: identifying index among GPUs from the same compute node + global_rank: identifying index among all GPUs + Returns: + [best_mean, best_std, best_num_samples]: best expectation value estimate and associated sample standard deviation and number of samples + time_elapsed: the amount of time that elapsed during training + dic: dictionary of loss values saved periodically during training + ''' + # set hyper-parameters + logger_dir = cfg.MISC.DIR + save_dir = cfg.MISC.SAVE_DIR + device = torch.device('cuda:{}'.format(local_rank) if (cfg.DDP.WORLD_SIZE > 0) else 'cpu') + world_size = cfg.DDP.WORLD_SIZE + # train + lr = cfg.TRAIN.LEARNING_RATE + weight_decay = cfg.TRAIN.WEIGHT_DECAY + num_epochs = cfg.TRAIN.NUM_EPOCHS + bs = cfg.TRAIN.BATCH_SIZE + opt_name = cfg.TRAIN.OPTIMIZER_NAME + sche_name = cfg.TRAIN.SCHEDULER_NAME + retest_interval = cfg.TRAIN.RETEST_INTERVAL + entropy_scheduler = VNAScheduler(num_epochs, cfg.TRAIN.ANNEALING_COEFFICIENT, cfg.TRAIN.ANNEALING_SCHEDULER) + mini_bs = cfg.TRAIN.MINIBATCH_SIZE + # Maximum number of non-unique samples to generate during autoregressive sampling + max_num_samples = cfg.DATA.MAX_NUM_SAMPLES + # model + model_name = cfg.MODEL.MODEL_NAME + model_load_path = cfg.EVAL.MODEL_LOAD_PATH + made_depth = cfg.MODEL.MADE_DEPTH + made_width = cfg.MODEL.MADE_WIDTH + embedding_dim = cfg.MODEL.EMBEDDING_DIM + attention_heads = cfg.MODEL.ATTENTION_HEADS + feedforward_dim = cfg.MODEL.FEEDFORWARD_DIM + transformer_layers = cfg.MODEL.TRANSFORMER_LAYERS + temperature = cfg.MODEL.TEMPERATURE + # load data + data = load_data(cfg, 
global_rank, world_size) + num_sites = data['num_sites'] + assert num_sites == 2*data['molecule'].n_active_mos + # num_spin_up/down records number of unoccupied orbitals instead of occupied ones, for legacy reasons + num_spin_up = data['molecule'].n_active_mos - (data['molecule'].mo_occ > 0).sum() + num_spin_down = data['molecule'].n_active_mos - (data['molecule'].mo_occ > 1).sum() + logging.info('Num of up/down spins {}/{}'.format(num_spin_up, num_spin_down)) + # load model + model = get_model(model_name, device, + print_model_info=True, + num_sites=num_sites, + made_width=made_width, + made_depth=made_depth, + embedding_dim = embedding_dim, + nhead = attention_heads, + dim_feedforward = feedforward_dim, + num_layers = transformer_layers, + num_spin_up=num_spin_up, + num_spin_down=num_spin_down, + temperature=temperature, + ) + if model_load_path: + model.load(model_load_path) + # set up training + choice = cfg.DATA.HAMILTONIAN_CHOICE + # Append Hamiltonian parameters to data dictionary + data.update({'sample_count': cfg.DATA.HAMILTONIAN_BATCH_SIZE, 'total_unique_samples': cfg.DATA.HAMILTONIAN_NUM_UNIQS, 'reset_prob': cfg.DATA.HAMILTONIAN_RESET_PROB, 'flip_bs': cfg.DATA.HAMILTONIAN_FLIP_BATCHES}) + hamiltonian = get_hamiltonian(choice, data) + hamiltonian.set_device(device) + optimizer = get_optimizer(opt_name, model, lr, weight_decay) + scheduler = get_scheduler(sche_name, optimizer, lr, num_epochs) + if world_size > 1: + print("The local rank here is {}".format(local_rank)) + model = DistributedDataParallel(model, device_ids=[local_rank]) + # tensorboard + if global_rank == 0: + tensorboard = SummaryWriter(log_dir=logger_dir) + tensorboard.add_text(tag='argument', text_string=str(cfg.__dict__)) + num_model_params = sum(p.numel() for p in model.parameters() if p.requires_grad) + if cfg.DATA.FCI: + fci_result = FCISolver(data['molecule']).simulate() + else: + fci_result = 'N/A' + print('Trainable Model Parameter Total: {}\nHartree--Fock Energy: {:8f}\nCCSD Energy: {:8f}\nFCI Energy: {}'.format(num_model_params, data['molecule'].mf_energy, CCSDSolver(data['molecule']).simulate(), fci_result)) + tensorboard.add_text(tag='test_info', text_string='Qubits: {}, Electrons: {}, Hamiltonian Terms: {}, Hartree-Fock Energy: {:8f}'.format(num_sites, data['molecule'].n_electrons, hamiltonian.num_terms, data['molecule'].mf_energy)) + # train + best_mean, best_std, best_num_samples = 0.0, 0.0, 1 + time_elapsed = 0.0 + dic = {'mean': [], 'std': [], 'num_uniq': [], 'entropy': [], 'num_samples': []} + if global_rank == 0: + progress_bar = trange(num_epochs, desc='Progress Bar', leave=True) + initial_samples = cfg.DATA.MIN_NUM_SAMPLES + sample_growth_ratio = (max_num_samples/initial_samples)**(1/(0.90*num_epochs)) + + for epoch in range(1, num_epochs + 1): + initial_samples *= sample_growth_ratio + num_samples = min(max_num_samples, int(initial_samples)) + # evaluation + retest = (epoch == 1) or (epoch%retest_interval == 0) or (epoch == num_epochs) or (epoch/num_epochs > 0.90) + if retest: + nmlzr, mean, std, num_uniq, entropy = test(model, hamiltonian, bs, num_samples, mini_bs, global_rank, world_size) + # train + model.train() + start_time = time.time() + reg_const = entropy_scheduler(epoch) + losses = train_one_batch(model, hamiltonian, optimizer, scheduler, bs, num_samples, mini_bs, global_rank, world_size, nmlzr, epoch, num_epochs, reg_const, entropy) + end_time = time.time() + time_elapsed += end_time - start_time + if global_rank == 0: + for key in losses: + 
tensorboard.add_scalar('train/{}'.format(key), losses[key].real, epoch) + # log + message = '[Epoch {}]'.format(epoch) + for key in losses: + if key in ['loss']: + message += ' {}: {:.6f}'.format(key, abs(losses[key])) + entropy_fig = real(entropy).item() + if global_rank == 0 and retest: + dic['mean'].append(mean) + dic['std'].append(std) + dic['num_uniq'].append(num_uniq) + dic['entropy'].append(entropy_fig) + dic['num_samples'].append(num_samples) + tensorboard.add_scalar('test/mean', mean, epoch) + tensorboard.add_scalar('test/std', std, epoch) + tensorboard.add_scalar('test/num_uniq', num_uniq, epoch) + tensorboard.add_scalar('test/num_samples', num_samples, epoch) + tensorboard.add_scalar('test/entropy', entropy_fig, epoch) + message += ', mean/std/entropy: {:.6f}/{:.6f}/{:.6f}, {} uniqs'.format(mean, std, entropy_fig, num_uniq) + if (mean + 1.65*std/np.sqrt(num_samples) < best_mean) and retest: + best_mean = mean + best_std = std + best_num_samples = num_samples + if global_rank == 0: + if world_size > 1: + model.module.save(os.path.join(save_dir, 'last_model.pth')) + else: + model.save(os.path.join(save_dir, 'last_model.pth')) + if global_rank == 0: + progress_bar.set_description(message) + progress_bar.refresh() # to show immediately the update + progress_bar.update(1) + if epoch % int(num_epochs/5) == 0: + logging.info(message) + if global_rank == 0: + tensorboard.close() + print('Best Score Across Training: {}, Confidence Interval: +/-{}\nSample Standard Deviation: {}, Number of Samples: {}\nAverage Number of Uniques: {}'.format(best_mean, 1.96*best_std/np.sqrt(best_num_samples), best_std, best_num_samples, sum(dic['num_uniq'])/len(dic['num_uniq']))) + return [best_mean, best_std, best_num_samples], time_elapsed, dic diff --git a/examples/neural_quantum_states/src/util.py b/examples/neural_quantum_states/src/util.py new file mode 100644 index 0000000..a2dc9d5 --- /dev/null +++ b/examples/neural_quantum_states/src/util.py @@ -0,0 +1,99 @@ +import os +import random +import logging +import numpy as np +import torch +import torch.optim as optim +import torch.cuda as cuda +import torch.distributed as dist +import argparse +from typing import Any + +def setup(global_rank: int, world_size: int, master_addr: str, master_port: int): + # Necessary setup steps to initialize DDP process group + os.environ['MASTER_ADDR'] = master_addr + os.environ['MASTER_PORT'] = str(master_port) + # initialize the process group + cuda.set_device(global_rank) + dist.init_process_group("nccl", rank=global_rank, world_size=world_size) + +def cleanup(): # Cleans up DDP process group + dist.destroy_process_group() + +def set_seed(seed: int): + # Sets random seeds for random, numpy, and torc + # Required for reproducibility and for accurate parallelization on multi-GPU setups, see https://pytorch.org/docs/master/notes/randomness.html + random.seed(seed) + np.random.seed(seed) + torch.manual_seed(seed) + if torch.cuda.is_available(): + torch.cuda.manual_seed(seed) + torch.cuda.manual_seed_all(seed) + torch.backends.cudnn.benchmark = False + torch.backends.cudnn.deterministic = True + +def folder_name_generator(cfg: argparse.Namespace, opts): # Generates names of logging folders based on config arguments + name_str = [] + for i,arg in enumerate(opts): + if i % 2 == 1: + name_str.append('{}'.format(arg)) + return '-'.join(name_str) + +def prepare_dirs(cfg: argparse.Namespace): # Prepares appropriate directories needed for training/logging + if not os.path.exists('./logger'): + os.makedirs('./logger') + if not 
os.path.exists(cfg.MISC.DIR): + os.makedirs(cfg.MISC.DIR) + # set logger + logging.basicConfig( + level=logging.INFO, + format="%(asctime)s [%(levelname)s] %(message)s", + handlers=[ + logging.FileHandler(os.path.join(cfg.MISC.DIR, 'debug.log')), + logging.StreamHandler() + ] + ) + +def write_file(file_name: str, content: Any, local_rank: int=0, overwrite: bool=False): + # Writes 'content' to the file at 'file_name', overwriting or appending as requested + if local_rank == 0: + mode = "w+" if overwrite else "a+" + with open(file_name, mode) as f: + f.write(content) + f.write("\n") + +def get_optimizer(opt_name: str, model: torch.nn.Module, learning_rate: float, weight_decay: float=0.0) -> optim.Optimizer: + ''' + Retrieves an optimizer based on user selection + Args: + opt_name: name of optimizer choice + model: NQS ansatz + learning_rate: initial learning rate + weight_decay: weight decay parameter for optimizer + Returns: + optim.Optimizer: PyTorch Optimizer object + ''' + if opt_name == 'adam': + optimizer = optim.Adam(model.parameters(), lr=learning_rate, betas=(0.9, 0.999), eps=1e-08, weight_decay=weight_decay) + elif opt_name == 'sgd': + optimizer = optim.SGD(model.parameters(), lr=learning_rate, weight_decay=weight_decay, momentum=0.9) + elif opt_name == 'adadelta': + optimizer = optim.Adadelta(model.parameters()) + elif opt_name == 'adamax': + optimizer = optim.Adamax(model.parameters(), lr=learning_rate) + elif opt_name == 'adagrad': + optimizer = optim.Adagrad(model.parameters(), lr=learning_rate) + elif opt_name == 'lbfgs': + optimizer = optim.LBFGS(model.parameters(), lr=learning_rate) + elif opt_name == 'adamw': + optimizer = optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay) + elif opt_name == 'nadam': + optimizer = optim.NAdam(model.parameters(), lr=learning_rate, weight_decay = weight_decay, decoupled_weight_decay=True) + elif opt_name == 'radam': + optimizer = optim.RAdam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)# decoupled_weight_decay=True) + else: + raise ValueError("opt_name not recognized.") + return optimizer
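For orientation only (this block is not part of the patch itself), the sketch below shows how the pieces added in this diff are intended to fit together: `get_optimizer` from `src/util.py` with `get_scheduler` and `VNAScheduler` from `src/training/scheduler.py`. The toy `nn.Linear` stand-in for the ansatz, the `'adam'`/`'cosine_warmup'`/`'linear'` choices, the dummy loss, and all hyperparameter values are illustrative assumptions; only the factory functions, their signatures, and the import paths are taken from the files in this diff.

```
# Minimal usage sketch under the assumptions stated above.
import torch
import torch.nn as nn

from src.util import get_optimizer
from src.training.scheduler import get_scheduler, VNAScheduler

model = nn.Linear(4, 4)  # stand-in for the NQS ansatz
num_epochs = 100
optimizer = get_optimizer('adam', model, learning_rate=1e-3, weight_decay=0.0)
scheduler = get_scheduler('cosine_warmup', optimizer, learning_rate=1e-3, epochs=num_epochs)
anneal = VNAScheduler(num_epochs, initial_constant=1.0, schedule_choice='linear')

for epoch in range(1, num_epochs + 1):
    reg_const = anneal(epoch)  # annealed weight of the entropy regularization term
    loss = model(torch.randn(8, 4)).pow(2).mean()  # placeholder for the variational loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # one learning-rate step per epoch, matching how train_one_batch steps the scheduler once per call
    if epoch % 25 == 0:
        print(epoch, reg_const, optimizer.param_groups[0]['lr'])
```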