LOOP-OPTIMIZATION: restarting and parfors

- Restarting protocol for both U-driven and T-driven lines - Hybrid Distributed/Restarting protocol for (U,T) scans > the external T-driven loop is parallelized in independent processes through the native parfor construct > there is no restart connecting each initial T value > this assumes a sufficiently low/high interaction, so that the noninteracting/atomic limit seeds are sufficient to achieve a good and fast convergence, and proceed from there with the sequential, restarting-aided U-driven lines > this *requires* the Parallel Computing Toolbox to be actually performed, but runs just fine if such plug-in is missing: the parfor just fall back to traditional sequential loops. A test with the current input state gives a full (U,T) diagram in less then five minutes, on a local machine with 4 physical i5 cores. Sounds good.
beddalumia · Feb 26, 2022 · 0ae3305 · 0ae3305
1 parent bb4fe75
commit 0ae3305
Show file tree

Hide file tree

Showing 4 changed files with 41 additions and 20 deletions.
diff --git a/ROADMAP.md b/ROADMAP.md
@@ -20,9 +20,9 @@
 - [x] Compute the Luttinger integral, as defined in `PRB 90 075150 (2014)`. Since it appears to be quantized _at very low temperatures_, it could become the definitive **flag** for _quantum_ phase diagrams; much better than Z or S for it is an **integer**. [→ easier phase-boundary recognition!]  
     > Luttinger Theorem currently works for very low temperatures only. It might well be an inherent limitation, restricting its domain to the quantum Mott transition. Also note that to have a sharp first order step-up at the transition a very highly frequency resolution is needed, so to make the IPT solver performance-critical! [solved brilliantly with fast convolutions, see the `solver-optimization` section below]
 
-- [x] `SOLVER-OPTIMIZATION`: make the SOPT runs faster, by optimizing the needed convolutions. [implemented a pow2-optimized FFTW-based custom algorithm that actually greatly improves the cpu-time for the `wres=2^15` calculation: almost a x10 overall speedup!]
+- [x] `SOLVER-OPTIMIZATION`: make the SOPT run faster, by optimizing the needed convolutions. [implemented a pow2-optimized FFTW-based custom algorithm that actually greatly improves the cpu-time for the `wres=2^15` calculation: almost a x10 overall speedup!]
 
-- [ ] `LOOP-OPTIMIZATION`: insert a "restarting" protocol for lines and full phase diagram spans. The gloc0=0 condition appers to be too unstable to obtain accurate UC1 values. Furthermore this would most probably speed up a lot the convergence, by lowering the required number of iterations.
+- [x] `LOOP-OPTIMIZATION`: insert a "restarting" protocol for lines and full phase diagram spans. The gloc0=0 condition appers to be too unstable to obtain accurate UC1 values. Furthermore this would most probably speed up a lot the convergence, by lowering the required number of iterations.
 
 - [x] `HPC-OPTIMIZATION`: configure an interface to cluster facilities and define the scheduling resources for optimal running [no distributed computing, just built-in handling of shared-memory parallelization]  
 

diff --git a/code/+phys/sopt.m b/code/+phys/sopt.m
@@ -40,7 +40,7 @@
 
 end
 
-if FAST && DEBUG
+if FAST & DEBUG % Avoiding && due to parfors...
   %% Cross-Check
      P_test = conv(A,A,'same');
      err1 = norm(P_test-P)/norm(P_test);

diff --git a/code/main.m b/code/main.m
@@ -11,21 +11,22 @@
 end
 
 %% INPUT: Physical Parameters 
-D    = 4.0;             % Bandwidth
-U    = 4.0;             % On-site Repulsion
+D    = 1.0;             % Bandwidth
+U    = 3.0;             % On-site Repulsion
 beta = inf;             % Inverse Temperature
 
 %% INPUT: Boolean Flags
 MottBIAS     = 0;       % Changes initial guess of gloc (strongly favours Mott phase)
-ULINE        = 1;       % Takes and fixes the given beta value and performs a U-driven line
+ULINE        = 0;       % Takes and fixes the given beta value and performs a U-driven line
 TLINE        = 0;       % Takes and fixes the given U value and performs a T-driven line
-UTSCAN       = 0;       % Ignores both given U and beta values and builds a full phase diagram
-SPECTRAL     = 1;       % Controls plotting of spectral functions
+UTSCAN       = 1;       % Ignores both given U and beta values and builds a full phase diagram
+SPECTRAL     = 0;       % Controls plotting of spectral functions
 PLOT         = 1;       % Controls plotting of *all static* figures
 GIF          = 0;       % Controls plotting of *animated* figures
 UARRAY       = 0;       % Activates SLURM scaling of interaction values
-TARRAY       = 0;       % Activates SLURM scaling of temperature values                    
-DEBUG        = 1;       % Activates debug prints / plots / operations
+TARRAY       = 0;       % Activates SLURM scaling of temperature values 
+RESTART      = 1;       % Activates the restarting strategies for lines               
+DEBUG        = 0;       % Activates debug prints / plots / operations
 FAST         = 1;       % Activates fast FFTW-based convolutions
 
 %% INPUT: Control Parameters
@@ -38,7 +39,7 @@
 Ustep = 0.10 * D;       % Hubbard U incremental step for phase diagrams
 Umax  = 6.00 * D;       % Hubbard U maximum value for phase diagrams
 Tmin  = 1e-3;           % Temperature U minimum value for phase diagrams
-Tstep = 1e-3;           % Temperature incremental step for phase diagrams
+Tstep = 1e-2;           % Temperature incremental step for phase diagrams
 Tmax  = 5e-2;           % Temperature U maximum value for phase diagrams
 dt    = 0.05;           % Frame duration in seconds (for GIF plotting)
 
@@ -63,17 +64,17 @@
 
 % Initial guess for the local Green's function
 if MottBIAS
-   gloc_0 = 0; % no bath -> no Kondo resonance -> strong Mott bias :)
+   seed = 0; % no bath -> no Kondo resonance -> strong Mott bias :)
 else
-   gloc_0 = phys.bethe(w + 10^(-3)*1i,D); % D is the DOS "radius"
+   seed = phys.bethe(w + 10^(-3)*1i,D);
 end
 
 %% Workflows
 
 if not( ULINE || TLINE || UTSCAN )
     %% Single (U,T) point
     fprintf('Single point evaluation @ U = %f, T = %f\n\n',U,1/beta); tic
-    [gloc,sloc] = dmft_loop(gloc_0,w,D,U,beta,mloop,mix,err);
+    [gloc,sloc] = dmft_loop(seed,w,D,U,beta,mloop,mix,err);
     Z = phys.zetaweight(w,sloc);
     I = phys.luttinger(w,sloc,gloc);
     S = phys.strcorrel(w,sloc);
@@ -87,12 +88,17 @@
 if ULINE
     %% U-driven MIT line [given T]
     fprintf('U-driven span @ T = %f\n\n',1/beta); tic
-    clear('gloc','sloc','Z','I','S')
+    clear('gloc','sloc','Z','I','S'); 
     Uvec = Umin:Ustep:Umax; NU = length(Uvec);
+    gloc_0 = seed; gloc = cell(NU,1); sloc = gloc;
+    Z = zeros(NU,1); I = zeros(NU,1); S = zeros(NU,1);
     for i = 1:NU 
         U = Uvec(i);
         fprintf('< U = %f\n',U);
         [gloc{i},sloc{i}] = dmft_loop(gloc_0,w,D,U,beta,mloop,mix,err,'quiet');
+        if(RESTART)
+           gloc_0 = gloc{i}; 
+        end
         Z(i) = phys.zetaweight(w,sloc{i});
         I(i) = phys.luttinger(w,sloc{i},gloc{i});
         S(i) = phys.strcorrel(w,sloc{i});
@@ -112,10 +118,15 @@
     fprintf('T-driven span @ U = %f\n\n',U); tic
     clear('gloc','sloc','Z','I','S')
     Tvec = Tmin:Tstep:Tmax; NT = length(Tvec);
+    gloc_0 = seed; gloc = cell(NT,1); sloc = gloc;
+    Z = zeros(NT,1); I = zeros(NT,1); S = zeros(NT,1);
     for i = 1:NT 
         T = Tvec(i); beta = 1/T;
         fprintf('< T = %f\n',T);
         [gloc{i},sloc{i}] = dmft_loop(gloc_0,w,D,U,beta,mloop,mix,err,'quiet');
+        if(RESTART)
+           gloc_0 = gloc{i}; 
+        end
         Z(i) = phys.zetaweight(w,sloc{i});
         I(i) = phys.luttinger(w,sloc{i},gloc{i});
         S(i) = phys.strcorrel(w,sloc{i});
@@ -134,23 +145,31 @@
     %% Full Phase-Diagram [U-driven]
     fprintf('Full phase diagram\n\n'); tic; 
     clear('gloc','sloc','Z','I','S')
-    %restart_gloc = gloc_0;
     Tvec = Tmin:Tstep:Tmax; NT = length(Tvec);
     Uvec = Umin:Ustep:Umax; NU = length(Uvec);
-    for i = 1:NT 
+    gloc = cell(NT,NU); sloc = gloc;
+    Z = zeros(NT,NU);  I = Z; S = Z;
+    feature('numcores');
+    parfor i = 1:NT 
         T = Tvec(i); beta = 1/T;
+        gloc_0 = seed;
         for j = 1:NU  
             U = Uvec(j);
             fprintf('< U = %f, T = %f\n',U, T);
             [gloc{i,j},sloc{i,j}] = dmft_loop(gloc_0,w,D,U,beta,mloop,mix,err,'quiet');
-            %restart_gloc = gloc{i,j};
+            if(RESTART)
+               gloc_0 = gloc{i,j};
+            end
             Z(i,j) = phys.zetaweight(w,sloc{i,j});
             I(i,j) = phys.luttinger(w,sloc{i,j},gloc{i,j});
             S(i,j) = phys.strcorrel(w,sloc{i,j});
         end
     end
     if(PLOT)
-        phasemap = plot.phase_diagram(S,Umin,Ustep,Umax,Tmin,Tstep,Tmax);
+        S = S/max(max(S));
+        zetamap = plot.phase_diagram(Z,Umin,Ustep,Umax,Tmin,Tstep,Tmax);
+        luttmap = plot.phase_diagram(I,Umin,Ustep,Umax,Tmin,Tstep,Tmax);
+        strcmap = plot.phase_diagram(S,Umin,Ustep,Umax,Tmin,Tstep,Tmax);
     end
     ET = [0,0,toc]; fmt = 'hh:mm:ss.SSS';
     fprintf('> %s < elapsed time\n\n',duration(ET,'format',fmt));

diff --git a/docs/hpc_scaling/README.md b/docs/hpc_scaling/README.md
@@ -58,4 +58,6 @@ In conclusion I would suggest to perform single runs just on your personal works
 
 ## Distributed computing
 
-At the moment no distributed computing support is provided for the package. Since the convergence of the scans (both lines and phase diagrams) is highly aided by a restarting protocol (making each single-point run dependent on the previous one) and even more the DMFT-loop is inherently all-entangled, there is no simple target for a MPI-like parallelization. Furthermore the overall performance of the current workflows seems quite acceptable to just discard the idea.
+Some form of distrubuted computing is implemented for only the two-dimensional phase diagrams: the external (temperature) loops are parallelized through the native `parfor` construct, which autodetects the number of available physical cores and activates a corresponding number of independent 'workers'. This is reasonable within the assumption that each internal U-driven line starts at a sufficiently low interaction, so to make unuseful a restarting protocol that connects lines at different temperatures. No profiling has been done so far on this procedure.
+
+> NB. If the Parallel Computing Toolbox is not installed (or if you are running GNU Octave) the `parfor` construct acts as an alias of the traditional sequential `for` loop.