Memory balancing necessary? #16

Open
freifrauvonbleifrei opened this issue Dec 13, 2021 · 2 comments

I am again running a weak scaling study on Hawk (2 GB/core, 128 cores/node).
This time, I am trying to get sensible scaling data for GENE+DisCoTec, but the required resolutions and memory footprints make it a lot harder!

While running the scheme with one process group of 4096 workers finishes fine with the following parameter file,

[ct]
#last element has to be 1 -> specify species with special field
#dimension of problem
dim = 6
#minimum and maximum level of combination technique
lmin = 5 5 4 4 3 1
lmax = 10 5 9 9 8 1

#levelvector at which 2 final outputs are evaluated (with potential interpolation)
leval = 7 5 5 6 5 1
leval2 = 7 4 4 4 4 1
#indicates number of processors per dimension in domain decomposition
#this is the same for each process group
p = 8 1 8 8 8 1
#number of combination steps
ncombi = 12
#indicates whether combischeme is read from file
readspaces = 1
#indicates the file name of the 2 plot files
fg_file_path = ../plot.dat
fg_file_path2 = ../plot2.dat
#indicates which dimensions have boundary points
boundary = 1 1 1 1 1 0
#indicates which dimensions will be hierarchized
hierarchization_dims = 1 0 1 1 1 0

#possibility to reduce the level of the sparse grid for the combination step
reduceCombinationDimsLmin = 0 0 0 0 0 0
reduceCombinationDimsLmax = 1 0 1 1 1 0

[application]
#timestep size
dt = 0.6960E-02
#timesteps 100000
#number of timesteps between combinations
nsteps = 100000000
#allowed maximal simulation time (physical time) between combination steps
#if it would be exceeded, finish with fewer steps than defined above
combitime = 0.1
#physical parameters
#shat = 0.7960
kymin = 0.1525E-01
#box size
lx = 125.00
#numbers of species
numspecies = 1
#T for local runs F for global runs
GENE_local = F
#T for nonlinear F for linear runs
GENE_nonlinear = T
#The number of combinations after which we write out checkpoint to disk
checkpointFrequency = 50

[preproc]
#name of gene instance folders
basename = ginstance
#executable name of gene manager
executable = ./gene_hawk
#used mpi version
mpi = mpiexec
startscript = start.bat

[manager]
#number of process groups
ngroup = 1
#number of mpi ranks in each group
nprocs = 4096

it seems that we're running out of memory when running the same problem on eight process groups of 512 workers each:

[ct]
#last element has to be 1 -> specify species with special field
#dimension of problem
dim = 6
#minimum and maximum level of combination technique
lmin = 5 5 4 4 3 1
lmax = 10 5 9 9 8 1

#levelvector at which 2 final outputs are evaluated (with potential interpolation)
leval = 7 5 5 6 5 1
leval2 = 7 4 4 4 4 1
#indicates number of processors per dimension in domain decomposition
#this is the same for each process group
p = 4 1 4 8 4 1
#number of combination steps
ncombi = 12
#indicates whether combischeme is read from file
readspaces = 1
#indicates the file name of the 2 plot files
fg_file_path = ../plot.dat
fg_file_path2 = ../plot2.dat
#indicates which dimensions have boundary points
boundary = 1 1 1 1 1 0
#indicates which dimensions will be hierarchized
hierarchization_dims = 1 0 1 1 1 0

#possibility to reduce the level of the sparse grid for the combination step
reduceCombinationDimsLmin = 0 0 0 0 0 0
reduceCombinationDimsLmax = 1 0 1 1 1 0

[application]
#timestep size
dt = 0.6960E-02
#timesteps 100000
#number of timesteps between combinations
nsteps = 100000000
#allowed maximal simulation time (physical time) between combination steps
#if it would be exceeded, finish with fewer steps than defined above
combitime = 0.1
#physical parameters
#shat = 0.7960
kymin = 0.1525E-01
#box size
lx = 125.00
#numbers of species
numspecies = 1
#T for local runs F for global runs
GENE_local = F
#T for nonlinear F for linear runs
GENE_nonlinear = T
#The number of combinations after which we write out checkpoint to disk
checkpointFrequency = 50

[preproc]
#name of gene instance folders
basename = ginstance
#executable name of gene manager
executable = ./gene_hawk
#used mpi version
mpi = mpiexec
startscript = start.bat

[manager]
#number of process groups
ngroup = 8
#number of mpi ranks in each group
nprocs = 512
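
(For reference, the per-group worker counts follow directly from the domain decomposition p: 8·1·8·8·8·1 = 4096 workers per group in the first setup, and 4·1·4·8·4·1 = 512 in the second, matching nprocs.)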

It might be that one process group is assigned more memory-intensive GENE tasks than the others. This would mean that memory does not correlate strongly enough with the run time of the first step for our current round-robin assignment approach to work. (Remember: we have a run-time estimate, currently the grid-point-based LinearLoadModel, which we use to assign one grid to each process group up front. After that, the next grids are assigned to process groups as they finish. For instance, if there is a component grid that takes VERY long to compute the first time step, no other grid would be assigned to its group.)
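
To make that failure mode concrete, here is a minimal sketch of such a strategy (not the DisCoTec implementation; the task numbers are made up): one task per group up front, then each further task goes to whichever group is predicted to become idle first. If run time and memory are not correlated, nothing in this scheme bounds the memory per group.

#!/usr/bin/env python3
# Minimal sketch (NOT the DisCoTec code) of a run-time-driven assignment:
# one task per group up front, further tasks go to the group that becomes idle first.
import heapq

def assign_dynamically(tasks, num_groups):
    """tasks: list of (task_id, est_runtime, est_memory_gb) -> (per-group task lists, per-group memory)."""
    queue = sorted(tasks, key=lambda t: t[1], reverse=True)  # largest load estimate first
    groups = [[] for _ in range(num_groups)]
    mem = [0.0] * num_groups
    idle = [(0.0, g) for g in range(num_groups)]  # (predicted finish time, group index)
    heapq.heapify(idle)
    for task_id, runtime, memory in queue:
        t_free, g = heapq.heappop(idle)  # the group that finishes first gets the next task
        groups[g].append(task_id)
        mem[g] += memory
        heapq.heappush(idle, (t_free + runtime, g))
    return groups, mem

# made-up example: one slow-but-small task, many fast-but-large ones
tasks = [(0, 100.0, 5.0)] + [(i, 1.0, 40.0) for i in range(1, 17)]
groups, mem = assign_dynamically(tasks, num_groups=4)
print(groups)  # task 0 stays alone on its group ...
print(mem)     # ... while the remaining groups accumulate most of the memory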

I am currently working to verify that this is the exact problem.

The other possible source of error (that I can imagine) would be that the overhead of the 3 additionally allocated sparse grids is the culprit. Given the sparse grid size of 481371297 grid points * 16 byte (double complex) * 3 ≈ 21.5 gibibyte, and given that this overhead is spread over all 4096 workers, this seems unlikely.
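
For reference, a quick back-of-envelope check of that overhead (assuming the three extra sparse grid buffers are distributed evenly over the 4096 workers):

# back-of-envelope check for the three extra sparse-grid buffers (numbers from above)
sg_points = 481_371_297      # sparse grid size in grid points
bytes_per_value = 16         # double complex
extra_copies = 3
total_gib = sg_points * bytes_per_value * extra_copies / 2**30
per_worker_mib = total_gib * 1024 / 4096
print(f"{total_gib:.1f} GiB total, {per_worker_mib:.1f} MiB per worker")  # ~21.5 GiB, ~5.4 MiB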

This would make some kind of "memory balancing" necessary in the case of memory scarcity (in analogy to load balancing in the task assignment).
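
One possible starting point for such a memory balancing, as a minimal sketch (not part of DisCoTec; the per-task memory estimates would have to come from measurements like the ones below): assign the largest tasks first, always to the group with the smallest current memory total, and fail early if a task cannot fit anywhere.

# minimal sketch of a memory-aware assignment (not part of DisCoTec)
def assign_by_memory(task_memory_gb, num_groups, mem_limit_gb):
    """task_memory_gb: dict task_id -> estimated memory in GB."""
    totals = [0.0] * num_groups
    assignment = {g: [] for g in range(num_groups)}
    # largest task first, always into the currently least-loaded group
    for task, mem in sorted(task_memory_gb.items(), key=lambda kv: kv[1], reverse=True):
        g = min(range(num_groups), key=lambda i: totals[i])
        if totals[g] + mem > mem_limit_gb:
            raise RuntimeError(f"task {task} ({mem} GB) does not fit into any process group")
        assignment[g].append(task)
        totals[g] += mem
    return assignment, totals

This only balances memory, of course; it would still have to be combined with (or traded off against) the run-time balancing that the LinearLoadModel currently provides.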

freifrauvonbleifrei self-assigned this Dec 13, 2021
@freifrauvonbleifrei (Contributor, Author) commented Dec 13, 2021

OK, so after measuring the (approximate) memory needs of the individual tasks, this should be the per-group memory assignment we have shortly before the program dies (in GB):

mem_process_groups: {'0x30da350': 1503.1598320156336,
                         '0x30da3b0': 82.14805999398232,
                         '0x30da410': 9.39700399339199,
                         '0x30da470': 8.53497602045536,
                         '0x30da4d0': 27.16380400955677,
                         '0x30da530': 240.82131199538708,
                         '0x30da590': 51.25829599797726,
                         '0x30da5f0': 707.63435998559}

Each process group has less than 1024 GB available, so the problem arises when Task 27 at 1062 GB is added to process group '0x30da350'. Task 27 has [l_x, l_y, l_z, l_v, l_w] = [10 5 4 4 3].

This means that yes, we have an imbalance, but no, we could not have distributed the tasks in a way that this large task would fit into any one process group!
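
For completeness, the corresponding sanity check with these numbers (assuming 512 cores per group at 2 GB/core, i.e. at most 1024 GB per group):

mem_per_group_gb = 512 * 2   # 512 cores per group at 2 GB/core on Hawk (upper bound)
task_27_gb = 1062            # measured estimate for task 27, level vector [10 5 4 4 3]
print(task_27_gb <= mem_per_group_gb)  # False: task 27 does not fit even into an empty group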

@freifrauvonbleifrei (Contributor, Author) commented Dec 13, 2021

Here is the data I used (in case of follow-up problems):
assignments_512x8.csv
memory_tasks.csv

#!/usr/bin/env python3

import pandas as pd
from icecream import ic

# memory_tasks.csv alternates between a task-description line and a memory line;
# the even rows hold the task info (level vector and process group)
df = pd.read_table("memory_tasks.csv", sep=r"\s+",
                   header=None,
                   skiprows=lambda x: x % 2 != 0)

df.drop(df.columns[[0, 1, 3, 5, 9, 10, 11, 12, 13, 14]], axis=1, inplace=True)
df.columns = ["task", "l_x", "l_z", "l_v", "l_w", "pg"]
df["l_x"] = [int(x[3:]) for x in df["l_x"]]  # the l_x entries carry a text prefix; keep only the numeric level

# the odd rows hold the memory counters (in kilobyte)
mem = pd.read_table("memory_tasks.csv", sep=r"\s+",
                    header=None,
                    skiprows=lambda x: x % 2 != 1)
mem.drop(mem.columns[[0, 1, 2, 3, 4]], axis=1, inplace=True)
mem.columns = ["memory"]
# mem["memory"] = mem["memory"]*1000 # from kilobyte to byte
mem["memory"] = mem["memory"] / 1e6  # from kilobyte to gigabyte
# per-task memory = difference between consecutive counter samples
mem["memory"] = abs(mem["memory"].diff(periods=-1))

df["memory"] = mem["memory"]


ic(df)

ic(df.corr())

# assignments_512x8.csv maps each task to the process group it was assigned to
assignments = pd.read_table("assignments_512x8.csv", sep=r"\s+",
                            header=None)
assignments.drop(assignments.columns[[0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], axis=1, inplace=True)
assignments.columns = ["task", "pg"]

ic(assignments)

# accumulate the estimated memory per process group
mem_process_groups = {}
for pg in assignments["pg"]:
    mem_process_groups[pg] = 0.

for index, value in assignments.iterrows():
    ic(value.pg, value.task)
    # look up the task's memory estimate and add it to its group's total
    found = df.loc[df['task'] == value.task]
    add_memory = found["memory"].values[0]
    ic(add_memory)
    mem_process_groups[value.pg] += add_memory
    ic(index, mem_process_groups)