Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove unnecessary datasets field in methods database entries? #355

Open
yousefmoazzam opened this issue May 31, 2024 · 1 comment
Open
Labels
minor Nice to do but not vital question Further information is requested

Comments

@yousefmoazzam
Copy link
Collaborator

At a glance, it appears as if the datasets field is not used anywhere in max slices calculations.

Specifically, it's not used in the task runner's max slices calculations:

def determine_max_slices(self, section: Section, slicing_dim: int):
assert self.source is not None
data_shape = self.source.chunk_shape
max_slices = data_shape[slicing_dim]
if len(section) == 0:
section.max_slices = min(httomo.globals.MAX_CPU_SLICES, max_slices)
return
nsl_dim_l = list(data_shape)
nsl_dim_l.pop(slicing_dim)
non_slice_dims_shape = (nsl_dim_l[0], nsl_dim_l[1])
available_memory = get_available_gpu_memory(10.0)
available_memory_in_GB = round(available_memory / (1024**3), 2)
memory_str = (
f"The amount of the available GPU memory is {available_memory_in_GB} GB"
)
log_rank(memory_str, comm=self.comm)
if self._memory_limit_bytes != 0:
available_memory = min(available_memory, self._memory_limit_bytes)
log_once(
f"The memory has been limited to {available_memory / (1024**3):4.2f} GB",
level=logging.DEBUG,
)
max_slices_methods = [max_slices] * len(section)
# loop over all methods in section
has_gpu = False
for idx, m in enumerate(section):
if len(m.memory_gpu) == 0:
max_slices_methods[idx] = max_slices
continue
has_gpu = has_gpu or m.is_gpu
output_dims = m.calculate_output_dims(non_slice_dims_shape)
(slices_estimated, available_memory) = m.calculate_max_slices(
self.source.dtype,
non_slice_dims_shape,
available_memory,
)
max_slices_methods[idx] = min(max_slices, slices_estimated)
non_slice_dims_shape = output_dims
if not has_gpu:
section.max_slices = min(
min(max_slices_methods), httomo.globals.MAX_CPU_SLICES
)
else:
section.max_slices = min(max_slices_methods)

nor in the generic method wrapper's implementation of calculate_max_slices() (which all method wrappers inherit from):

def calculate_max_slices(
self,
data_dtype: np.dtype,
non_slice_dims_shape: Tuple[int, int],
available_memory: int,
) -> Tuple[int, int]:
"""If it runs on GPU, determine the maximum number of slices that can fit in the
available memory in bytes, and return a tuple of
(max_slices, available_memory)
The available memory may have been adjusted for the methods that follow, in case
something persists afterwards.
"""
if self.is_cpu or not gpu_enabled:
return int(100e9), available_memory
# if we have no information, we assume in-place operation with no extra memory
if len(self.memory_gpu) == 0:
return (
int(
available_memory
// (np.prod(non_slice_dims_shape) * data_dtype.itemsize)
),
available_memory,
)
# NOTE: This could go directly into the methodquery / method database,
# and here we just call calculated_memory_bytes
memory_bytes_method = 0
for field in self.memory_gpu:
subtract_bytes = 0
# loop over the dataset names given in the library file and extracting
# the corresponding dimensions from the available datasets
if field.method == "direct":
assert field.multiplier is not None
# this calculation assumes a direct (simple) correspondence through multiplier
memory_bytes_method += int(
field.multiplier
* np.prod(non_slice_dims_shape)
* data_dtype.itemsize
)
else:
(
memory_bytes_method,
subtract_bytes,
) = self._query.calculate_memory_bytes(
non_slice_dims_shape, data_dtype, **self.config_params
)
if memory_bytes_method == 0:
return available_memory - subtract_bytes, available_memory
return (
available_memory - subtract_bytes
) // memory_bytes_method, available_memory

I was wondering if the datasets field is an old remnant from when input datasets in the framework were named, and is thus no longer needed?

@yousefmoazzam yousefmoazzam added question Further information is requested minor Nice to do but not vital labels May 31, 2024
@yousefmoazzam
Copy link
Collaborator Author

More investigation is needed into whether it's needed or not, it seems that the query object for getting method info from the methods database YAML files references the datasets field:

def get_memory_gpu_params(
self,
) -> List[GpuMemoryRequirement]:
p = get_method_info(self.module_path, self.method_name, "memory_gpu")
if p is None or p == "None":
return []
if type(p) == list:
# convert to dict first
dd: dict = dict()
for item in p:
dd |= item
else:
dd = p
# now iterate and make it into one
assert (
len(dd["datasets"]) == len(dd["multipliers"]) == len(dd["methods"])
), "Invalid data"
return [
GpuMemoryRequirement(
dataset=d, multiplier=dd["multipliers"][i], method=dd["methods"][i]
)
for i, d in enumerate(dd["datasets"])
]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
minor Nice to do but not vital question Further information is requested
Projects
None yet
Development

No branches or pull requests

1 participant