Throughput statistics are not calculated when reads/writes are declared after `state.exec()` #175

alliepiper · 2024-06-27T12:43:25Z

The current implementation computes the throughput statistics in measure_cold, which is invoked during state.exec. This has the undesirable effect that throughput statistics are not generated when reads/writes are declared after state.exec.

The statistics are added here.

The only piece of information that is needed from measure_cold is the average CUDA time. However, this information is added to a summary here, so it should be possible to add a post-processing step to retrieve this information and compute throughput statistics after the KernelGenerator is finished.

The text was updated successfully, but these errors were encountered:

…ks (#16126) This PR addresses #13735 for reduction benchmarks. There are 3 new utils added. - `int64_t estimate_size(cudf::table_view)` returns a size estimate for the given table. #13984 was a previous attempt to add a similar utility, but this implementation uses `cudf::row_bit_count()` as suggested in #13984 (comment) instead of manually estimating the size. - `void set_items_processed(State& state, int64_t items_processed_per_iteration)` is a thin wrapper of `State.SetItemsProcessed()`. This wrapper takes `items_processed_per_iteration` as a parameter instead of `total_items_processed`. This could be useful to avoid repeating `State.iterations() * items_processed_per_iteration` in each benchmark class. - `void set_throughputs(nvbench::state& state)` is added as a workaround for NVIDIA/nvbench#175. We sometimes want to set throughput statistics after `state.exec()` calls especially when it is hard to estimate the result size upfront. Here are snippets of reduction benchmarks after this change. ``` $ cpp/build/benchmarks/REDUCTION_BENCH ... ----------------------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... ----------------------------------------------------------------------------------------------------------------- Reduction/bool_all/10000/manual_time 10257 ns 26845 ns 68185 bytes_per_second=929.907M/s items_per_second=975.078M/s Reduction/bool_all/100000/manual_time 11000 ns 27454 ns 63634 bytes_per_second=8.46642G/s items_per_second=9.09075G/s Reduction/bool_all/1000000/manual_time 12671 ns 28658 ns 55261 bytes_per_second=73.5018G/s items_per_second=78.922G/s ... $ cpp/build/benchmarks/REDUCTION_NVBENCH ... ## rank_scan ### [0] NVIDIA RTX A5500 | T | null_probability | data_size | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil | |-----------------|------------------|-----------|---------|------------|--------|------------|-------|----------|--------------|-----------| | I32 | 0 | 10000 | 16992x | 33.544 us | 14.95% | 29.446 us | 5.58% | 82.321M | 5.596 TB/s | 728.54% | | I32 | 0.1 | 10000 | 16512x | 34.358 us | 13.66% | 30.292 us | 2.87% | 80.020M | 5.286 TB/s | 688.17% | | I32 | 0.5 | 10000 | 16736x | 34.058 us | 14.31% | 29.890 us | 3.40% | 81.097M | 5.430 TB/s | 706.89% | ... ``` Note that, when the data type is a 1-byte-width type in the google benchmark result summary, `bytes_per_second` appears to be smaller than `items_per_second`. This is because the former is a multiple of 1000 whereas the latter is a multiple of 1024. They are in fact the same number. Implementation-wise, these are what I'm not sure if I made a best decision. - Each of new utils above is declared and defined in different files. I did this because I could not find a good place to have them all, and they seem to belong to different utilities. Please let me know if there is a better place for them. - All the new utils are defined in the global namespace since other util functions seem to have been defined in the same way. Please let me know if this is not the convention. Authors: - Jihoon Son (https://github.com/jihoonson) Approvers: - Mark Harris (https://github.com/harrism) - David Wendt (https://github.com/davidwendt) URL: #16126

jihoonson mentioned this issue Jun 28, 2024

Add throughput metrics for REDUCTION_BENCH/REDUCTION_NVBENCH benchmarks rapidsai/cudf#16126

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Throughput statistics are not calculated when reads/writes are declared after `state.exec()` #175

Throughput statistics are not calculated when reads/writes are declared after `state.exec()` #175

alliepiper commented Jun 27, 2024

Throughput statistics are not calculated when reads/writes are declared after state.exec() #175

Throughput statistics are not calculated when reads/writes are declared after state.exec() #175

Comments

alliepiper commented Jun 27, 2024

Throughput statistics are not calculated when reads/writes are declared after `state.exec()` #175

Throughput statistics are not calculated when reads/writes are declared after `state.exec()` #175