Add L1 Benchmarking #1
base: master
Conversation
This is a neat way to get metrics on simple bare-metal benchmarks, and I think it's useful for that use case, but I have concerns over how this approach will scale to larger workloads. There are several edge cases which I'm not sure this approach can handle accurately.
- What happens when the same address is accessed multiple times across different phases of the program? I see this is blind to duplicates. Does that not matter for simple benchmarks?
- What happens when there is timing-dependent execution? For simple benchmarks a non-prefetch config might execute the same code path as a prefetch config, but on a large Linux workload timing will affect the code paths taken.
Also, how does this prefetcher evaluation approach compare to the evaluation method in DPC3/DPC2?
benchmarking/L1-benchmarking.py
Outdated
main()
Make this pythonic:
if __name__ == "__main__":
main()
benchmarking/L1-benchmarking.py
Outdated
import sys

def main():
    with open(sys.argv[1]) as f:
Can you try to use Python's argparse module? https://docs.python.org/3/library/argparse.html
Notably, it lets you specify a "help" message, and it makes it easier to add more arguments down the line.
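A minimal sketch of what that could look like here (argument names are illustrative, not from the PR):

```python
import argparse

def main():
    parser = argparse.ArgumentParser(
        description="Compare L1 prefetcher metrics between two RTL sim logs")
    parser.add_argument("prefetch_out",
                        help="path to the .out file from the prefetch-enabled config")
    parser.add_argument("no_prefetch_out",
                        help="path to the .out file from the baseline (non-prefetch) config")
    args = parser.parse_args()

    with open(args.prefetch_out) as f:
        prefetch_lines = f.readlines()
    with open(args.no_prefetch_out) as f:
        no_prefetch_lines = f.readlines()
```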
@@ -84,7 +87,28 @@ class HellaCachePrefetchWrapperModule(pP: CanInstantiatePrefetcher, outer: Hella
cache.io.cpu.req.bits.phys := false.B
cache.io.cpu.req.bits.no_alloc := false.B
cache.io.cpu.req.bits.no_xcpt := false.B
when (cache.io.cpu.req.fire()) { in_flight := true.B }
when (cache.io.cpu.req.fire()) {
Leave the original block, and add a separate block for the prefetch print statements, gated by a config option:
when (cache.io.cpu.req.fire()) { in_flight := true.B }
if (printPrefetchingStats) {
  when (cache.io.cpu.req.fire()) {
    ...
  }
  when (prefetcher.io.snoop.valid) {
    ...
  }
  <etc>
}
You'll need to add a new parameter printPrefetcherStats to the config class WithHellaCachePrefetcher, to HellaCachePrefetchWrapperFactory.apply, and to HellaCachePrefetchWrapper.
benchmarking/benchmarkingL1.sh
Outdated
make run-binary CONFIG=PassthroughPrefetchSaturnConfig BINARY=$RISCV/riscv64-unknown-elf/share/riscv-tests/benchmarks/vvadd.riscv
cp output/chipyard.TestHarness.PassthroughPrefetchSaturnConfig/vvadd.out ../../generators/bar-prefetchers/benchmarking/no-prefetchL1-vvadd.out
cd ../../generators/bar-prefetchers/benchmarking
python L1-benchmarking.py "prefetchL1-vvadd.out" "no-prefetchL1-vvadd.out"
I don't think this file should be committed to this repo, as Prefetch2SaturnConfig and PassthroughPrefetchSaturnConfig aren't defined for most people. Really it's just wrapping python3 L1-benchmarking.py <path-to-prefetch.out> <path-to-no-prefetch.out>.
I think an option is to take the two configs to compare against one another as arguments. Then it is up to the script caller to give two configs that are roughly equivalent.
benchmarking/L1-benchmarking.py
Outdated
print("misses prevented: " + str(misses_prevented))

coverage = (misses_prevented + 0.0) / (misses_prevented + len(with_prefetch_misses)) * 100
I think it's easier to convert to float with float(misses_prevented).
//print snoop
when (prefetcher.io.snoop.valid) {
  val last_snoop_addr = prefetcher.io.snoop.bits.address
Instead of having Prefetch Addr, Snoop Addr, Resp Addr, Prefetch Resp Addr, I suggest you remove the spaces to have PrefetchAddr, SnoopAddr, RespAddr, PrefetchRespAddr. This makes your parsing in the python script uniform for all the cases.
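With the spaces removed, a single pattern can pull out any record type. A rough sketch (label set assumed from this thread; the longest alternative goes first so PrefetchResp isn't swallowed by Prefetch):

```python
import re

# Matches e.g. "Cycle: 10423    SnoopAddr: 80001abc"
record_regex = re.compile(r"Cycle:\s+(\d+)\s+(PrefetchResp|Prefetch|Snoop|Resp)Addr:\s+([\da-f]+)")

def parse_record(line):
    m = record_regex.search(line)
    if m is None:
        return None  # not a stats line
    cycle, kind, addr = int(m.group(1)), m.group(2), int(m.group(3), 16)
    return kind, cycle, addr
```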
This also (marginally) speeds up simulation since prints are semi-costly.
I actually had a bit of difficulty trying to figure out how to handle the same address being accessed more than once and was not able to come up with a good solution. It wouldn't be so difficult if the snoop stream were the same for both the prefetch and non-prefetch configs, but since it's not, I can't compare them that way. I'm not super familiar with the DPC2/DPC3 benchmarking system, but from looking through it quickly it seems like they're more concerned with overall performance (i.e., improvement in total cycles, CPI) than specific cache metrics. It looks like it's probably possible to do cache metric calculations from what is provided, but these can vary depending on who's doing the calculating, since the metrics aren't standardized.
Small nits to the code itself. Jerry hit most of the important ones already.
As for the timing-dependent execution, I figure for simple baremetal bmarks this isn't an issue. This solution should never scale to Linux booting/workloads because printing is probably too slow to get meaningful numbers for large bmarks (I don't think we want to enable this for something like SPEC). Instead, we should transition to adding HW measurements (maybe even a perf counter / autocounter to measure things).
benchmarking/benchmarkingL1.sh
Outdated
cd ../../..
source env.sh
cd sims/vcs
Nit: Depending on where this script is run, the ../../.. may not lead where you expect. Here is an example of having this work in all cases:
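A common pattern (sketch; it assumes the script stays under generators/bar-prefetchers/benchmarking/ so the same relative hops apply) is to resolve paths from the script's own location:

```bash
# Directory containing this script, regardless of the caller's working directory
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

cd "$SCRIPT_DIR/../../.."
source env.sh
cd sims/vcs
```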
benchmarking/L1-benchmarking.py
Outdated
resp_cycles = resp[1]
resp_addr = resp[4]
if (resp_addr in snoops):
    if (((int(resp_cycles) - int(snoops[resp_addr])) >= 5) and (int(resp_cycles) - int(last_resp_cycle) > 3)):
Nit: I would shift the int() casting to be right when you access the string (i.e. resp_cycles = int(resp[1])).
//print response
when (cache.io.cpu.resp.valid && !isPrefetch(cache.io.cpu.resp.bits.cmd)) {
  printf(p"Cycle: ${Decimal(cycle_counter)}\tResp Addr: ${Hexadecimal(cache.io.cpu.resp.bits.addr)}\n")
If you don't want a person to read this output and instead only want the script to parse/understand it, you can simplify this to print a bit faster (i.e. a schema like "type, addr, cycle").
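The script side of such a schema would then be a plain split (sketch; the exact field order is an assumption):

```python
def parse_csv_record(line):
    # Expected schema: "<type>,<addr_hex>,<cycle_dec>", e.g. "resp,80001abc,10423"
    kind, addr, cycle = line.strip().split(",")
    return kind, int(addr, 16), int(cycle)
```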
benchmarking/L1-benchmarking.py
Outdated
accesses = {"hits": [], "misses": []}
last_resp_cycle = 0
for line in lines:
    if 'Snoop' in line:
Nit: This is relatively brittle. What happens if the config you build has other printfs in it that contain Resp/Snoop? IMO you should use Python's re module to create a regex match.
benchmarking/L1-benchmarking.py
Outdated
resp_cycles = resp[1]
resp_addr = resp[4]
if (resp_addr in snoops):
    if (((int(resp_cycles) - int(snoops[resp_addr])) >= 5) and (int(resp_cycles) - int(last_resp_cycle) > 3)):
How did you determine 5/3 as the consts here? I would add a small comment here for other code readers.
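For example (sketch; the rationale comments are placeholders, since only the author knows why 5 and 3 were chosen):

```python
# Placeholder rationale -- replace with the real reasoning behind these values
MIN_MISS_LATENCY = 5  # snoop-to-resp gap (cycles) for an access to count as a miss
MIN_RESP_GAP = 3      # gap (cycles) since the previous resp, to skip back-to-back resps

def is_miss(resp_cycles, snoop_cycles, last_resp_cycle):
    return (resp_cycles - snoop_cycles >= MIN_MISS_LATENCY
            and resp_cycles - last_resp_cycle > MIN_RESP_GAP)
```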
Some more small changes but I think this can be merged in soon (once they are addressed).
benchmarking/L1-benchmarking.py
Outdated
Cycle: decimal_int PrefetchRespAddr: hexadecimal_int
"""

snoop_regex = re.compile("^Cycle:\s*(\d+)\s*SnoopAddr:\s*([\da-f]+)\s*SnoopBlock:\s*([\da-f]+)\s*")
A couple minor changes:
- \s* can probably be reduced to \s+ since you are just doing spaces. If you have 1 space then I would just simplify and match a single literal space.
- IIRC re.match already starts the search at the start of the line so you don't need the ^.
- I think you don't need to match the spaces at the end (you need to verify this). In any case, I think the better option would be to do .*.
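Applied to the snoop pattern, those changes might look like (untested sketch):

```python
import re

# "^" dropped (re.match already anchors at the start of the string),
# "\s*" tightened to "\s+", and the trailing whitespace match dropped
snoop_regex = re.compile(r"Cycle:\s+(\d+)\s+SnoopAddr:\s+([\da-f]+)\s+SnoopBlock:\s+([\da-f]+)")
```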
benchmarking/L1-benchmarking.py
Outdated
@@ -1,12 +1,34 @@
#!/usr/bin/python
Nit: Point to Python3 or 2
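For example, to pin it to Python 3:

```python
#!/usr/bin/env python3
```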
benchmarking/README.md
Outdated
```
source benchmarkingL1.sh
source benchmarkingL1.sh [prefetch config] [non-prefetch config]
Suggested change:
- source benchmarkingL1.sh [prefetch config] [non-prefetch config]
+ ./benchmarking/L1-benchmarking.sh [prefetch config] [non-prefetch config]
@@ -1,15 +1,25 @@
#!/bin/bash
# Run L1 prefetcher benchmark tests
# TODO: Add parameterization for other cores
# Run L1 prefetcher benchmark test
Add a small description of what $1 and $2 are supposed to be pointing to.
Additionally, this is running the test on vvadd. If we are going to add this, I think the benchmark should also be abstracted out.
if (printPrefetchingStats) {
  when (cache.io.cpu.req.fire()) {
    //print prefetch
    val last_prefetch_addr = req.bits.block_address
Should this be in the if or outside of it?
Seems like you can just delete this?
}
if (printPrefetchingStats) {
  when (prefetcher.io.snoop.valid) {
    val last_snoop_addr = prefetcher.io.snoop.bits.address
Same as comment above.
benchmarking/L1-benchmarking.py
Outdated
accuracy_resp = float(misses_prevented) / (num_unique_prefetch_resps - len(useful_prefetches) + misses_prevented) * 100
print("accuracy (acknowledged): " + str(accuracy_resp) + "%")
if (num_prefetches_accessed != 0):
    timeliness = (delta_sum + 0.0) / num_prefetches_accessed
Suggested change:
- timeliness = (delta_sum + 0.0) / num_prefetches_accessed
+ timeliness = float(delta_sum) / num_prefetches_accessed
@@ -85,6 +88,31 @@ class HellaCachePrefetchWrapperModule(pP: CanInstantiatePrefetcher, outer: Hella
cache.io.cpu.req.bits.no_alloc := false.B
Instead of using a statically assigned boolean to control printing, this should be enabled by a PlusArg:
val enable_print_stats = PlusArg("prefetcher_print_stats", width=1, default=0)(0)
when (enable_print_stats) {
  // your print statements
}
Then when running the sim, just set EXTRA_SIM_FLAGS=+prefetcher_print_stats=1.
The main changes in this PR are adding a Python script (L1-benchmarking.py) to measure coverage, accuracy, and timeliness, and adding a bash script (benchmarkingL1.sh) to run the tests automatically. The benchmarking script reads the values printed into the RTL sim .out files to compute the prefetcher metrics, so the printfs in the HellaCache prefetcher wrapper are necessary.
The way to calculate cache metrics is not standardized, so I'm open to suggestions on possible ways to make the calculations more useful/accurate.
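For reference, the three metrics from the quoted diffs boil down to the following (condensed sketch; variable names come from the snippets above, and the reading of delta_sum as the sum of prefetch-to-use cycle gaps is an assumption):

```python
def report_metrics(misses_prevented, with_prefetch_misses,
                   num_unique_prefetch_resps, useful_prefetches,
                   delta_sum, num_prefetches_accessed):
    # coverage: share of baseline misses that the prefetcher eliminated
    coverage = float(misses_prevented) / (misses_prevented + len(with_prefetch_misses)) * 100
    print("coverage: " + str(coverage) + "%")

    # accuracy (acknowledged): prevented misses over all unique acknowledged
    # prefetches, excluding the ones already counted as useful
    accuracy_resp = float(misses_prevented) / (
        num_unique_prefetch_resps - len(useful_prefetches) + misses_prevented) * 100
    print("accuracy (acknowledged): " + str(accuracy_resp) + "%")

    # timeliness: average cycle gap between a prefetch and its first demand use
    if num_prefetches_accessed != 0:
        timeliness = float(delta_sum) / num_prefetches_accessed
        print("timeliness: " + str(timeliness))
```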