Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add L1 Benchmarking #1

Open
wants to merge 24 commits into
base: master
Choose a base branch
from
Open

Add L1 Benchmarking #1

wants to merge 24 commits into from

Conversation

schwarz-em
Copy link
Collaborator

The main changes in this PR are adding a python script (L1-benchmarking.py) to test coverage, accuracy and timeliness and adding a bash script (benchmarkingL1.sh) to run the tests automatically. The benchmarking script reads from the printed values in the RTL sim .out files to test the prefetcher metrics, so the printfs in the HellaCache prefetcher wrapper are necessary.

The way to calculate cache metrics is not standardized, so I'm open to suggestions on possible ways to make the calculations more useful/accurate.

Copy link
Contributor

@jerryz123 jerryz123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a neat way to get metrics on simple bare-metal benchmarks, and I think its useful for that use case, but I have concerns over how this approach will scale to larger workloads. There's several edge cases which I'm not sure this approach can handle accurately.

  • What happens when the same address is accessed multiple times across different phases of the program? I see this is blind to duplicates. Does that not matter for simple benchmarks?
  • What happens when there is timing dependent execution? For simple benchmarks a non-prefetch config might execute the same code path as a prefetch config, but on a large Linux workload timing will affect the code paths taken.

Also, how does this prefetcher evaluation approach compare to the evaluation method in DPC3/DPC2?




main()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this pythonic:

if __name__ == "__main__":
  main()

import sys

def main():
with open(sys.argv[1]) as f:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try to use python's argparse thing? https://docs.python.org/3/library/argparse.html

Notably, it lets you specify a "help" message, and makes it easier to add more arguments down the line.

src/main/scala/HellaCachePrefetcher.scala Show resolved Hide resolved
@@ -84,7 +87,28 @@ class HellaCachePrefetchWrapperModule(pP: CanInstantiatePrefetcher, outer: Hella
cache.io.cpu.req.bits.phys := false.B
cache.io.cpu.req.bits.no_alloc := false.B
cache.io.cpu.req.bits.no_xcpt := false.B
when (cache.io.cpu.req.fire()) { in_flight := true.B }
when (cache.io.cpu.req.fire()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave the original block, and add a separate block for the prefetch print statements, gated by a config option.

when (cache.io.cpu.req.fire() { in_flight := true.B }

if (printPrefetchingStats) {
  when (cache.io.cpu.req.fire()) {
    ...
  }
  when (prefetcher.io.snoop.valid) {
    ...
  }
  <etc>
}

You'll need to add a new parameter printPrefetcherStats to the config class WithHellaCachePrefetcher, HellaCachePrefetchWrapperFactory.apply, and HellaCachePrefetchWrapper

make run-binary CONFIG=PassthroughPrefetchSaturnConfig BINARY=$RISCV/riscv64-unknown-elf/share/riscv-tests/benchmarks/vvadd.riscv
cp output/chipyard.TestHarness.PassthroughPrefetchSaturnConfig/vvadd.out ../../generators/bar-prefetchers/benchmarking/no-prefetchL1-vvadd.out
cd ../../generators/bar-prefetchers/benchmarking
python L1-benchmarking.py "prefetchL1-vvadd.out" "no-prefetchL1-vvadd.out"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this file should be committed to this repo, as Prefetch2SaturnConfig and PassthroughPrefetchSaturnConfig aren't defined for most people.
Really its just wrapping
python3 L1-benchmarking.py <path-to-prefetch.out> <path-to-no-prefetch.out>.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an option is to give as the arguments two configs that are compared against one another. Then it is up to the script caller to give two configs that are roughly equiv.


print("misses prevented: " + str(misses_prevented))

coverage = (misses_prevented + 0.0) / (misses_prevented + len(with_prefetch_misses)) * 100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its easier to convert to float with float(misses_prevented)


//print snoop
when (prefetcher.io.snoop.valid) {
val last_snoop_addr = prefetcher.io.snoop.bits.address
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having Prefetch Addr, Snoop Addr, Resp Addr, Prefetch Resp Addr, I suggest you remove the spaces to have PrefetchAddr, SnoopAddr, RespAddr, PrefetchRespAddr. This makes your parsing in the python script uniform for all the cases.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also (marginally) speeds up simulation since prints are semi-costly.

@schwarz-em
Copy link
Collaborator Author

@jerryz123

I actually had a bit of difficulty trying to figure out how to handle the same address being accessed more than once and was not able to come up with a good solution. It wouldn't be so difficult if the snoop stream was the same for both the prefetch and non-prefetch configs, but since it's not, I can't compare them that way.
I haven't considered timing at all yet and I don't have any solutions for that off the top of my head. Similar to the above problem, I think what I have right now can give an okay approximation of prefetcher performance in that case, but it would require more time to find a good solution.
I'm definitely open to explore any possible ways to solve these - I'm just out of ideas for the first one in particular.

I'm not super familiar with the DPC2/DPC3 benchmarking system, but from looking through it quickly it seems like they're more concerned with overall performance (ie improvement in total cycles, CPI) than specific cache metrics. It looks like it's probably possible to do cache metric calculations from what is provided, but these can vary depending on who's doing the calculating since the metrics aren't standardized.

Copy link

@abejgonzalez abejgonzalez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nits to the code itself. Jerry hit most of the important ones already.

As for the timing-dependent execution, I figure for simple baremetal bmarks this isn't an issue. This solution should never scale to Linux booting/workloads because printing is probably too slow to get meaningful numbers for large bmarks (I don't think we want to enable this for something like SPEC). Instead, we should transition to added HW measurements (maybe even a perf counter / autocounter to measure things).

benchmarking/L1-benchmarking.py Show resolved Hide resolved
benchmarking/README.md Show resolved Hide resolved
make run-binary CONFIG=PassthroughPrefetchSaturnConfig BINARY=$RISCV/riscv64-unknown-elf/share/riscv-tests/benchmarks/vvadd.riscv
cp output/chipyard.TestHarness.PassthroughPrefetchSaturnConfig/vvadd.out ../../generators/bar-prefetchers/benchmarking/no-prefetchL1-vvadd.out
cd ../../generators/bar-prefetchers/benchmarking
python L1-benchmarking.py "prefetchL1-vvadd.out" "no-prefetchL1-vvadd.out"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an option is to give as the arguments two configs that are compared against one another. Then it is up to the script caller to give two configs that are roughly equiv.

Comment on lines 5 to 7
cd ../../..
source env.sh
cd sims/vcs

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Depending on where this script is run the ../../../ may not lead where you expect. Here is an example of having this work in all cases:

https://github.com/ucb-bar/chipyard/blob/dcf8da4b2d3a4deead95462fce36a6db5693ed45/scripts/build-toolchains.sh#L9-L18


//print snoop
when (prefetcher.io.snoop.valid) {
val last_snoop_addr = prefetcher.io.snoop.bits.address

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also (marginally) speeds up simulation since prints are semi-costly.

resp_cycles = resp[1]
resp_addr = resp[4]
if (resp_addr in snoops):
if (((int(resp_cycles) - int(snoops[resp_addr])) >= 5) and (int(resp_cycles) - int(last_resp_cycle) > 3)):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I would shift the int() casting to be right when you access the string (i.e. resp_cycles = int(resp[1]))


//print response
when (cache.io.cpu.resp.valid && !isPrefetch(cache.io.cpu.resp.bits.cmd)) {
printf(p"Cycle: ${Decimal(cycle_counter)}\tResp Addr: ${Hexadecimal(cache.io.cpu.resp.bits.addr)}\n")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't want a person to read this output and instead have the script parse/understand it only, you can simplify this to print a bit faster (i.e. a schema like "type, addr, cycle")

accesses = {"hits": [], "misses": []}
last_resp_cycle = 0
for line in lines:
if 'Snoop' in line:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This is relatively brittle. What happens if the config you build has other printfs inside it that have Resp/Snoop in it. IMO you should use Python re to create a regex match.

resp_cycles = resp[1]
resp_addr = resp[4]
if (resp_addr in snoops):
if (((int(resp_cycles) - int(snoops[resp_addr])) >= 5) and (int(resp_cycles) - int(last_resp_cycle) > 3)):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you determine 5/3 as the consts here? I would add a small comment here for other code readers.

Copy link

@abejgonzalez abejgonzalez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more small changes but I think this can be merged in soon (once they are addressed).

Cycle: decimal_int PrefetchRespAddr: hexadecimal_int
"""

snoop_regex = re.compile("^Cycle:\s*(\d+)\s*SnoopAddr:\s*([\da-f]+)\s*SnoopBlock:\s*([\da-f]+)\s*")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple minor changes:

  • \s* can probably be reduced to + since you are just doing spaces. If you have 1 space then I would just simplify and do .
  • IIRC re.match already starts the search at the start of the line so you don't need the ^.
  • I think you don't need to match the spaces at the end (you need to verify this). In any case, I think the better option would be to do .*.

@@ -1,12 +1,34 @@
#!/usr/bin/python

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Point to Python3 or 2

```
source benchmarkingL1.sh
source benchmarkingL1.sh [prefetch config] [non-prefetch config]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
source benchmarkingL1.sh [prefetch config] [non-prefetch config]
./benchmarking/L1-benchmarking.sh [prefetch config] [non-prefetch config]

@@ -1,15 +1,25 @@
#!/bin/bash
# Run L1 prefetcher benchmark tests
# TODO: Add parameterization for other cores
# Run L1 prefetcher benchmark test

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a small description of what $1 and $2 are supposed to be pointing to.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, this is running the test on vvadd. If we are going to add this, I think the benchmark should also be abstracted out.

if (printPrefetchingStats) {
when (cache.io.cpu.req.fire()) {
//print prefetch
val last_prefetch_addr = req.bits.block_address

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be in the if or outside of it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like you can just delete this?

}
if (printPrefetchingStats) {
when (prefetcher.io.snoop.valid) {
val last_snoop_addr = prefetcher.io.snoop.bits.address

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as comment above.

accuracy_resp = float(misses_prevented) / (num_unique_prefetch_resps-len(useful_prefetches) + misses_prevented) * 100
print("accuracy (acknowledged): " + str(accuracy_resp) + "%")
if (num_prefetches_accessed != 0):
timeliness = (delta_sum + 0.0) / num_prefetches_accessed

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
timeliness = (delta_sum + 0.0) / num_prefetches_accessed
timeliness = float(delta_sum) / num_prefetches_accessed

@@ -85,6 +88,31 @@ class HellaCachePrefetchWrapperModule(pP: CanInstantiatePrefetcher, outer: Hella
cache.io.cpu.req.bits.no_alloc := false.B
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using a statically assigned boolean to control printing, this should be enabled by a plusArg.

  val enable_print_stats = PlusArg("prefetcher_print_stats", width=1, default=0)(0)
  when (enable_print_stats) { 
    // your print statements
  }

Then when running the sim just set EXTRA_SIM_FLAGS=+prefetcher_print_stats=1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants