Home
The Phase II geometry makes use of double-layered modules, so an incoming particle passes through both layers and deposits energy at two points in coincidence close together. We look for pairs of hits within a module that fall inside a window set by what we expect the helix of a low-pT (0.8 GeV) track to look like. If the hits are within this window, we link the two hits into a mini doublet object. Similarly, mini doublets in successive layers that are close enough together and point in the same direction can be linked to form segments. Pixel line segments are also imported as segments.
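To make the idea concrete, here is a minimal sketch of such a window check. The names, signature, and per-module window values are illustrative assumptions, not the actual SDL implementation (the real cuts depend on the module geometry):

```cpp
// Sketch only: two hits on the two sensor layers of a module form a
// mini doublet if their separation falls inside dz/dphi windows derived
// from the helix of a pT = 0.8 GeV track through that module.
__device__ bool passesMiniDoubletWindow(float dz, float dphi,
                                        float dzWindow, float dphiWindow)
{
    return fabsf(dz) < dzWindow && fabsf(dphi) < dphiWindow;
}
```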
Afterwards, segments that share a mini doublet are linked into triplets (T3), which contain 3 mini doublets. Triplets that share a mini doublet are linked to make quintuplets (T5), which contain 5 mini doublets. In both cases, the pair must pass criteria indicating that the angle between the segments is small, as well as consistency cuts in the r-z and r-phi planes. These thresholds need to be tuned.
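A similarly minimal sketch of the linking condition, again with hypothetical names: two segments qualify only if the outer mini doublet of the inner segment is the inner mini doublet of the outer segment, and the bend between them stays below a tuned cut.

```cpp
// Sketch only: the real code also applies the r-z and r-phi
// consistency cuts mentioned above.
__device__ bool canLinkSegments(unsigned int innerSegmentOuterMD,
                                unsigned int outerSegmentInnerMD,
                                float segmentAngle, float angleCut)
{
    return innerSegmentOuterMD == outerSegmentInnerMD
        && fabsf(segmentAngle) < angleCut;
}
```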
After these objects are made, we go on to link T5s with pixel line segments (pLS) that point towards the created T5. This constructs the pT5, which is our longest natural object. These objects are cleaned through a duplicate-removal step that counts the identical hits used between objects and keeps only the one with the best track score (measured from the residuals in r-z and r-phi). The cleaned objects are stored as track candidates (TC), and the T3s, T5s, and pLS used are marked as such.
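The duplicate-removal idea can be sketched as follows. The function name and overlap threshold are hypothetical; the real track score and thresholds live in the SDL code:

```cpp
// Count the hits shared between two candidates; if the overlap exceeds
// the threshold, the caller keeps the candidate with the better track
// score (smaller r-z / r-phi residuals) and flags the other as duplicate.
__host__ __device__ unsigned int countSharedHits(const unsigned int* hitsA,
                                                 unsigned int nA,
                                                 const unsigned int* hitsB,
                                                 unsigned int nB)
{
    unsigned int shared = 0;
    for (unsigned int i = 0; i < nA; ++i)
        for (unsigned int j = 0; j < nB; ++j)
            if (hitsA[i] == hitsB[j]) ++shared;
    return shared;
}
```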
After these are added, we construct pT3s from the remaining T3s and pLS in the same way. These are once again cleaned amongst themselves, and then they are cleaned against the pT5 collection. The remaining pT3 objects are added to the TC collection. Then we go back to the T5s, clean them again, and add what few extra T5s remain to the TC. Finally, we add in pLS that were not already used in the pT3 or pT5 collections. We only add in quintuplet pLS that pass the cleaning (not the triplet pLS, although those can be used in pT5 and pT3 creation). This fills out the TC collection. After the TC collection is made, the tracks can then be extended to add in any additional hits in subsequent layers that might have been missed.
First run
source setup.sh
This sets up the environment.
Base command:
sdl_validate_efficiency <dataset>
Where <dataset> is usually either PU200 or muonGun, depending on the sample.
By default, this runs through the explicit and unified versions with 200 events. You can also specify a specific version (including the caching versions) and a specific number of events.
sdl_validate_efficiency <dataset> <Type> <nevents>
Where <Type> = {"explicit", "unified", "explicit_cache", "unified_cache"}. So, for example, you can run:
sdl_validate_efficiency PU200 "explicit_cache" 30
This runs physics validation for the pileup 200 sample, using the explicit cache version with only 30 events. You can check the most recent validation results, broken up by object type, at this URL.
Similarly, you can run the timing packages in the same way:
sdl_timing <dataset> <Type> <nevents>
So for example
sdl_timing PU200
will run all types (explicit and unified, both with and without caching) over 200 events. Your output will end with a table of the averages. This also runs for multiple streams whenever possible. Your result will look like this:
Each column shows the average kernel runtime per event, and the total time per event is shown under "event". However, we don't care about some of these kernels: for example, we expect the hits to already be on the GPU, and we also expect a clean pixel line segment collection, so we exclude those times. The result is shown in "short". The effective time helps to understand how this "short" time changes when we introduce multistreaming; this is the time we report as our actual time. There are some issues with this method and we know it is not exact, but I think it is good enough.
Note that both of these packages are located in “efficiency/bin/” if you want to look into the scripts.
For testing purposes, you don't need (or want) to use the full validation and timing packages. Instead, you can make specific versions yourself and run them with different commands.
To make the code you can run
sdl_make_tracklooper
It comes with a list of options, shown here:
You always want to run with -m, -8, and -e, which clean before you make, set the pT threshold to 0.8 (the preferred value right now), and run the track extension code. -3 is not fully functional. Mostly you will want to toggle between -x and -c.
-x controls the version of memory management. Without this flag, you are running with unified memory, meaning memory allocations are done using cudaMallocManaged. This uses CUDA's unified memory management, which makes allocations accessible from both the host and device and handles transfers between them automatically. It is useful for getting set up quickly without errors, but it has some overhead. Thus, these times are typically slower than using explicit memory, where you allocate on the device only (using cudaMalloc) and have to perform memory transfers yourself when needed (using cudaMemcpyAsync).
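For reference, a minimal sketch of the two modes (the buffer names and sizes are illustrative, not the actual SDL allocations):

```cpp
#include <cuda_runtime.h>

// Unified mode: one pointer usable on host and device; the CUDA runtime
// migrates the data on demand, which is convenient but adds overhead.
void allocateUnified(float** buf, size_t n)
{
    cudaMallocManaged(buf, n * sizeof(float));
}

// Explicit mode: device-only allocation plus a manual async transfer.
void allocateExplicit(float** devBuf, const float* hostBuf, size_t n,
                      cudaStream_t stream)
{
    cudaMalloc(devBuf, n * sizeof(float));
    cudaMemcpyAsync(*devBuf, hostBuf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
}
```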
-c controls the caching allocator. Without it, we do not use caching; with it, we do. The caching allocator sets aside a chunk of memory the first time you allocate. Then, in successive events, instead of actually freeing your memory, you effectively just reset it: the space stays reserved so that future allocations run faster. This usually makes the first event slower but the rest faster.
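The caching idea, as a very reduced sketch (a real caching allocator, such as the one in CUB, also bins by size and is stream-aware):

```cpp
#include <cuda_runtime.h>
#include <map>

struct SimpleCachingAllocator {
    std::multimap<size_t, void*> cache; // cached device blocks, by size
    std::map<void*, size_t> sizes;      // actual size of each block

    void* allocate(size_t bytes) {
        auto it = cache.lower_bound(bytes);
        if (it != cache.end()) {        // later events: reuse a block
            void* p = it->second;
            cache.erase(it);
            return p;
        }
        void* p = nullptr;              // first event: a real allocation
        cudaMalloc(&p, bytes);
        sizes[p] = bytes;
        return p;
    }
    void release(void* p) {             // "free" just returns it to the cache
        cache.emplace(sizes[p], p);
    }
};
```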
-l toggles on hit preloading. Currently this takes 2.5 GB for 200 events (about 12 MB per event). It allocates the memory and loads the hits for every event during the setup phase of the run. Then, at run time, the pointer to the hit struct is simply copied for each event.
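A sketch of what preloading amounts to (the struct and names are illustrative):

```cpp
#include <vector>

// All hits are allocated and copied once during setup; per event, the
// "load" is then just a shallow copy of the pointers in the struct.
struct EventHits { float* x; float* y; float* z; unsigned int nHits; };

std::vector<EventHits> preloadedHits;   // filled once in the setup phase

void loadEvent(int ievt, EventHits& current)
{
    current = preloadedHits[ievt];      // per-event cost: a pointer copy
}
```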
Remember that all memory allocations need to be freed to avoid a memory leak. You can check for leaks using nvprof, which I will talk about later.
Sometimes, making fails. Then a hidden log file will be reported to you. You can check this file to see what made the process fail (I usually scan the file for "err" to see what's up). Note that the build often fails spuriously and all you need to do is try again (or sometimes rerun the setup and then retry). I don't know why this happens, but you'll quickly learn what to look for in the log file to tell whether this is happening or whether it is a real issue (for example, just this line: collect2: error: ld returned 1 exit status).
When you make the project, you will see the compilation options selected, so you can double check you have made the correct version of the code.
After the project is compiled, you can run it with
./bin/sdl -n <nevents> -v <verbose> -w <writeout> -s <streams> -i <dataset>
As before, dataset = {PU200, muonGun} and nevents is the number of events, usually a test case of 1-30 or the full 200 events used in validation (though I think more events exist in the samples).
-s sets the number of streams. It is usually set to 1, but should be able to go up to 8 for most versions. The explicit cache version uses too much memory at the moment, so you can only get up to 4-6 streams.
-v sets the verbosity level:
0: prints nothing but the event number
1: prints just the timing information
2: prints the timing, the timing for each kernel, and the object multiplicities
-v 2 is great for debugging; usually you are looking to make sure you have about the same number of objects for the first event.
-w is the writeout, usually either 0 or 2. Adding this makes the code run longer after the objects have been made, since you now have to do memory transfers back to the host. This is necessary for the validation; I usually keep it at 0 unless I am actually messing with the validation.
So, all in all, this can be run like
./bin/sdl -n 1 -v 2 -w 0 -s 1 -i PU200
which runs 1 PU200 event over 1 stream with object printouts but no validation writeout, i.e., good for debugging.
Before submitting a pull request, you always want to do a quick check to make sure there are no silent runtime errors or memory leaks. For memory leaks, you can run using nvprof.
nvprof ./bin/sdl -n 1 -v 2 -w 0 -s 1 -i PU200
This runs the profiler over the code. Note that the usual kernel times will be longer but you do have the individual kernel runtimes from the profiler itself. The output looks like this:
However, what you are really looking for is the number of mallocs and frees. The number of cudaFreeHost calls must equal the number of cudaMallocHost calls, and the number of cudaFree calls must equal the number of cudaMalloc plus cudaMallocManaged calls. There should be no instances of cudaMallocManaged in the explicit versions!
You can also look for race conditions or other general issues using cuda-memcheck:
cuda-memcheck ./bin/sdl -n 1 -v 2 -w 0 -s 1 -i PU200
This should report 0 errors.
For every PR you make, you should run these two tests plus the validation and timing (and note any changes in the validation plots and any significant changes in the timing). This will make it easier to find issues.
Profiling is done via Nsight Systems and Nsight Compute. Nsight Systems is used to generate a timeline of the GPU commands. After the code is compiled, the timeline is made with the following command:
nsys profile -f true -o <output_name> --sample=cpu --stats true --trace cuda,nvtx ./bin/sdl -n 2 -v 0 -w 0 -l -i PU200
-f true forces an overwrite of the output file if it already exists. The sdl code can be run with whatever options you wish; I usually turn off verbosity and ntuple output writing, since this matches the timing conditions. You can run as many events as you want with however many streams.
Individual kernel profiling is done using Nsight Compute. Again, after compiling, this is run with the command:
ncu --set full -o <output_name> -f --import-source on ./bin/sdl -n 1 -v 0 -i PU200
The resulting files from these commands can be transferred to your local machine and opened with NVIDIA's profiling suite.
The topmost level of the code is found at bin/sdl.cc. This loads the data from the ntuple (the preloading step) and then runs our code over each event, in this order:
- Create the number of streams set by multistreaming
- Run over the events (where each stream gets a chunk of events)
- Load hits onto the GPU
- Make mini doublets
- Make line segments
- Make triplets
- Make T5s
- Clean pixel line segments (excluded from timing)
- Make pT5s
- Make pT3s
- Make track candidates
- Make track extensions
Each of these steps is written out in the host code within SDL/Events.cu. Each step follows the general process below (a minimal sketch follows the list):
- Use the maximum number of objects (set within Kernels.cuh) to allocate the maximum amount of space that might be needed
- Use information from module maps to estimate the number of objects that will be made
- Send that information to the device
- Launch the creation kernel
- Free any memory used within the function that isn’t needed elsewhere
- Run duplicate cleaning
- Add objects to events (mostly for readout purposes)
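Here is a minimal, compilable sketch of that pattern. Every name in it (the struct, kernel, counts, and launch configuration) is an illustrative placeholder, not the real SDL API:

```cpp
#include <cuda_runtime.h>

struct Objects { unsigned int* nPerModule; float* params; };

__global__ void createObjectsKernel(Objects objs) { /* linking logic */ }

void createObjectStep(cudaStream_t stream, unsigned int nModules,
                      const unsigned int* hostEstimates,
                      unsigned int maxPerModule)
{
    Objects objs;
    // Allocate the worst-case amount of space (max counts in Kernels.cuh).
    cudaMalloc(&objs.nPerModule, nModules * sizeof(unsigned int));
    cudaMalloc(&objs.params, nModules * maxPerModule * sizeof(float));
    // Send the module-map occupancy estimates to the device.
    cudaMemcpyAsync(objs.nPerModule, hostEstimates,
                    nModules * sizeof(unsigned int),
                    cudaMemcpyHostToDevice, stream);
    // Launch the creation kernel.
    createObjectsKernel<<<nModules, 128, 0, stream>>>(objs);
    // ... then free scratch memory, run duplicate cleaning, and register
    // the objects with the event (omitted here).
}
```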
The bulk of the code is found in SDL/Kernels.cu, which hosts all the kernel functions.
Objects with individual functions are located in SDL/<Object>.cu, for example SDL/PixelQuintuplet.cu. This includes the memory allocation functions for that object as well as the free functions. The default algorithms for that object are usually stored there too, along with the "addObjectToMemory" functions, which store the object to be used later or read out.
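As a rough sketch of that per-object layout (the struct and field names are illustrative, not the real ones):

```cpp
#include <cuda_runtime.h>

struct PixelQuintupletsSketch {
    unsigned int* pixelIndices;       // pLS index of each pT5
    unsigned int* T5Indices;          // T5 index of each pT5
    unsigned int* nPixelQuintuplets;  // running count
};

void createPixelQuintupletsInUnifiedMemory(PixelQuintupletsSketch& pq,
                                           unsigned int maxObjects)
{
    cudaMallocManaged(&pq.pixelIndices, maxObjects * sizeof(unsigned int));
    cudaMallocManaged(&pq.T5Indices, maxObjects * sizeof(unsigned int));
    cudaMallocManaged(&pq.nPixelQuintuplets, sizeof(unsigned int));
}

__device__ void addPixelQuintupletToMemory(PixelQuintupletsSketch& pq,
                                           unsigned int pixelIndex,
                                           unsigned int T5Index)
{
    // Atomically reserve a slot, then store the object for later readout.
    unsigned int idx = atomicAdd(pq.nPixelQuintuplets, 1u);
    pq.pixelIndices[idx] = pixelIndex;
    pq.T5Indices[idx] = T5Index;
}

void freePixelQuintuplets(PixelQuintupletsSketch& pq)
{
    cudaFree(pq.pixelIndices);
    cudaFree(pq.T5Indices);
    cudaFree(pq.nPixelQuintuplets);
}
```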
Most of the writeout for the validation and extra studies is found in "code/core/write_sdl_ntuple.cc". This loops over the produced number of each object, skips any that are marked as duplicates, then gets the relevant information and pushes the results back into vectors, which are saved in the produced ntuple, debug.root.
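The loop pattern, sketched with placeholder names (these are not the real branch names):

```cpp
#include <vector>

// Loop over the produced objects, skip those flagged as duplicates, and
// push the surviving values into vectors that become branches of the
// debug.root ntuple.
void writeObjectBranch(unsigned int nObjects, const bool* isDuplicate,
                       const float* pt, std::vector<float>& out_pt)
{
    for (unsigned int i = 0; i < nObjects; ++i) {
        if (isDuplicate[i]) continue;  // cleaned duplicates are skipped
        out_pt.push_back(pt[i]);
    }
}
```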