Hi,
I am trying to use the resume() and pause() functions in IntelITT.jl to demarcate a block of code that I want to profile with VTune. I ran the following command in a terminal:
vtune -collect performance-snapshot -start-paused julia runtests.jl
where runtests.jl is the same script you have in your tests folder.
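For context, the pattern I am ultimately after looks roughly like the sketch below. It is only an illustration: `workload` is a hypothetical stand-in for my real code, and calling the functions as `IntelITT.resume()` and `IntelITT.pause()` assumes they are reachable under the module name.

```julia
using IntelITT  # assumes the package is installed in the active environment

# Hypothetical stand-in for the code that should be profiled.
function workload(n)
    s = 0.0
    for i in 1:n
        s += sqrt(i)
    end
    return s
end

# Run once outside the region so JIT compilation is not part of the measurement.
workload(1_000)

# With `-start-paused`, collection is expected to begin only here...
IntelITT.resume()
total = workload(50_000_000)
# ...and to stop again here.
IntelITT.pause()

println(total)
```

The warm-up call is there so that the measured region contains the steady-state loop rather than compilation.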
The output is as follows. You'll see that all the clocktick percentages are zero.
vtune: Warning: Pause command is not supported for managed code profiling. Runtime overhead is still possible. Data size limit may be exceeded.
vtune: Collection paused.
vtune: Collection paused.
┌ Warning: No ITT collector present. Make sure you have a collector attached to your process, e.g., by running under VTune.
└ @ Main ~/.julia/packages/IntelITT/X4ecE/src/utilities.jl:5
Test Summary: | Pass Broken Total Time
IntelITT | 9 3 12 1.0s
vtune: Collection stopped.
vtune: Using result path `/home/cobra/GLNS_lazy_edge_eval.jl/r014ps'
vtune: Executing actions 75 % Generating a report Elapsed Time: 1.620s
| Application execution time is too short. Metrics data may be unreliable.
| Consider reducing the sampling interval or increasing your application
| execution time.
|
Logical Core Utilization: 0.0% (0.000 out of 10)
| The metric value is low, which may signal a poor logical CPU cores
| utilization. Consider improving physical core utilization as the first step
| and then look at opportunities to utilize logical cores, which in some cases
| can improve processor throughput and overall performance of multi-threaded
| applications.
|
Physical Core Utilization: 0.0% (0.000 out of 10)
| The metric value is low, which may signal a poor physical CPU cores
| utilization caused by:
| - load imbalance
| - threading runtime overhead
| - contended synchronization
| - thread/process underutilization
| - incorrect affinity that utilizes logical cores instead of physical
| cores
| Run the HPC Performance Characterization analysis to estimate the
| efficiency of MPI and OpenMP parallelism or run the Locks and Waits
| analysis to identify parallel bottlenecks for other parallel runtimes.
|
Microarchitecture Usage: 0.0% of Pipeline Slots
| You code efficiency on this platform is too low.
|
| Possible cause: memory stalls, instruction starvation, branch misprediction
| or long latency instructions.
|
| Next steps: Run Microarchitecture Exploration analysis to identify the cause
| of the low microarchitecture usage efficiency.
|
Retiring: 0.0% of Pipeline Slots
Front-End Bound: 0.0% of Pipeline Slots
Bad Speculation: 0.0% of Pipeline Slots
Back-End Bound: 100.0% of Pipeline Slots
| A significant portion of pipeline slots are remaining empty. When
| operations take too long in the back-end, they introduce bubbles in the
| pipeline that ultimately cause fewer pipeline slots containing useful
| work to be retired per cycle than the machine is capable to support. This
| opportunity cost results in slower execution. Long-latency operations
| like divides and memory operations can cause this, as can too many
| operations being directed to a single execution port (for example, more
| multiply operations arriving in the back-end per cycle than the execution
| unit can support).
|
Memory Bound: 0.0% of Pipeline Slots
L1 Bound: 0.0% of Clockticks
DTLB Overhead: 0.0% of Clockticks
Load STLB Hit: 0.0% of Clockticks
Load STLB Miss: 0.0% of Clockticks
Loads Blocked by Store Forwarding: 0.0% of Clockticks
Lock Latency: 0.0% of Clockticks
Split Loads: 0.0% of Clockticks
4K Aliasing: 0.0% of Clockticks
FB Full: 0.0% of Clockticks
L2 Bound: 0.0% of Clockticks
L3 Bound: 0.0% of Clockticks
L3 Latency: 0.0% of Clockticks
SQ Full: 0.0% of Clockticks
DRAM Bound: 0.0% of Clockticks
Memory Bandwidth: 0.0% of Clockticks
Memory Latency: 0.0% of Clockticks
Local Memory: 0.0% of Clockticks
Remote Memory: 0.0% of Clockticks
Remote Cache: 0.0% of Clockticks
Store Bound: 0.0% of Clockticks
Store Latency: 0.0% of Clockticks
Split Stores: 0.0%
DTLB Store Overhead: 0.0% of Clockticks
Store STLB Hit: 0.0% of Clockticks
Store STLB Miss: 0.0% of Clockticks
Core Bound: 100.0% of Pipeline Slots
Memory Bound: 0.0% of Pipeline Slots
Cache Bound: 0.0% of Clockticks
DRAM Bound: 0.0% of Clockticks
NUMA: % of Remote Accesses: 0.0%
Vectorization: 0.0% of Packed FP Operations
Instruction Mix
SP FLOPs: 0.0% of uOps
Packed: 0.0% from SP FP
128-bit: 0.0% from SP FP
256-bit: 0.0% from SP FP
512-bit: 0.0% from SP FP
Scalar: 0.0% from SP FP
DP FLOPs: 0.0% of uOps
Packed: 0.0% from DP FP
128-bit: 0.0% from DP FP
256-bit: 0.0% from DP FP
512-bit: 0.0% from DP FP
Scalar: 0.0% from DP FP
x87 FLOPs: 0.0% of uOps
Non-FP: 0.0% of uOps
FP Arith/Mem Rd Instr. Ratio: 0.000
FP Arith/Mem Wr Instr. Ratio: 0.000
Collection and Platform Info
Application Command Line: julia "runtests.jl"
Operating System: 5.15.0-130-generic DISTRIB_ID=Ubuntu DISTRIB_RELEASE=22.04 DISTRIB_CODENAME=jammy DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
Computer Name: cobra-lambda
Result Size: 3.7 MB
Collection start time: 22:19:27 31/01/2025 UTC
Collection stop time: 22:19:29 31/01/2025 UTC
Collector Type: Driverless Perf per-process counting
CPU
Name: Intel(R) Xeon(R) Processor code named Skylake
Frequency: 3.312 GHz
Logical CPU Count: 10
LLC size: 17.3 MB
Cache Allocation Technology
Level 2 capability: not detected
Level 3 capability: available
Recommendations:
Increase execution time:
| Application execution time is too short. Metrics data may be unreliable.
| Consider increasing your application execution time.
Hotspots: Start with Hotspots analysis to understand the efficiency of your algorithm.
| Use Hotspots analysis to identify the most time consuming functions.
| Drill down to see the time spent on every line of code.
Threading: There is poor utilization of logical CPU cores (0.0%) in your application.
| Use Threading to explore more opportunities to increase parallelism in
| your application.
Microarchitecture Exploration: There is low microarchitecture usage (0.0%) of available hardware resources.
| Run Microarchitecture Exploration analysis to analyze CPU
| microarchitecture bottlenecks that can affect application performance.
If you want to skip descriptions of detected performance issues in the report,
enter: vtune -report summary -report-knob show-issues=false -r <my_result_dir>.
Alternatively, you may view the report in the csv format: vtune -report
<report_name> -format=csv.
vtune: Executing actions 100 % done
On the other hand, if I run without -start-paused, I get output with nonzero clocktick percentages:
vtune: Collection started. To stop the collection, either press CTRL-C or enter from another console window: vtune -r /home/cobra/GLNS_lazy_edge_eval.jl/r015ps -command stop.
┌ Warning: No ITT collector present. Make sure you have a collector attached to your process, e.g., by running under VTune.
└ @ Main ~/.julia/packages/IntelITT/X4ecE/src/utilities.jl:5
Test Summary: | Pass Broken Total Time
IntelITT | 9 3 12 1.0s
vtune: Collection stopped.
vtune: Using result path `/home/cobra/GLNS_lazy_edge_eval.jl/r015ps'
vtune: Executing actions 75 % Generating a report Elapsed Time: 1.626s
IPC: 1.140
SP GFLOPS: 0.000
DP GFLOPS: 0.000
x87 GFLOPS: 0.000
Average CPU Frequency: 4.208 GHz
Logical Core Utilization: 11.9% (1.186 out of 10)
| The metric value is low, which may signal a poor logical CPU cores
| utilization. Consider improving physical core utilization as the first step
| and then look at opportunities to utilize logical cores, which in some cases
| can improve processor throughput and overall performance of multi-threaded
| applications.
|
Physical Core Utilization: 11.9% (1.185 out of 10)
| The metric value is low, which may signal a poor physical CPU cores
| utilization caused by:
| - load imbalance
| - threading runtime overhead
| - contended synchronization
| - thread/process underutilization
| - incorrect affinity that utilizes logical cores instead of physical
| cores
| Run the HPC Performance Characterization analysis to estimate the
| efficiency of MPI and OpenMP parallelism or run the Locks and Waits
| analysis to identify parallel bottlenecks for other parallel runtimes.
|
Microarchitecture Usage: 24.7% of Pipeline Slots
| You code efficiency on this platform is too low.
|
| Possible cause: memory stalls, instruction starvation, branch misprediction
| or long latency instructions.
|
| Next steps: Run Microarchitecture Exploration analysis to identify the cause
| of the low microarchitecture usage efficiency.
|
Retiring: 24.7% of Pipeline Slots
Front-End Bound: 31.8% of Pipeline Slots
| Issue: A significant portion of Pipeline Slots is remaining empty due to
| issues in the Front-End.
|
| Tips: Make sure the code working size is not too large, the code layout
| does not require too many memory accesses per cycle to get enough
| instructions for filling four pipeline slots, or check for microcode
| assists.
|
Bad Speculation: 14.8% of Pipeline Slots
Back-End Bound: 28.7% of Pipeline Slots
| A significant portion of pipeline slots are remaining empty. When
| operations take too long in the back-end, they introduce bubbles in the
| pipeline that ultimately cause fewer pipeline slots containing useful
| work to be retired per cycle than the machine is capable to support. This
| opportunity cost results in slower execution. Long-latency operations
| like divides and memory operations can cause this, as can too many
| operations being directed to a single execution port (for example, more
| multiply operations arriving in the back-end per cycle than the execution
| unit can support).
|
Memory Bound: 10.4% of Pipeline Slots
L1 Bound: 8.1% of Clockticks
DTLB Overhead: 100.0% of Clockticks
Load STLB Hit: 100.0% of Clockticks
Load STLB Miss: 1.1% of Clockticks
Loads Blocked by Store Forwarding: 1.1% of Clockticks
Lock Latency: 0.0% of Clockticks
Split Loads: 0.2% of Clockticks
4K Aliasing: 0.4% of Clockticks
FB Full: 7.3% of Clockticks
L2 Bound: 1.4% of Clockticks
L3 Bound: 2.2% of Clockticks
L3 Latency: 5.5% of Clockticks
SQ Full: 0.2% of Clockticks
DRAM Bound: 7.0% of Clockticks
Memory Bandwidth: 1.5% of Clockticks
Memory Latency: 8.7% of Clockticks
Local Memory: 1.4% of Clockticks
Remote Memory: 0.0% of Clockticks
Remote Cache: 0.0% of Clockticks
Store Bound: 0.7% of Clockticks
Store Latency: 3.8% of Clockticks
Split Stores: 0.0%
DTLB Store Overhead: 1.4% of Clockticks
Store STLB Hit: 1.2% of Clockticks
Store STLB Miss: 0.2% of Clockticks
Core Bound: 18.3% of Pipeline Slots
| This metric represents how much Core non-memory issues were of a
| bottleneck. Shortage in hardware compute resources, or dependencies
| software's instructions are both categorized under Core Bound. Hence
| it may indicate the machine ran out of an OOO resources, certain
| execution units are overloaded or dependencies in program's data- or
| instruction- flow are limiting the performance (e.g. FP-chained long-
| latency arithmetic operations).
|
Memory Bound: 10.4% of Pipeline Slots
Cache Bound: 11.7% of Clockticks
DRAM Bound: 7.0% of Clockticks
NUMA: % of Remote Accesses: 0.0%
Vectorization: 0.0% of Packed FP Operations
Instruction Mix
SP FLOPs: 0.0% of uOps
Packed: 0.0% from SP FP
128-bit: 0.0% from SP FP
256-bit: 0.0% from SP FP
512-bit: 0.0% from SP FP
Scalar: 100.0% from SP FP
DP FLOPs: 0.0% of uOps
Packed: 0.0% from DP FP
128-bit: 0.0% from DP FP
256-bit: 0.0% from DP FP
512-bit: 0.0% from DP FP
Scalar: 100.0% from DP FP
x87 FLOPs: 0.0% of uOps
Non-FP: 100.0% of uOps
FP Arith/Mem Rd Instr. Ratio: 0.001
FP Arith/Mem Wr Instr. Ratio: 0.001
Collection and Platform Info
Application Command Line: julia "runtests.jl"
Operating System: 5.15.0-130-generic DISTRIB_ID=Ubuntu DISTRIB_RELEASE=22.04 DISTRIB_CODENAME=jammy DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
Computer Name: cobra-lambda
Result Size: 3.7 MB
Collection start time: 22:23:26 31/01/2025 UTC
Collection stop time: 22:23:28 31/01/2025 UTC
Collector Type: Driverless Perf per-process counting
CPU
Name: Intel(R) Xeon(R) Processor code named Skylake
Frequency: 3.312 GHz
Logical CPU Count: 10
LLC size: 17.3 MB
Cache Allocation Technology
Level 2 capability: not detected
Level 3 capability: available
Recommendations:
Hotspots: Start with Hotspots analysis to understand the efficiency of your algorithm.
| Use Hotspots analysis to identify the most time consuming functions.
| Drill down to see the time spent on every line of code.
Threading: There is poor utilization of logical CPU cores (11.9%) in your application.
| Use Threading to explore more opportunities to increase parallelism in
| your application.
Microarchitecture Exploration: There is low microarchitecture usage (24.7%) of available hardware resources.
| Run Microarchitecture Exploration analysis to analyze CPU
| microarchitecture bottlenecks that can affect application performance.
If you want to skip descriptions of detected performance issues in the report,
enter: vtune -report summary -report-knob show-issues=false -r <my_result_dir>.
Alternatively, you may view the report in the csv format: vtune -report
<report_name> -format=csv.
vtune: Executing actions 100 % done
What are possible issues here?