No profiling data is collected if I use vtune with -start-paused option #6

Open
EpicDuckPotato opened this issue Jan 31, 2025 · 1 comment


@EpicDuckPotato

Hi,
I am trying to use the resume() and pause() functions in IntelITT.jl to demarcate a block of code I want to profile with VTune. I ran the following command in a terminal:

vtune -collect performance-snapshot -start-paused julia runtests.jl

where runtests.jl is the same script you have in your tests folder.
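For reference, this is the pattern I am trying to get working. A minimal standalone sketch (assuming IntelITT.jl is installed in the active environment; `workload` is a hypothetical stand-in for the code under test):

```julia
# Minimal sketch, assuming IntelITT.jl is available in the active environment.
# `workload` is a hypothetical placeholder for the code to be profiled.
using IntelITT

function workload(n)
    s = 0.0
    for i in 1:n
        s += sin(i)
    end
    return s
end

workload(1_000)       # warm-up call so compilation happens outside the region

IntelITT.resume()     # with -start-paused, collection should begin here
result = workload(10_000_000)
IntelITT.pause()      # collection should stop here

println(result)
```

With -start-paused, VTune should stay paused until the resume call, so only the region between resume() and pause() would be profiled.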

The output is as follows. Note that all of the clocktick percentages are zero.

vtune: Warning: Pause command is not supported for managed code profiling. Runtime overhead is still possible. Data size limit may be exceeded.
vtune: Collection paused.
vtune: Collection paused.
┌ Warning: No ITT collector present. Make sure you have a collector attached to your process, e.g., by running under VTune.
└ @ Main ~/.julia/packages/IntelITT/X4ecE/src/utilities.jl:5
Test Summary: | Pass  Broken  Total  Time
IntelITT      |    9       3     12  1.0s
vtune: Collection stopped.
vtune: Using result path `/home/cobra/GLNS_lazy_edge_eval.jl/r014ps'
vtune: Executing actions 75 % Generating a report                              Elapsed Time: 1.620s
 | Application execution time is too short. Metrics data may be unreliable.
 | Consider reducing the sampling interval or increasing your application
 | execution time.
 |
Logical Core Utilization: 0.0% (0.000 out of 10)
 | The metric value is low, which may signal a poor logical CPU cores
 | utilization. Consider improving physical core utilization as the first step
 | and then look at opportunities to utilize logical cores, which in some cases
 | can improve processor throughput and overall performance of multi-threaded
 | applications.
 |
    Physical Core Utilization: 0.0% (0.000 out of 10)
     | The metric value is low, which may signal a poor physical CPU cores
     | utilization caused by:
     |     - load imbalance
     |     - threading runtime overhead
     |     - contended synchronization
     |     - thread/process underutilization
     |     - incorrect affinity that utilizes logical cores instead of physical
     |       cores
     | Run the HPC Performance Characterization analysis to estimate the
     | efficiency of MPI and OpenMP parallelism or run the Locks and Waits
     | analysis to identify parallel bottlenecks for other parallel runtimes.
     |
Microarchitecture Usage: 0.0% of Pipeline Slots
 | You code efficiency on this platform is too low.
 | 
 | Possible cause: memory stalls, instruction starvation, branch misprediction
 | or long latency instructions.
 | 
 | Next steps: Run Microarchitecture Exploration analysis to identify the cause
 | of the low microarchitecture usage efficiency.
 |
    Retiring: 0.0% of Pipeline Slots
    Front-End Bound: 0.0% of Pipeline Slots
    Bad Speculation: 0.0% of Pipeline Slots
    Back-End Bound: 100.0% of Pipeline Slots
     | A significant portion of pipeline slots are remaining empty. When
     | operations take too long in the back-end, they introduce bubbles in the
     | pipeline that ultimately cause fewer pipeline slots containing useful
     | work to be retired per cycle than the machine is capable to support. This
     | opportunity cost results in slower execution. Long-latency operations
     | like divides and memory operations can cause this, as can too many
     | operations being directed to a single execution port (for example, more
     | multiply operations arriving in the back-end per cycle than the execution
     | unit can support).
     |
        Memory Bound: 0.0% of Pipeline Slots
            L1 Bound: 0.0% of Clockticks
                DTLB Overhead: 0.0% of Clockticks
                    Load STLB Hit: 0.0% of Clockticks
                    Load STLB Miss: 0.0% of Clockticks
                Loads Blocked by Store Forwarding: 0.0% of Clockticks
                Lock Latency: 0.0% of Clockticks
                Split Loads: 0.0% of Clockticks
                4K Aliasing: 0.0% of Clockticks
                FB Full: 0.0% of Clockticks
            L2 Bound: 0.0% of Clockticks
            L3 Bound: 0.0% of Clockticks
                L3 Latency: 0.0% of Clockticks
                SQ Full: 0.0% of Clockticks
            DRAM Bound: 0.0% of Clockticks
                Memory Bandwidth: 0.0% of Clockticks
                Memory Latency: 0.0% of Clockticks
                    Local Memory: 0.0% of Clockticks
                    Remote Memory: 0.0% of Clockticks
                    Remote Cache: 0.0% of Clockticks
            Store Bound: 0.0% of Clockticks
                Store Latency: 0.0% of Clockticks
                Split Stores: 0.0%
                DTLB Store Overhead: 0.0% of Clockticks
                    Store STLB Hit: 0.0% of Clockticks
                    Store STLB Miss: 0.0% of Clockticks
        Core Bound: 100.0% of Pipeline Slots
Memory Bound: 0.0% of Pipeline Slots
    Cache Bound: 0.0% of Clockticks
    DRAM Bound: 0.0% of Clockticks
    NUMA: % of Remote Accesses: 0.0%
Vectorization: 0.0% of Packed FP Operations
    Instruction Mix
        SP FLOPs: 0.0% of uOps
            Packed: 0.0% from SP FP
                128-bit: 0.0% from SP FP
                256-bit: 0.0% from SP FP
                512-bit: 0.0% from SP FP
            Scalar: 0.0% from SP FP
        DP FLOPs: 0.0% of uOps
            Packed: 0.0% from DP FP
                128-bit: 0.0% from DP FP
                256-bit: 0.0% from DP FP
                512-bit: 0.0% from DP FP
            Scalar: 0.0% from DP FP
        x87 FLOPs: 0.0% of uOps
        Non-FP: 0.0% of uOps
    FP Arith/Mem Rd Instr. Ratio: 0.000
    FP Arith/Mem Wr Instr. Ratio: 0.000
Collection and Platform Info
    Application Command Line: julia "runtests.jl" 
    Operating System: 5.15.0-130-generic DISTRIB_ID=Ubuntu DISTRIB_RELEASE=22.04 DISTRIB_CODENAME=jammy DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
    Computer Name: cobra-lambda
    Result Size: 3.7 MB 
    Collection start time: 22:19:27 31/01/2025 UTC
    Collection stop time: 22:19:29 31/01/2025 UTC
    Collector Type: Driverless Perf per-process counting
    CPU
        Name: Intel(R) Xeon(R) Processor code named Skylake
        Frequency: 3.312 GHz
        Logical CPU Count: 10
        LLC size: 17.3 MB 
        Cache Allocation Technology
            Level 2 capability: not detected
            Level 3 capability: available

Recommendations:
    Increase execution time: 
     | Application execution time is too short. Metrics data may be unreliable.
     | Consider increasing your application execution time.
    Hotspots: Start with Hotspots analysis to understand the efficiency of your algorithm.
     | Use Hotspots analysis to identify the most time consuming functions.
     | Drill down to see the time spent on every line of code.
    Threading: There is poor utilization of logical CPU cores (0.0%) in your application.
     |  Use Threading to explore more opportunities to increase parallelism in
     | your application.
    Microarchitecture Exploration: There is low microarchitecture usage (0.0%) of available hardware resources.
     | Run Microarchitecture Exploration analysis to analyze CPU
     | microarchitecture bottlenecks that can affect application performance.

If you want to skip descriptions of detected performance issues in the report,
enter: vtune -report summary -report-knob show-issues=false -r <my_result_dir>.
Alternatively, you may view the report in the csv format: vtune -report
<report_name> -format=csv.
vtune: Executing actions 100 % done     

On the other hand, if I don't run with -start-paused, I get an output with nonzero clocktick percentages:

vtune: Collection started. To stop the collection, either press CTRL-C or enter from another console window: vtune -r /home/cobra/GLNS_lazy_edge_eval.jl/r015ps -command stop.
┌ Warning: No ITT collector present. Make sure you have a collector attached to your process, e.g., by running under VTune.
└ @ Main ~/.julia/packages/IntelITT/X4ecE/src/utilities.jl:5
Test Summary: | Pass  Broken  Total  Time
IntelITT      |    9       3     12  1.0s
vtune: Collection stopped.
vtune: Using result path `/home/cobra/GLNS_lazy_edge_eval.jl/r015ps'
vtune: Executing actions 75 % Generating a report                              Elapsed Time: 1.626s
    IPC: 1.140
    SP GFLOPS: 0.000
    DP GFLOPS: 0.000
    x87 GFLOPS: 0.000
    Average CPU Frequency: 4.208 GHz
Logical Core Utilization: 11.9% (1.186 out of 10)
 | The metric value is low, which may signal a poor logical CPU cores
 | utilization. Consider improving physical core utilization as the first step
 | and then look at opportunities to utilize logical cores, which in some cases
 | can improve processor throughput and overall performance of multi-threaded
 | applications.
 |
    Physical Core Utilization: 11.9% (1.185 out of 10)
     | The metric value is low, which may signal a poor physical CPU cores
     | utilization caused by:
     |     - load imbalance
     |     - threading runtime overhead
     |     - contended synchronization
     |     - thread/process underutilization
     |     - incorrect affinity that utilizes logical cores instead of physical
     |       cores
     | Run the HPC Performance Characterization analysis to estimate the
     | efficiency of MPI and OpenMP parallelism or run the Locks and Waits
     | analysis to identify parallel bottlenecks for other parallel runtimes.
     |
Microarchitecture Usage: 24.7% of Pipeline Slots
 | You code efficiency on this platform is too low.
 | 
 | Possible cause: memory stalls, instruction starvation, branch misprediction
 | or long latency instructions.
 | 
 | Next steps: Run Microarchitecture Exploration analysis to identify the cause
 | of the low microarchitecture usage efficiency.
 |
    Retiring: 24.7% of Pipeline Slots
    Front-End Bound: 31.8% of Pipeline Slots
     | Issue: A significant portion of Pipeline Slots is remaining empty due to
     | issues in the Front-End.
     | 
     | Tips:  Make sure the code working size is not too large, the code layout
     | does not require too many memory accesses per cycle to get enough
     | instructions for filling four pipeline slots, or check for microcode
     | assists.
     |
    Bad Speculation: 14.8% of Pipeline Slots
    Back-End Bound: 28.7% of Pipeline Slots
     | A significant portion of pipeline slots are remaining empty. When
     | operations take too long in the back-end, they introduce bubbles in the
     | pipeline that ultimately cause fewer pipeline slots containing useful
     | work to be retired per cycle than the machine is capable to support. This
     | opportunity cost results in slower execution. Long-latency operations
     | like divides and memory operations can cause this, as can too many
     | operations being directed to a single execution port (for example, more
     | multiply operations arriving in the back-end per cycle than the execution
     | unit can support).
     |
        Memory Bound: 10.4% of Pipeline Slots
            L1 Bound: 8.1% of Clockticks
                DTLB Overhead: 100.0% of Clockticks
                    Load STLB Hit: 100.0% of Clockticks
                    Load STLB Miss: 1.1% of Clockticks
                Loads Blocked by Store Forwarding: 1.1% of Clockticks
                Lock Latency: 0.0% of Clockticks
                Split Loads: 0.2% of Clockticks
                4K Aliasing: 0.4% of Clockticks
                FB Full: 7.3% of Clockticks
            L2 Bound: 1.4% of Clockticks
            L3 Bound: 2.2% of Clockticks
                L3 Latency: 5.5% of Clockticks
                SQ Full: 0.2% of Clockticks
            DRAM Bound: 7.0% of Clockticks
                Memory Bandwidth: 1.5% of Clockticks
                Memory Latency: 8.7% of Clockticks
                    Local Memory: 1.4% of Clockticks
                    Remote Memory: 0.0% of Clockticks
                    Remote Cache: 0.0% of Clockticks
            Store Bound: 0.7% of Clockticks
                Store Latency: 3.8% of Clockticks
                Split Stores: 0.0%
                DTLB Store Overhead: 1.4% of Clockticks
                    Store STLB Hit: 1.2% of Clockticks
                    Store STLB Miss: 0.2% of Clockticks
        Core Bound: 18.3% of Pipeline Slots
         | This metric represents how much Core non-memory issues were of a
         | bottleneck. Shortage in hardware compute resources, or dependencies
         | software's instructions are both categorized under Core Bound. Hence
         | it may indicate the machine ran out of an OOO resources, certain
         | execution units are overloaded or dependencies in program's data- or
         | instruction- flow are limiting the performance (e.g. FP-chained long-
         | latency arithmetic operations).
         |
Memory Bound: 10.4% of Pipeline Slots
    Cache Bound: 11.7% of Clockticks
    DRAM Bound: 7.0% of Clockticks
    NUMA: % of Remote Accesses: 0.0%
Vectorization: 0.0% of Packed FP Operations
    Instruction Mix
        SP FLOPs: 0.0% of uOps
            Packed: 0.0% from SP FP
                128-bit: 0.0% from SP FP
                256-bit: 0.0% from SP FP
                512-bit: 0.0% from SP FP
            Scalar: 100.0% from SP FP
        DP FLOPs: 0.0% of uOps
            Packed: 0.0% from DP FP
                128-bit: 0.0% from DP FP
                256-bit: 0.0% from DP FP
                512-bit: 0.0% from DP FP
            Scalar: 100.0% from DP FP
        x87 FLOPs: 0.0% of uOps
        Non-FP: 100.0% of uOps
    FP Arith/Mem Rd Instr. Ratio: 0.001
    FP Arith/Mem Wr Instr. Ratio: 0.001
Collection and Platform Info
    Application Command Line: julia "runtests.jl" 
    Operating System: 5.15.0-130-generic DISTRIB_ID=Ubuntu DISTRIB_RELEASE=22.04 DISTRIB_CODENAME=jammy DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
    Computer Name: cobra-lambda
    Result Size: 3.7 MB 
    Collection start time: 22:23:26 31/01/2025 UTC
    Collection stop time: 22:23:28 31/01/2025 UTC
    Collector Type: Driverless Perf per-process counting
    CPU
        Name: Intel(R) Xeon(R) Processor code named Skylake
        Frequency: 3.312 GHz
        Logical CPU Count: 10
        LLC size: 17.3 MB 
        Cache Allocation Technology
            Level 2 capability: not detected
            Level 3 capability: available

Recommendations:
    Hotspots: Start with Hotspots analysis to understand the efficiency of your algorithm.
     | Use Hotspots analysis to identify the most time consuming functions.
     | Drill down to see the time spent on every line of code.
    Threading: There is poor utilization of logical CPU cores (11.9%) in your application.
     |  Use Threading to explore more opportunities to increase parallelism in
     | your application.
    Microarchitecture Exploration: There is low microarchitecture usage (24.7%) of available hardware resources.
     | Run Microarchitecture Exploration analysis to analyze CPU
     | microarchitecture bottlenecks that can affect application performance.

If you want to skip descriptions of detected performance issues in the report,
enter: vtune -report summary -report-knob show-issues=false -r <my_result_dir>.
Alternatively, you may view the report in the csv format: vtune -report
<report_name> -format=csv.
vtune: Executing actions 100 % done         

What are the possible issues here?

@carstenbauer
Member

Are you running under `] test`? IIRC this spawns a new Julia process, which might not be monitored. (Just guessing.)
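If that is the cause, one way to check (a sketch, assuming the standard repo layout with `test/runtests.jl`) would be to launch the script directly in the process VTune starts, instead of going through `] test`:

```shell
# Run the test script in the same Julia process that VTune launches,
# rather than via Pkg.test, which spawns a fresh Julia subprocess.
vtune -collect performance-snapshot -start-paused -- julia --project=. test/runtests.jl
```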
