No profiling data is collected if I use vtune with -start-paused option #6

Open
EpicDuckPotato opened this issue Jan 31, 2025 · 1 comment


@EpicDuckPotato

Hi,
I am trying to use the resume() and pause() functions in IntelITT.jl to demarcate a block of code I want to profile with VTune. I ran the following command in a terminal:

vtune -collect performance-snapshot -start-paused julia runtests.jl

where runtests.jl is the same script you have in your tests folder.
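For reference, this is the pattern I am trying to get working. A minimal standalone sketch (assuming IntelITT.jl is installed in the active environment; `workload` is a hypothetical stand-in for the code under test):

```julia
# Minimal sketch, assuming IntelITT.jl is available in the active environment.
# `workload` is a hypothetical placeholder for the code to be profiled.
using IntelITT

function workload(n)
    s = 0.0
    for i in 1:n
        s += sin(i)
    end
    return s
end

workload(1_000)       # warm-up call so compilation happens outside the region

IntelITT.resume()     # with -start-paused, collection should begin here
result = workload(10_000_000)
IntelITT.pause()      # collection should stop here

println(result)
```

With -start-paused, VTune should stay paused until the resume call, so only the region between resume() and pause() would be profiled.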

The output is as follows. Note that all of the clocktick percentages are zero.

vtune: Warning: Pause command is not supported for managed code profiling. Runtime overhead is still possible. Data size limit may be exceeded.
vtune: Collection paused.
vtune: Collection paused.
┌ Warning: No ITT collector present. Make sure you have a collector attached to your process, e.g., by running under VTune.
└ @ Main ~/.julia/packages/IntelITT/X4ecE/src/utilities.jl:5
Test Summary: | Pass  Broken  Total  Time
IntelITT      |    9       3     12  1.0s
vtune: Collection stopped.
vtune: Using result path `/home/cobra/GLNS_lazy_edge_eval.jl/r014ps'
vtune: Executing actions 75 % Generating a report                              Elapsed Time: 1.620s
 | Application execution time is too short. Metrics data may be unreliable.
 | Consider reducing the sampling interval or increasing your application
 | execution time.
 |
Logical Core Utilization: 0.0% (0.000 out of 10)
 | The metric value is low, which may signal a poor logical CPU cores
 | utilization. Consider improving physical core utilization as the first step
 | and then look at opportunities to utilize logical cores, which in some cases
 | can improve processor throughput and overall performance of multi-threaded
 | applications.
 |
    Physical Core Utilization: 0.0% (0.000 out of 10)
     | The metric value is low, which may signal a poor physical CPU cores
     | utilization caused by:
     |     - load imbalance
     |     - threading runtime overhead
     |     - contended synchronization
     |     - thread/process underutilization
     |     - incorrect affinity that utilizes logical cores instead of physical
     |       cores
     | Run the HPC Performance Characterization analysis to estimate the
     | efficiency of MPI and OpenMP parallelism or run the Locks and Waits
     | analysis to identify parallel bottlenecks for other parallel runtimes.
     |
Microarchitecture Usage: 0.0% of Pipeline Slots
 | You code efficiency on this platform is too low.
 | 
 | Possible cause: memory stalls, instruction starvation, branch misprediction
 | or long latency instructions.
 | 
 | Next steps: Run Microarchitecture Exploration analysis to identify the cause
 | of the low microarchitecture usage efficiency.
 |
    Retiring: 0.0% of Pipeline Slots
    Front-End Bound: 0.0% of Pipeline Slots
    Bad Speculation: 0.0% of Pipeline Slots
    Back-End Bound: 100.0% of Pipeline Slots
     | A significant portion of pipeline slots are remaining empty. When
     | operations take too long in the back-end, they introduce bubbles in the
     | pipeline that ultimately cause fewer pipeline slots containing useful
     | work to be retired per cycle than the machine is capable to support. This
     | opportunity cost results in slower execution. Long-latency operations
     | like divides and memory operations can cause this, as can too many
     | operations being directed to a single execution port (for example, more
     | multiply operations arriving in the back-end per cycle than the execution
     | unit can support).
     |
        Memory Bound: 0.0% of Pipeline Slots
            L1 Bound: 0.0% of Clockticks
                DTLB Overhead: 0.0% of Clockticks
                    Load STLB Hit: 0.0% of Clockticks
                    Load STLB Miss: 0.0% of Clockticks
                Loads Blocked by Store Forwarding: 0.0% of Clockticks
                Lock Latency: 0.0% of Clockticks
                Split Loads: 0.0% of Clockticks
                4K Aliasing: 0.0% of Clockticks
                FB Full: 0.0% of Clockticks
            L2 Bound: 0.0% of Clockticks
            L3 Bound: 0.0% of Clockticks
                L3 Latency: 0.0% of Clockticks
                SQ Full: 0.0% of Clockticks
            DRAM Bound: 0.0% of Clockticks
                Memory Bandwidth: 0.0% of Clockticks
                Memory Latency: 0.0% of Clockticks
                    Local Memory: 0.0% of Clockticks
                    Remote Memory: 0.0% of Clockticks
                    Remote Cache: 0.0% of Clockticks
            Store Bound: 0.0% of Clockticks
                Store Latency: 0.0% of Clockticks
                Split Stores: 0.0%
                DTLB Store Overhead: 0.0% of Clockticks
                    Store STLB Hit: 0.0% of Clockticks
                    Store STLB Miss: 0.0% of Clockticks
        Core Bound: 100.0% of Pipeline Slots
Memory Bound: 0.0% of Pipeline Slots
    Cache Bound: 0.0% of Clockticks
    DRAM Bound: 0.0% of Clockticks
    NUMA: % of Remote Accesses: 0.0%
Vectorization: 0.0% of Packed FP Operations
    Instruction Mix
        SP FLOPs: 0.0% of uOps
            Packed: 0.0% from SP FP
                128-bit: 0.0% from SP FP
                256-bit: 0.0% from SP FP
                512-bit: 0.0% from SP FP
            Scalar: 0.0% from SP FP
        DP FLOPs: 0.0% of uOps
            Packed: 0.0% from DP FP
                128-bit: 0.0% from DP FP
                256-bit: 0.0% from DP FP
                512-bit: 0.0% from DP FP
            Scalar: 0.0% from DP FP
        x87 FLOPs: 0.0% of uOps
        Non-FP: 0.0% of uOps
    FP Arith/Mem Rd Instr. Ratio: 0.000
    FP Arith/Mem Wr Instr. Ratio: 0.000
Collection and Platform Info
    Application Command Line: julia "runtests.jl" 
    Operating System: 5.15.0-130-generic DISTRIB_ID=Ubuntu DISTRIB_RELEASE=22.04 DISTRIB_CODENAME=jammy DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
    Computer Name: cobra-lambda
    Result Size: 3.7 MB 
    Collection start time: 22:19:27 31/01/2025 UTC
    Collection stop time: 22:19:29 31/01/2025 UTC
    Collector Type: Driverless Perf per-process counting
    CPU
        Name: Intel(R) Xeon(R) Processor code named Skylake
        Frequency: 3.312 GHz
        Logical CPU Count: 10
        LLC size: 17.3 MB 
        Cache Allocation Technology
            Level 2 capability: not detected
            Level 3 capability: available

Recommendations:
    Increase execution time: 
     | Application execution time is too short. Metrics data may be unreliable.
     | Consider increasing your application execution time.
    Hotspots: Start with Hotspots analysis to understand the efficiency of your algorithm.
     | Use Hotspots analysis to identify the most time consuming functions.
     | Drill down to see the time spent on every line of code.
    Threading: There is poor utilization of logical CPU cores (0.0%) in your application.
     |  Use Threading to explore more opportunities to increase parallelism in
     | your application.
    Microarchitecture Exploration: There is low microarchitecture usage (0.0%) of available hardware resources.
     | Run Microarchitecture Exploration analysis to analyze CPU
     | microarchitecture bottlenecks that can affect application performance.

If you want to skip descriptions of detected performance issues in the report,
enter: vtune -report summary -report-knob show-issues=false -r <my_result_dir>.
Alternatively, you may view the report in the csv format: vtune -report
<report_name> -format=csv.
vtune: Executing actions 100 % done     

On the other hand, if I don't run with -start-paused, I get an output with nonzero clocktick percentages:

vtune: Collection started. To stop the collection, either press CTRL-C or enter from another console window: vtune -r /home/cobra/GLNS_lazy_edge_eval.jl/r015ps -command stop.
┌ Warning: No ITT collector present. Make sure you have a collector attached to your process, e.g., by running under VTune.
└ @ Main ~/.julia/packages/IntelITT/X4ecE/src/utilities.jl:5
Test Summary: | Pass  Broken  Total  Time
IntelITT      |    9       3     12  1.0s
vtune: Collection stopped.
vtune: Using result path `/home/cobra/GLNS_lazy_edge_eval.jl/r015ps'
vtune: Executing actions 75 % Generating a report                              Elapsed Time: 1.626s
    IPC: 1.140
    SP GFLOPS: 0.000
    DP GFLOPS: 0.000
    x87 GFLOPS: 0.000
    Average CPU Frequency: 4.208 GHz
Logical Core Utilization: 11.9% (1.186 out of 10)
 | The metric value is low, which may signal a poor logical CPU cores
 | utilization. Consider improving physical core utilization as the first step
 | and then look at opportunities to utilize logical cores, which in some cases
 | can improve processor throughput and overall performance of multi-threaded
 | applications.
 |
    Physical Core Utilization: 11.9% (1.185 out of 10)
     | The metric value is low, which may signal a poor physical CPU cores
     | utilization caused by:
     |     - load imbalance
     |     - threading runtime overhead
     |     - contended synchronization
     |     - thread/process underutilization
     |     - incorrect affinity that utilizes logical cores instead of physical
     |       cores
     | Run the HPC Performance Characterization analysis to estimate the
     | efficiency of MPI and OpenMP parallelism or run the Locks and Waits
     | analysis to identify parallel bottlenecks for other parallel runtimes.
     |
Microarchitecture Usage: 24.7% of Pipeline Slots
 | You code efficiency on this platform is too low.
 | 
 | Possible cause: memory stalls, instruction starvation, branch misprediction
 | or long latency instructions.
 | 
 | Next steps: Run Microarchitecture Exploration analysis to identify the cause
 | of the low microarchitecture usage efficiency.
 |
    Retiring: 24.7% of Pipeline Slots
    Front-End Bound: 31.8% of Pipeline Slots
     | Issue: A significant portion of Pipeline Slots is remaining empty due to
     | issues in the Front-End.
     | 
     | Tips:  Make sure the code working size is not too large, the code layout
     | does not require too many memory accesses per cycle to get enough
     | instructions for filling four pipeline slots, or check for microcode
     | assists.
     |
    Bad Speculation: 14.8% of Pipeline Slots
    Back-End Bound: 28.7% of Pipeline Slots
     | A significant portion of pipeline slots are remaining empty. When
     | operations take too long in the back-end, they introduce bubbles in the
     | pipeline that ultimately cause fewer pipeline slots containing useful
     | work to be retired per cycle than the machine is capable to support. This
     | opportunity cost results in slower execution. Long-latency operations
     | like divides and memory operations can cause this, as can too many
     | operations being directed to a single execution port (for example, more
     | multiply operations arriving in the back-end per cycle than the execution
     | unit can support).
     |
        Memory Bound: 10.4% of Pipeline Slots
            L1 Bound: 8.1% of Clockticks
                DTLB Overhead: 100.0% of Clockticks
                    Load STLB Hit: 100.0% of Clockticks
                    Load STLB Miss: 1.1% of Clockticks
                Loads Blocked by Store Forwarding: 1.1% of Clockticks
                Lock Latency: 0.0% of Clockticks
                Split Loads: 0.2% of Clockticks
                4K Aliasing: 0.4% of Clockticks
                FB Full: 7.3% of Clockticks
            L2 Bound: 1.4% of Clockticks
            L3 Bound: 2.2% of Clockticks
                L3 Latency: 5.5% of Clockticks
                SQ Full: 0.2% of Clockticks
            DRAM Bound: 7.0% of Clockticks
                Memory Bandwidth: 1.5% of Clockticks
                Memory Latency: 8.7% of Clockticks
                    Local Memory: 1.4% of Clockticks
                    Remote Memory: 0.0% of Clockticks
                    Remote Cache: 0.0% of Clockticks
            Store Bound: 0.7% of Clockticks
                Store Latency: 3.8% of Clockticks
                Split Stores: 0.0%
                DTLB Store Overhead: 1.4% of Clockticks
                    Store STLB Hit: 1.2% of Clockticks
                    Store STLB Miss: 0.2% of Clockticks
        Core Bound: 18.3% of Pipeline Slots
         | This metric represents how much Core non-memory issues were of a
         | bottleneck. Shortage in hardware compute resources, or dependencies
         | software's instructions are both categorized under Core Bound. Hence
         | it may indicate the machine ran out of an OOO resources, certain
         | execution units are overloaded or dependencies in program's data- or
         | instruction- flow are limiting the performance (e.g. FP-chained long-
         | latency arithmetic operations).
         |
Memory Bound: 10.4% of Pipeline Slots
    Cache Bound: 11.7% of Clockticks
    DRAM Bound: 7.0% of Clockticks
    NUMA: % of Remote Accesses: 0.0%
Vectorization: 0.0% of Packed FP Operations
    Instruction Mix
        SP FLOPs: 0.0% of uOps
            Packed: 0.0% from SP FP
                128-bit: 0.0% from SP FP
                256-bit: 0.0% from SP FP
                512-bit: 0.0% from SP FP
            Scalar: 100.0% from SP FP
        DP FLOPs: 0.0% of uOps
            Packed: 0.0% from DP FP
                128-bit: 0.0% from DP FP
                256-bit: 0.0% from DP FP
                512-bit: 0.0% from DP FP
            Scalar: 100.0% from DP FP
        x87 FLOPs: 0.0% of uOps
        Non-FP: 100.0% of uOps
    FP Arith/Mem Rd Instr. Ratio: 0.001
    FP Arith/Mem Wr Instr. Ratio: 0.001
Collection and Platform Info
    Application Command Line: julia "runtests.jl" 
    Operating System: 5.15.0-130-generic DISTRIB_ID=Ubuntu DISTRIB_RELEASE=22.04 DISTRIB_CODENAME=jammy DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
    Computer Name: cobra-lambda
    Result Size: 3.7 MB 
    Collection start time: 22:23:26 31/01/2025 UTC
    Collection stop time: 22:23:28 31/01/2025 UTC
    Collector Type: Driverless Perf per-process counting
    CPU
        Name: Intel(R) Xeon(R) Processor code named Skylake
        Frequency: 3.312 GHz
        Logical CPU Count: 10
        LLC size: 17.3 MB 
        Cache Allocation Technology
            Level 2 capability: not detected
            Level 3 capability: available

Recommendations:
    Hotspots: Start with Hotspots analysis to understand the efficiency of your algorithm.
     | Use Hotspots analysis to identify the most time consuming functions.
     | Drill down to see the time spent on every line of code.
    Threading: There is poor utilization of logical CPU cores (11.9%) in your application.
     |  Use Threading to explore more opportunities to increase parallelism in
     | your application.
    Microarchitecture Exploration: There is low microarchitecture usage (24.7%) of available hardware resources.
     | Run Microarchitecture Exploration analysis to analyze CPU
     | microarchitecture bottlenecks that can affect application performance.

If you want to skip descriptions of detected performance issues in the report,
enter: vtune -report summary -report-knob show-issues=false -r <my_result_dir>.
Alternatively, you may view the report in the csv format: vtune -report
<report_name> -format=csv.
vtune: Executing actions 100 % done         

What are the possible issues here?

@carstenbauer
Member

Are you running under `] test`? IIRC this spawns a new Julia process, which might not be monitored. (Just guessing.)
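If that is the cause, one way to check (a sketch, assuming the standard repo layout with `test/runtests.jl`) would be to launch the script directly in the process VTune starts, instead of going through `] test`:

```shell
# Run the test script in the same Julia process that VTune launches,
# rather than via Pkg.test, which spawns a fresh Julia subprocess.
vtune -collect performance-snapshot -start-paused -- julia --project=. test/runtests.jl
```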
