@csarofeen
Collaborator

Extracted the precompiled header part of #5747 to see what it would do in isolation.

Precompiled Header (PCH) Build Optimization for nvFuser

What It Does

Precompiled Headers (PCH) pre-parse frequently-included header files once and cache the result, eliminating redundant parsing across hundreds of source files.

This branch adds:

  • PCH for 10 key nvFuser headers (polymorphic_value.h, type_traits.h, ir/base_nodes.h, etc.)
  • Shared PCH across 20+ test targets (prevents redundant PCH compilation)

Build Time Results

| Compiler | Baseline | With PCH | Wall-clock Improvement |
|----------|----------|----------|------------------------|
| GCC      | 20m 51s  | 17m 6s   | 18% faster             |
| Clang    | 20m 43s  | 8m 48s   | 57% faster             |

CPU Time Results

| Compiler | Baseline | With PCH | CPU Time Reduction |
|----------|----------|----------|--------------------|
| GCC      | 231 min  | 185 min  | 20% less work      |
| Clang    | 232 min  | 97 min   | 58% less work      |
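
The percentages in both tables read as relative reductions, (baseline - with PCH) / baseline. The sketch below reproduces them from the reported figures. The helper names `percentReduction`, `toSeconds`, and `within` are illustrative only and are not part of nvFuser. The results agree with the reported values to within rounding.

```cpp
#include <cassert>
#include <cmath>

// Relative reduction in percent: how much smaller `with_pch` is than `baseline`.
constexpr double percentReduction(double baseline, double with_pch) {
  return 100.0 * (baseline - with_pch) / baseline;
}

// Convert a "Xm Ys" wall-clock reading into seconds.
constexpr double toSeconds(int minutes, int seconds) {
  return minutes * 60.0 + seconds;
}

// Tolerance comparison, since the reported percentages are rounded.
inline bool within(double value, double expected, double tolerance) {
  return std::fabs(value - expected) <= tolerance;
}

// Wall-clock figures from the first table.
constexpr double kGccWall =
    percentReduction(toSeconds(20, 51), toSeconds(17, 6));   // about 18.0
constexpr double kClangWall =
    percentReduction(toSeconds(20, 43), toSeconds(8, 48));   // about 57.5

// CPU-time figures (minutes) from the second table.
constexpr double kGccCpu = percentReduction(231.0, 185.0);    // about 19.9
constexpr double kClangCpu = percentReduction(232.0, 97.0);   // about 58.2
```

Expressed as a speedup factor rather than a time reduction, the Clang wall-clock result is roughly 2.35x (1243 s / 528 s) and the GCC result roughly 1.22x (1251 s / 1026 s).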

Key Takeaway

PCH is a low-risk, high-impact optimization that can be merged independently. Clang users see the largest benefit (57% faster builds), while GCC users still gain a meaningful 18% improvement.

Precompile polymorphic_value.h to eliminate ~4000s of redundant header parsing. Enabled by default for Release builds. Disable with -DNVFUSER_USE_POLYMORPHIC_PCH=OFF.
@csarofeen
Collaborator Author

!test

@csarofeen
Collaborator Author

CC @jacobhinkle

@github-actions

Description

  • Add Precompiled Headers (PCH) for the top 10 nvFuser headers, ranked by parse time, to reduce build time

  • Enable shared PCH across test targets to prevent redundant compilation

  • Achieve 18% faster builds for GCC and 57% for Clang users

  • Fix Clang unused-private-field warnings with [[maybe_unused]] attributes

Changes walkthrough

Relevant files
Bug fix
symmetric_tensor.h
Fix Clang unused private field warnings                                   

csrc/multidevice/symmetric_tensor.h

  • Add [[maybe_unused]] attributes to private fields mcast_handle_,
    cu_dev_, mc_base_ptr_, exporter_rank_, peer_fd_ (mc_ptr_ is left
    unannotated)
  • Silence Clang unused-private-field warnings for these members
  • +5/-5

Enhancement
CMakeLists.txt
Add PCH optimization for build speed

CMakeLists.txt

  • Add PCH configuration for top 10 nvFuser headers by parse time
    (polymorphic_value.h, type_traits.h, etc.)
  • Enable PCH by default for Release builds with
    NVFUSER_USE_POLYMORPHIC_PCH option
  • Implement shared PCH across test targets to avoid redundant
    compilation
  • First test target creates PCH, subsequent targets reuse via REUSE_FROM
  • +57/-0

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Missing Validation

The [[maybe_unused]] attributes were added to member variables (mcast_handle_, cu_dev_, mc_base_ptr_, exporter_rank_, peer_fd_) but there's no explanation or validation that these changes don't affect functionality. These appear to be CUDA-related variables whose usage should be verified.

    [[maybe_unused]] CUmemGenericAllocationHandle mcast_handle_{};
    [[maybe_unused]] CUdevice cu_dev_{};
    void* mc_ptr_{nullptr};
    [[maybe_unused]] CUdeviceptr mc_base_ptr_{0};
    [[maybe_unused]] int exporter_rank_{-1};
    [[maybe_unused]] int peer_fd_{-1};
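
For context on the concern above, here is a minimal sketch of the pattern, assuming a hypothetical class `HandleHolder` that is not part of nvFuser. Clang's -Wunused-private-field reports a private member that no member function references. The C++17 `[[maybe_unused]]` attribute suppresses that diagnostic and leaves the member's default initializer and the class's behavior unchanged.

```cpp
#include <cassert>

// Hypothetical class, not taken from nvFuser: it mirrors the pattern of
// private fields that are initialized but not yet read anywhere.
class HandleHolder {
 public:
  explicit HandleHolder(void* ptr) : ptr_(ptr) {}
  void* ptr() const { return ptr_; }

 private:
  void* ptr_{nullptr};  // read by ptr(), so no attribute is needed

  // Never referenced by any member function. Without the attribute, Clang's
  // -Wunused-private-field reports these; the default initializers still run.
  [[maybe_unused]] int exporter_rank_{-1};
  [[maybe_unused]] int peer_fd_{-1};
};

// Returns true when the accessor still sees the pointer passed at construction.
inline bool holderRoundTrips() {
  int value = 42;
  HandleHolder holder(&value);
  return holder.ptr() == &value;
}
```

Because the attribute only affects diagnostics, the remaining review question is whether the annotated fields are meant to be used later or can be removed.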

PCH Configuration Robustness

The PCH implementation tracks the shared test PCH target through a global property and assumes the listed headers are available everywhere. Reviewers should verify that all 10 specified header files exist and are accessible in all build environments, and that the global property mechanism works correctly across different build configurations.

    if(NVFUSER_USE_POLYMORPHIC_PCH)
      get_property(NVFUSER_TEST_PCH_TARGET GLOBAL PROPERTY NVFUSER_TEST_PCH_TARGET)
      if(NOT NVFUSER_TEST_PCH_TARGET)
        # First test target: create the PCH with top 10 nvFuser headers
        message(STATUS "Creating shared test PCH on target: ${TEST_NAME}")
        target_precompile_headers(${TEST_NAME} PRIVATE
          "${NVFUSER_SRCS_DIR}/polymorphic_value.h"
          "${NVFUSER_ROOT}/lib/dynamic_type/src/dynamic_type/type_traits.h"
          "${NVFUSER_SRCS_DIR}/ir/base_nodes.h"
          "${NVFUSER_SRCS_DIR}/scheduler/tools/abstract_tensor.h"
          "${NVFUSER_SRCS_DIR}/type.h"
          "${NVFUSER_SRCS_DIR}/ir/container.h"
          "${NVFUSER_SRCS_DIR}/serde/fusion_cache_generated.h"
          "${NVFUSER_SRCS_DIR}/iter_visitor.h"
          "${NVFUSER_SRCS_DIR}/ir/internal_nodes.h"
          "${NVFUSER_SRCS_DIR}/ir/interface_nodes.h"
        )
        set_property(GLOBAL PROPERTY NVFUSER_TEST_PCH_TARGET ${TEST_NAME})
      else()
        # Subsequent test targets: reuse existing PCH
        target_precompile_headers(${TEST_NAME} REUSE_FROM ${NVFUSER_TEST_PCH_TARGET})
      endif()
    endif()
    
