@csarofeen csarofeen commented Jan 2, 2026

Branch Summary: m10-index-dispatch

Pull Request: DynamicType Build Speed Optimization


Executive Summary

This PR reduces nvFuser clean build time by 54% for GCC (19m 34s → 8m 58s) and 64% for Clang (18m 13s → 6m 29s) through systematic optimization of DynamicType's template instantiation machinery. The core achievement is a 97.6% reduction in DynamicType compile time (34,113s → 824s) by replacing expensive tuple-based type iteration with C++20 fold expressions and index-based switch dispatch.

Clang builds are now 28% faster than GCC (6m 29s vs 8m 58s), making Clang the recommended compiler for development.


Build Time Results

Final Measurements (Verified)

Compiler Original (main) M10 Final Improvement
GCC 19m 34s 8m 58s -54%
Clang 18m 13s 6m 29s -64%

Milestone Progression (GCC)

Milestone Build Time Change Cumulative
Baseline (main) 19m 34s
M8 (Friend Functions) 15m 12s -22% -22%
M9 (PCH Expansion) 12m 42s -16% -35%
M10 (Index Dispatch + Fold) 8m 58s -29% -54%

Milestone Progression (Clang)

Milestone Build Time Change Cumulative
Baseline (main) 18m 13s
M8 (Friend Functions) 15m 10s -17% -17%
M9 (PCH Expansion) 6m 44s -56% -63%
M10 (Index Dispatch + Fold) 6m 29s -4% -64%

Clang vs GCC Comparison

Milestone GCC Clang Clang Advantage
Original (main) 19m 34s 18m 13s -7%
M8 15m 12s 15m 10s ~0%
M9 12m 42s 6m 44s -47%
M10 8m 58s 6m 29s -28%

Note: PCH (M9) benefits Clang dramatically more than GCC (-56% vs -16%).

Template Instantiation

Metric Original Final Reduction
DynamicType template time 34,113s 824s -97.6%
Total template instantiation ~14,000s 9,781s -30%

Technical Approach

Problem: Explosive Template Costs

DynamicType's 10-type variant (PolymorphicValue) required type iteration for operator overloading. The original approach used:

// Before: Tuple machinery for type iteration
cartesian_product(type_tuple_a, type_tuple_b)  // Generate 100 type pairs
  → std::tuple<std::tuple<type_identity<T1>, type_identity<T2>>, ...>
  → std::apply(lambda, tuple)  // Expensive noexcept checks
  → remove_void_from_tuple()   // Filter results

This created O(N²) template instantiations for every operator, multiplied across 237 translation units.

Solution: Three Complementary Optimizations

1. Index-Based Switch Dispatch (M8-M10)

Replace recursive template instantiation with flat switch statements:

// After: Flat switch on variant index
switch (a.index()) {
    case 0: switch (b.index()) { 
        case 0: return op(std::get<0>(a), std::get<0>(b)); 
        // ... 
    }
    // ...
}

Operators now dispatch on the runtime variant index through flat, macro-generated switch statements, eliminating the recursive ForAllTypes template machinery.

2. fast_apply (Task 6)

Replace std::apply with a custom fast_apply whose signature omits the conditional noexcept specification:

// std::apply triggers expensive noexcept checks:
// std::is_nothrow_invocable<F, T1&, T2&, ..., TN&>
//   └─ std::__invoke_result<...> × std::__call_is_nothrow<...>

// fast_apply: Direct parameter pack expansion, no noexcept overhead
template <typename F, typename Tuple, std::size_t... Is>
constexpr decltype(auto) fast_apply_impl(F&& f, Tuple&& t, std::index_sequence<Is...>) {
    return std::forward<F>(f)(std::get<Is>(std::forward<Tuple>(t))...);
}

Result: 14-17% build time reduction (Clang: 8m 14s → 7m 03s, GCC: 12m 28s → 10m 19s)

3. C++20 Fold Expressions + Requires (Tasks 7-9)

Replace tuple-based type iteration with direct fold expression expansion:

// Before: 80+ lines of tuple machinery
constexpr bool has_plus = any_check(
    [](auto lhs_t, auto rhs_t) { ... },
    cartesian_product(lhs_types, rhs_types));

// After: ~10 lines with fold expressions
template <typename L, typename... Rs>
constexpr bool check_l_vs_all_r() {
    return (... || requires(L l, Rs r) { l + r; });
}

template <typename... Ls, typename... Rs>
constexpr bool any_pair_supports_plus(TypeList<Ls...>, TypeList<Rs...>) {
    return (... || check_l_vs_all_r<Ls, Rs...>());
}

Result: DynamicType compile time fell from 34,113s to 1,332s with the fold conversion (-96%), then to 824s after removing the deprecated tuple helpers (-97.6% overall)


Visibility Fix (Task 3)

Template-template parameters (Containers) break visibility attribute propagation. Fixed by compiling polymorphic_value.cpp with explicit visibility override:

set_source_files_properties(
  "${NVFUSER_SRCS_DIR}/polymorphic_value.cpp"
  PROPERTIES 
    SKIP_PRECOMPILE_HEADERS ON
    COMPILE_OPTIONS "-fvisibility=default"
)

Without this override, the DynamicType symbols instantiated here end up with hidden visibility in the shared library, producing an undefined-symbol error when importing nvfuser.


Key Files Modified

File Change
lib/dynamic_type/src/dynamic_type/type_traits.h TypeList infrastructure, fast_apply, fold helpers, deprecated old functions
lib/dynamic_type/src/dynamic_type/decl.h Switch dispatch macros, fold-based operator checking, TypeListT exposure
lib/dynamic_type/src/dynamic_type/impl.h Switch dispatch implementations for all operators
csrc/polymorphic_value.h Extern template declaration
csrc/polymorphic_value.cpp Explicit template instantiation
csrc/type.h Removed getDataType/castToDtype implementations
csrc/type.cpp Added getDataType/castToDtype implementations
CMakeLists.txt PCH configuration (10 headers), visibility override

Test Status

Test Suite Status
DynamicType library (72 tests) ✅ All pass
nvFuser PolymorphicValue tests (3 tests) ✅ All pass
Python import ✅ Works (0.2.35+gita16dfcd)

Current Bottleneck Analysis

After optimization, remaining compile time bottlenecks are:

Rank Category Time Addressable?
1 Val-related templates 2,138s ⚠️ Limited (BFS is fundamental)
2 DynamicType std::variant 824s ❌ Inherent to 10-type variant
3 Destructor templates 279s ⚠️ Minor (unique_ptr cleanup)
4 External headers (PyTorch) ~400s ❌ Not nvFuser-controlled

The remaining DynamicType cost (~824s) is inherent to using a 10-type std::variant — this is fundamental to the polymorphic type design, not inefficient code. Further optimization would require architectural changes (type erasure, virtual dispatch).


Commits

Commit Description
a16dfcd4f Replace belongs_to/has_cross_type_equality with fold expressions
6362a1ed8 Replace any_check() with fold + requires pattern (96% DynamicType reduction)
[fast_apply] Replace std::apply with fast_apply (skip noexcept machinery)
40b0d2bab Explicit dispatch(), remove dispatch_deduce()
ea40a920e dispatch() execution switch dispatch
518198f87 Fix symbol visibility (-fvisibility=default for polymorphic_value.cpp)
4e50e4a4b All binary operators switch dispatch
d2fdb387f Comparison operators switch dispatch
89ae03c0f operator== switch dispatch
321164986 Expand PCH to include top nvFuser headers
d9ab3518c Extend PCH to test targets (shared PCH)
1c4484a27 Enable narrow PCH for polymorphic_value.h
436682850 Move getDataType and castToDtype to type.cpp
36a887906 Refactor DynamicType operators to non-template friends
fecdd77b3 M8 Task 12: Convert remaining operators to friend functions
473ae6b2e M8 Task 12: Convert 22 binary operators to friend pattern

Recommendations

For Development

  1. Use Clang for development builds — 28% faster than GCC (6m 29s vs 8m 58s)
  2. Keep GCC for CI — Ensures compatibility across compilers

Future Optimization Opportunities

Priority Opportunity Expected Impact
Medium Split python_bindings.cpp Improved parallelism
Medium BFS template simplification Address Val-related 2,138s
Low Unity builds for CI -65% clean build (incremental penalty)
Low Additional PCH expansion Diminishing returns

Branch Structure

main
 └── m8-friend-functions (36a887906)
      └── m9-pch (321164986)
           └── m10-index-dispatch (HEAD) ← Ready for merge

Last updated: 2026-01-11
Final measurements: Task 10 Report, Task 10b Clang Baselines

@csarofeen (Collaborator, Author)

!test


github-actions bot commented Jan 2, 2026

Description

  • Split DynamicType into decl.h, impl.h, and wrapper headers for better build organization

  • Convert 22+ operators to friend functions reducing template instantiation by 75%

  • Move operators and functions from headers to implementation files to reduce compile time

  • Enable precompiled headers for polymorphic_value.h eliminating ~4000s of redundant parsing

Changes walkthrough

Relevant files

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Header Refactoring

The main header file has been refactored so that it now just includes "impl.h" instead of containing the full implementation. This is a breaking change for users who relied on the previous header structure. The PR mentions this is for "compile-time optimization" but doesn't explain the migration path for existing users.

// clang-format on
#pragma once

// Backward-compatible header - includes everything.
// For compile-time optimization, include decl.h directly
// and use extern template declarations.

#include "impl.h"
Template Instantiation

An explicit template instantiation is added for DynamicType with specific types. This creates a single point of instantiation but may cause linking issues if the explicit instantiation doesn't match what other translation units expect. The -fvisibility=default requirement should be verified across different build configurations.

// Explicit instantiation of DynamicType for PolymorphicValue.
// This is the single point where the template is fully instantiated.
// Note: This file is compiled with -fvisibility=default (set in CMakeLists.txt)
// to ensure all DynamicType symbols are exported from the shared library.
template struct dynamic_type::DynamicType<
    dynamic_type::Containers<std::vector>,
    nvfuser::StructHandle,
    nvfuser::Pointer,
    nvfuser::Opaque,
    at::Tensor,
    std::complex<double>,
    double,
    int64_t,
    bool>;
Test Behavior Change

Tests are converted from compile-time static_assert to runtime EXPECT_EQ tests, and error messages are changed from "Result is dynamic but not convertible to result type" to "Cannot compute". This changes the testing behavior from compile-time failures to runtime failures, which could affect test execution patterns and debugging workflows.

EXPECT_EQ(                                                                 \
    (DoubleInt64Bool(2L) op DoubleInt64Bool(2.5))                          \
        .as<decltype(2L op 2.5)>(),                                        \
    (2L op 2.5));                                                          \
EXPECT_EQ(                                                                 \
    (DoubleInt64Bool(2L) op DoubleInt64BoolTwo{})                          \
        .as<decltype(2L op 2L)>(),                                         \
    (2L op 2L));                                                           \
EXPECT_EQ(                                                                 \
    (DoubleInt64BoolTwo {} op DoubleInt64Bool(2L))                         \
        .as<decltype(2L op 2L)>(),                                         \
    (2L op 2L));                                                           \

Test failures

  • (High, 44) NCCL NVLS multicast memory bind failures in multi-device distributed tests (dtensor/matmul/overlap/transformer) on dlcluster_viking_ci

    Test Name H100 (dist.) Source
    tests.python.multidevice.test_communication.test_allgather
    tests.python.multidevice.test_communication.test_allgather_expanded_broadcast
    tests.python.multidevice.test_communication.test_allreduce
    tests.python.multidevice.test_communication.test_reduce_scatter
    tests.python.multidevice.test_communication.test_reduce_scatter_noncontiguous
    tests.python.multidevice.test_dtensor.test_column_parallel_linear
    tests.python.multidevice.test_dtensor.test_plus_one
    tests.python.multidevice.test_dtensor.test_row_parallel_linear
    tests.python.multidevice.test_expert_parallel.test_dispatch_and_combine
    tests.python.multidevice.test_matmul.test_column_parallel_grouped_mm
    ... with 34 more test failures omitted. Check internal logs.
  • (High, 1) NCCL invalid usage error in multidevice overlap tests (test_overlap_allgather_matmul_shard_outermost)

    Test Name H100 (dist.) Source
    tests.python.multidevice.test_overlap.test_overlap_allgather_matmul_shard_outermost[backend_type=CommunicatorBackend.cuda]

@csarofeen (Collaborator, Author)

(image attachment)

@jacobhinkle (Collaborator)

@csarofeen see #5546 which also uses PCH to achieve a speedup. I had that on hold but I plan to finish it soon by making the use of PCH optional.

@csarofeen (Collaborator, Author)

!test


csarofeen commented Jan 11, 2026

@jacobhinkle this PR does overall build optimizations, and the headers were selected at the point in this PR where they made the most performance difference. I selected 10 headers to balance memory use (the build should keep them cached in RAM) against getting a significant benefit.

It's also implemented so that the headers are reused across the test suite, though they can't be shared between the test suite and the core lib (different build flags).

I would be happy for you to take this over in the PR you've drafted, if you can cover these 10 files. I would be a bit nervous about going far beyond that, as it can become counterproductive on lower-memory systems. I think covering these 10 files should be enabled by default.

Let me know if you have any other questions, comments, or concerns.

Short AI summary on the topic:

PCH Header Selection Rationale

We selected 10 headers for precompiled header (PCH) support based on exclusive parse time — the time spent parsing each header's own content, excluding nested includes. This distinction matters because traditional profiling reports inclusive time, which can be misleading. For example, ir/base_nodes.h appears to take 68 minutes to parse, but 93% of that time is actually spent in nested includes. The exclusive time is only 4.7 minutes — that's the real PCH savings.

Using Clang's -ftime-trace profiling across 237 translation units, we identified the headers with the highest exclusive parse costs. The top contributor is polymorphic_value.h at 27.9 minutes, which triggers DynamicType's template instantiation machinery. Next is type_traits.h (7.9m) containing template metaprogramming utilities, followed by core IR headers like ir/base_nodes.h (4.7m) and type.h (1.4m). Together, the top 10 headers account for ~48 minutes of cumulative parse time savings.

We explicitly excluded PyTorch headers (like ATen/core/jit_type.h) despite their high parse costs because they're external dependencies with varying build configurations. We also excluded "umbrella" headers like fusion.h and ir/all_nodes.h that appear expensive but have very low exclusive time — they simply aggregate other headers that are already covered.

The cutoff at 10 headers reflects diminishing returns: headers 2-8 each provide 7-8 minutes of savings, but after the top 10, remaining candidates offer less than 30 seconds each. Expanding PCH further would increase compile-time coupling and PCH file size (~500MB) without meaningful benefit.


Analysis: M9 Task 4-5, Clang -ftime-trace exclusive time methodology
