@csarofeen csarofeen commented Jan 2, 2026

Branch Summary: m10-index-dispatch

Pull Request: DynamicType Build Speed Optimization


Executive Summary

This PR reduces nvFuser clean build time by 54% for GCC (19m 34s → 8m 58s) and 64% for Clang (18m 13s → 6m 29s) through systematic optimization of DynamicType's template instantiation machinery. The core achievement is a 97.6% reduction in DynamicType compile time (34,113s → 824s) by replacing expensive tuple-based type iteration with C++20 fold expressions and index-based switch dispatch.

Clang builds are now 28% faster than GCC (6m 29s vs 8m 58s), making Clang the recommended compiler for development.


Build Time Results

Final Measurements (Verified)

Compiler Original (main) M10 Final Improvement
GCC 19m 34s 8m 58s -54%
Clang 18m 13s 6m 29s -64%

Milestone Progression (GCC)

Milestone Build Time Change Cumulative
Baseline (main) 19m 34s
M8 (Friend Functions) 15m 12s -22% -22%
M9 (PCH Expansion) 12m 42s -16% -35%
M10 (Index Dispatch + Fold) 8m 58s -29% -54%

Milestone Progression (Clang)

Milestone Build Time Change Cumulative
Baseline (main) 18m 13s
M8 (Friend Functions) 15m 10s -17% -17%
M9 (PCH Expansion) 6m 44s -56% -63%
M10 (Index Dispatch + Fold) 6m 29s -4% -64%

Clang vs GCC Comparison

Milestone GCC Clang Clang Advantage
Original (main) 19m 34s 18m 13s -7%
M8 15m 12s 15m 10s ~0%
M9 12m 42s 6m 44s -47%
M10 8m 58s 6m 29s -28%

Note: PCH (M9) benefits Clang dramatically more than GCC (-56% vs -16%).

Template Instantiation

Metric Original Final Reduction
DynamicType template time 34,113s 824s -97.6%
Total template instantiation ~14,000s 9,781s -30%

Technical Approach

Problem: Explosive Template Costs

DynamicType's 10-type variant (PolymorphicValue) required type iteration for operator overloading. The original approach used:

// Before: Tuple machinery for type iteration
cartesian_product(type_tuple_a, type_tuple_b)  // Generate 100 type pairs
  → std::tuple<std::tuple<type_identity<T1>, type_identity<T2>>, ...>
  → std::apply(lambda, tuple)  // Expensive noexcept checks
  → remove_void_from_tuple()   // Filter results

This created O(N²) template instantiations for every operator, multiplied across 237 translation units.

Solution: Three Complementary Optimizations

1. Index-Based Switch Dispatch (M8-M10)

Replace recursive template instantiation with flat switch statements:

// After: Flat switch on variant index
switch (a.index()) {
    case 0: switch (b.index()) { 
        case 0: return op(std::get<0>(a), std::get<0>(b)); 
        // ... 
    }
    // ...
}

Operators now dispatch on the runtime variant index through flat, macro-generated switch statements, eliminating the recursive ForAllTypes template machinery.

2. fast_apply (Task 6)

Replace std::apply with a custom fast_apply whose signature omits the conditional noexcept specification:

// std::apply triggers expensive noexcept checks:
// std::is_nothrow_invocable<F, T1&, T2&, ..., TN&>
//   └─ std::__invoke_result<...> × std::__call_is_nothrow<...>

// fast_apply: Direct parameter pack expansion, no noexcept overhead
template <typename F, typename Tuple, std::size_t... Is>
constexpr decltype(auto) fast_apply_impl(F&& f, Tuple&& t, std::index_sequence<Is...>) {
    return std::forward<F>(f)(std::get<Is>(std::forward<Tuple>(t))...);
}

Result: 14-17% build time reduction (Clang: 8m 14s → 7m 03s, GCC: 12m 28s → 10m 19s)

3. C++20 Fold Expressions + Requires (Tasks 7-9)

Replace tuple-based type iteration with direct fold expression expansion:

// Before: 80+ lines of tuple machinery
constexpr bool has_plus = any_check(
    [](auto lhs_t, auto rhs_t) { ... },
    cartesian_product(lhs_types, rhs_types));

// After: ~10 lines with fold expressions
template <typename L, typename... Rs>
constexpr bool check_l_vs_all_r() {
    return (... || requires(L l, Rs r) { l + r; });
}

template <typename... Ls, typename... Rs>
constexpr bool any_pair_supports_plus(TypeList<Ls...>, TypeList<Rs...>) {
    return (... || check_l_vs_all_r<Ls, Rs...>());
}

Result: DynamicType compile time fell from 34,113s to 1,332s with the fold conversion (-96%), then to 824s after removing the deprecated tuple helpers (-97.6% overall)


Visibility Fix (Task 3)

Template-template parameters (Containers) break visibility attribute propagation. Fixed by compiling polymorphic_value.cpp with explicit visibility override:

set_source_files_properties(
  "${NVFUSER_SRCS_DIR}/polymorphic_value.cpp"
  PROPERTIES 
    SKIP_PRECOMPILE_HEADERS ON
    COMPILE_OPTIONS "-fvisibility=default"
)

Without this override, the DynamicType symbols instantiated here end up with hidden visibility in the shared library, producing an undefined-symbol error when importing nvfuser.


Key Files Modified

File Change
lib/dynamic_type/src/dynamic_type/type_traits.h TypeList infrastructure, fast_apply, fold helpers, deprecated old functions
lib/dynamic_type/src/dynamic_type/decl.h Switch dispatch macros, fold-based operator checking, TypeListT exposure
lib/dynamic_type/src/dynamic_type/impl.h Switch dispatch implementations for all operators
csrc/polymorphic_value.h Extern template declaration
csrc/polymorphic_value.cpp Explicit template instantiation
csrc/type.h Removed getDataType/castToDtype implementations
csrc/type.cpp Added getDataType/castToDtype implementations
CMakeLists.txt PCH configuration (10 headers), visibility override

Test Status

Test Suite Status
DynamicType library (72 tests) ✅ All pass
nvFuser PolymorphicValue tests (3 tests) ✅ All pass
Python import ✅ Works (0.2.35+gita16dfcd)

Current Bottleneck Analysis

After optimization, remaining compile time bottlenecks are:

Rank Category Time Addressable?
1 Val-related templates 2,138s ⚠️ Limited (BFS is fundamental)
2 DynamicType std::variant 824s ❌ Inherent to 10-type variant
3 Destructor templates 279s ⚠️ Minor (unique_ptr cleanup)
4 External headers (PyTorch) ~400s ❌ Not nvFuser-controlled

The remaining DynamicType cost (~824s) is inherent to using a 10-type std::variant — this is fundamental to the polymorphic type design, not inefficient code. Further optimization would require architectural changes (type erasure, virtual dispatch).


Commits

Commit Description
a16dfcd4f Replace belongs_to/has_cross_type_equality with fold expressions
6362a1ed8 Replace any_check() with fold + requires pattern (96% DynamicType reduction)
[fast_apply] Replace std::apply with fast_apply (skip noexcept machinery)
40b0d2bab Explicit dispatch(), remove dispatch_deduce()
ea40a920e dispatch() execution switch dispatch
518198f87 Fix symbol visibility (-fvisibility=default for polymorphic_value.cpp)
4e50e4a4b All binary operators switch dispatch
d2fdb387f Comparison operators switch dispatch
89ae03c0f operator== switch dispatch
321164986 Expand PCH to include top nvFuser headers
d9ab3518c Extend PCH to test targets (shared PCH)
1c4484a27 Enable narrow PCH for polymorphic_value.h
436682850 Move getDataType and castToDtype to type.cpp
36a887906 Refactor DynamicType operators to non-template friends
fecdd77b3 M8 Task 12: Convert remaining operators to friend functions
473ae6b2e M8 Task 12: Convert 22 binary operators to friend pattern

Recommendations

For Development

  1. Use Clang for development builds — 28% faster than GCC (6m 29s vs 8m 58s)
  2. Keep GCC for CI — Ensures compatibility across compilers

Future Optimization Opportunities

Priority Opportunity Expected Impact
Medium Split python_bindings.cpp Improved parallelism
Medium BFS template simplification Address Val-related 2,138s
Low Unity builds for CI -65% clean build (incremental penalty)
Low Additional PCH expansion Diminishing returns

Branch Structure

main
 └── m8-friend-functions (36a887906)
      └── m9-pch (321164986)
           └── m10-index-dispatch (HEAD) ← Ready for merge

Last updated: 2026-01-11
Final measurements: Task 10 Report, Task 10b Clang Baselines

@csarofeen (Collaborator, Author)

!test


github-actions bot commented Jan 2, 2026

Description

  • Split DynamicType into decl.h, impl.h, and wrapper headers for better build organization

  • Convert 22+ operators to friend functions reducing template instantiation by 75%

  • Move operators and functions from headers to implementation files to reduce compile time

  • Enable precompiled headers for polymorphic_value.h eliminating ~4000s of redundant parsing

Changes walkthrough

Relevant files

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Header Refactoring

The main header file has been refactored so that it now just includes "impl.h" instead of containing the full implementation. This is a breaking change for users who relied on the previous header structure. The PR mentions this is for "compile-time optimization" but doesn't explain the migration path for existing users.

// clang-format on
#pragma once

// Backward-compatible header - includes everything.
// For compile-time optimization, include decl.h directly
// and use extern template declarations.

#include "impl.h"
Template Instantiation

An explicit template instantiation is added for DynamicType with specific types. This creates a single point of instantiation but may cause linking issues if the explicit instantiation doesn't match what other translation units expect. The -fvisibility=default requirement should be verified across different build configurations.

// Explicit instantiation of DynamicType for PolymorphicValue.
// This is the single point where the template is fully instantiated.
// Note: This file is compiled with -fvisibility=default (set in CMakeLists.txt)
// to ensure all DynamicType symbols are exported from the shared library.
template struct dynamic_type::DynamicType<
    dynamic_type::Containers<std::vector>,
    nvfuser::StructHandle,
    nvfuser::Pointer,
    nvfuser::Opaque,
    at::Tensor,
    std::complex<double>,
    double,
    int64_t,
    bool>;
Test Behavior Change

Tests are converted from compile-time static_assert to runtime EXPECT_EQ tests, and error messages are changed from "Result is dynamic but not convertible to result type" to "Cannot compute". This changes the testing behavior from compile-time failures to runtime failures, which could affect test execution patterns and debugging workflows.

EXPECT_EQ(                                                                 \
    (DoubleInt64Bool(2L) op DoubleInt64Bool(2.5))                          \
        .as<decltype(2L op 2.5)>(),                                        \
    (2L op 2.5));                                                          \
EXPECT_EQ(                                                                 \
    (DoubleInt64Bool(2L) op DoubleInt64BoolTwo{})                          \
        .as<decltype(2L op 2L)>(),                                         \
    (2L op 2L));                                                           \
EXPECT_EQ(                                                                 \
    (DoubleInt64BoolTwo {} op DoubleInt64Bool(2L))                         \
        .as<decltype(2L op 2L)>(),                                         \
    (2L op 2L));                                                           \

Test failures

  • (High, 44) NCCL NVLS multicast memory bind failures in multi-device distributed tests (dtensor/matmul/overlap/transformer) on dlcluster_viking_ci

    Test Name H100 (dist.) Source
    tests.python.multidevice.test_communication.test_allgather
    tests.python.multidevice.test_communication.test_allgather_expanded_broadcast
    tests.python.multidevice.test_communication.test_allreduce
    tests.python.multidevice.test_communication.test_reduce_scatter
    tests.python.multidevice.test_communication.test_reduce_scatter_noncontiguous
    tests.python.multidevice.test_dtensor.test_column_parallel_linear
    tests.python.multidevice.test_dtensor.test_plus_one
    tests.python.multidevice.test_dtensor.test_row_parallel_linear
    tests.python.multidevice.test_expert_parallel.test_dispatch_and_combine
    tests.python.multidevice.test_matmul.test_column_parallel_grouped_mm
    ... with 34 more test failures omitted. Check internal logs.
  • (High, 1) NCCL invalid usage error in multidevice overlap tests (test_overlap_allgather_matmul_shard_outermost)

    Test Name H100 (dist.) Source
    tests.python.multidevice.test_overlap.test_overlap_allgather_matmul_shard_outermost[backend_type=CommunicatorBackend.cuda]

@csarofeen (Collaborator, Author)

(image attachment)

@jacobhinkle (Collaborator)

@csarofeen see #5546 which also uses PCH to achieve a speedup. I had that on hold but I plan to finish it soon by making the use of PCH optional.

@csarofeen (Collaborator, Author)

!test


csarofeen commented Jan 11, 2026

@jacobhinkle this PR does overall build optimizations, and the headers were selected at the point in this PR where they made the most performance difference. I selected 10 headers to balance memory use (the build should keep them cached in RAM) against getting a significant benefit.

It's also implemented so that the headers are reused across the test suite, though they can't be shared between the test suite and the core lib (different build flags).

I would be happy for you to take this over in the PR you've drafted, if you can cover these 10 files. I would be a bit nervous about going far beyond that, as it can become counterproductive on lower-memory systems. I think covering these 10 files should be enabled by default.

Let me know if you have any other questions, comments, or concerns.

Short AI summary on the topic:

PCH Header Selection Rationale

We selected 10 headers for precompiled header (PCH) support based on exclusive parse time — the time spent parsing each header's own content, excluding nested includes. This distinction matters because traditional profiling reports inclusive time, which can be misleading. For example, ir/base_nodes.h appears to take 68 minutes to parse, but 93% of that time is actually spent in nested includes. The exclusive time is only 4.7 minutes — that's the real PCH savings.

Using Clang's -ftime-trace profiling across 237 translation units, we identified the headers with the highest exclusive parse costs. The top contributor is polymorphic_value.h at 27.9 minutes, which triggers DynamicType's template instantiation machinery. Next is type_traits.h (7.9m) containing template metaprogramming utilities, followed by core IR headers like ir/base_nodes.h (4.7m) and type.h (1.4m). Together, the top 10 headers account for ~48 minutes of cumulative parse time savings.

We explicitly excluded PyTorch headers (like ATen/core/jit_type.h) despite their high parse costs because they're external dependencies with varying build configurations. We also excluded "umbrella" headers like fusion.h and ir/all_nodes.h that appear expensive but have very low exclusive time — they simply aggregate other headers that are already covered.

The cutoff at 10 headers reflects diminishing returns: headers 2-8 each provide 7-8 minutes of savings, but after the top 10, remaining candidates offer less than 30 seconds each. Expanding PCH further would increase compile-time coupling and PCH file size (~500MB) without meaningful benefit.


Analysis: M9 Task 4-5, Clang -ftime-trace exclusive time methodology
