[Build Speed][WIP] Dynamic Type, Polymorphic Value, and Precompiled Headers #5747
base: main
Preparatory refactor for wrapper class conversion. No behavior change - just moves the DynamicType alias into detail::DynamicTypeAlias and re-exports as PolymorphicValue.
This reverts commit 339731c.
…, wrapper dynamic_type.h)
…r extern template suppression. Reduces compile time by 56% and template instantiation by 75%.
…nment) to friend functions.
… guards. Fix tests for new error messages.
Reduces template instantiation by 28% by confining ForAllTypes dispatch to one TU.
Precompile polymorphic_value.h to eliminate ~4000s of redundant header parsing. Enabled by default for Release builds. Disable with -DNVFUSER_USE_POLYMORPHIC_PCH=OFF.
Replaces ForAllTypes/dispatch with fold expression dispatch, eliminating template overhead.
…rAllTypes/Void overhead and fix Clang 18 template crash
…wise, named comparisons). Uses macro-generated switch statements supporting up to 16 type alternatives.
…cpp with -fvisibility=default. Resolves undefined symbol error when importing nvfuser.
!test
Description
PR Reviewer Guide
Here are some key observations to aid the review process:
🧪 PR contains tests
⚡ Recommended focus areas for review
Header Refactoring
Test failures
- (High, 44) NCCL NVLS multicast memory bind failures in multi-device distributed tests (dtensor/matmul/overlap/transformer) on dlcluster_viking_ci

| Test Name | H100 (dist.) | Source |
|---|---|---|
| tests.python.multidevice.test_communication.test_allgather | ❌ | |
| tests.python.multidevice.test_communication.test_allgather_expanded_broadcast | ❌ | |
| tests.python.multidevice.test_communication.test_allreduce | ❌ | |
| tests.python.multidevice.test_communication.test_reduce_scatter | ❌ | |
| tests.python.multidevice.test_communication.test_reduce_scatter_noncontiguous | ❌ | |
| tests.python.multidevice.test_dtensor.test_column_parallel_linear | ❌ | |
| tests.python.multidevice.test_dtensor.test_plus_one | ❌ | |
| tests.python.multidevice.test_dtensor.test_row_parallel_linear | ❌ | |
| tests.python.multidevice.test_expert_parallel.test_dispatch_and_combine | ❌ | |
| tests.python.multidevice.test_matmul.test_column_parallel_grouped_mm | ❌ | |

... with 34 more test failures omitted. Check internal logs.

- (High, 1) NCCL invalid usage error in multidevice overlap tests (test_overlap_allgather_matmul_shard_outermost)

| Test Name | H100 (dist.) | Source |
|---|---|---|
| tests.python.multidevice.test_overlap.test_overlap_allgather_matmul_shard_outermost[backend_type=CommunicatorBackend.cuda] | ❌ | |
@csarofeen see #5546 which also uses PCH to achieve a speedup. I had that on hold but I plan to finish it soon by making the use of PCH optional.
Converts the runtime execution loop in dispatch() from ForAllTypes to index-based switch. Return type inference still uses ForAllTypes - to be addressed in follow-up.
…_deduce() and operator->(). Updates 22 production call sites and 8 operator->() usages. ~9% build time improvement.
…lock 8m14s → 7m03s (-14.4%), all tests pass.
…compile time reduction. Converts all 6 any_check sites to C++20 fold expressions, reducing DynamicType from 34,113s to 1,332s template time.
…ove deprecated cartesian_product/any_check. DynamicType: 1,332s -> 859s (35% reduction).
!test
@jacobhinkle this PR does overall build optimizations, and the headers were selected at a point in time in this PR where they made the most performance difference. I selected 10 to balance the number of files that use a lot of memory (the build should cache these in RAM) against getting a significant benefit. It's also implemented so that the headers are reused across the test suite, though they can't be shared between the test suite and the core lib (different build flags).

I would be appreciative and happy for you to do this in the PR you've drafted, if you can cover these 10 files. I would be a bit nervous to go far beyond that, as it can become counterproductive on systems with less RAM. I think covering these 10 files should be enabled by default. Let me know if you have any other questions, comments, or concerns.

Short AI summary on the topic:

PCH Header Selection Rationale

We selected 10 headers for precompiled header (PCH) support based on exclusive parse time: the time spent parsing each header's own content, excluding nested includes. This distinction matters because traditional profiling reports inclusive time, which can be misleading. For example, … Using Clang's … We explicitly excluded PyTorch headers (like …

The cutoff at 10 headers reflects diminishing returns: headers 2-8 each provide 7-8 minutes of savings, but after the top 10, remaining candidates offer less than 30 seconds each. Expanding PCH further would increase compile-time coupling and PCH file size (~500MB) without meaningful benefit.

Analysis: M9 Task 4-5, Clang -ftime-trace exclusive time methodology

Branch Summary: m10-index-dispatch
Pull Request: DynamicType Build Speed Optimization
Executive Summary
This PR reduces nvFuser clean build time by 54% for GCC (19m 34s → 8m 58s) and 64% for Clang (18m 13s → 6m 29s) through systematic optimization of DynamicType's template instantiation machinery. The core achievement is a 97.6% reduction in DynamicType compile time (34,113s → 824s) by replacing expensive tuple-based type iteration with C++20 fold expressions and index-based switch dispatch.
Clang builds are now 28% faster than GCC (6m 29s vs 8m 58s), making Clang the recommended compiler for development.
Build Time Results
Final Measurements (Verified)
Milestone Progression (GCC)
Milestone Progression (Clang)
Clang vs GCC Comparison
Note: PCH (M9) benefits Clang dramatically more than GCC (-56% vs -16%).
Template Instantiation
Technical Approach
Problem: Explosive Template Costs
DynamicType's 10-type variant (PolymorphicValue) required type iteration for operator overloading. The original approach used:
This created O(N²) template instantiations for every operator, multiplied across 237 translation units.
Solution: Three Complementary Optimizations
1. Index-Based Switch Dispatch (M8-M10)
Replace recursive template instantiation with flat switch statements:
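A minimal sketch of the idea, assuming a plain `std::variant` with four alternatives. The names `sketch::Value`, `sketch::dispatch`, and `sketch::negate` are hypothetical and are not taken from the PR, which reportedly generates its switch statements with macros for up to 16 alternatives:

```cpp
#include <stdexcept>
#include <type_traits>
#include <variant>

namespace sketch {

// Hypothetical 4-alternative value type (assumption for illustration only).
using Value = std::variant<std::monostate, long, double, bool>;

// Flat dispatch: one switch on the runtime index, one case per alternative.
// No recursive helper templates are instantiated to walk the type list.
template <typename R, typename F>
R dispatch(const Value& v, F&& f) {
  switch (v.index()) {
    case 0: return f(std::get<0>(v));
    case 1: return f(std::get<1>(v));
    case 2: return f(std::get<2>(v));
    case 3: return f(std::get<3>(v));
    default: throw std::bad_variant_access();  // valueless variant
  }
}

// Example operator built on the flat dispatch: unary minus.
inline Value negate(const Value& v) {
  return dispatch<Value>(v, [](const auto& x) -> Value {
    using T = std::decay_t<decltype(x)>;
    if constexpr (std::is_same_v<T, long> || std::is_same_v<T, double>) {
      return -x;  // defined only for the arithmetic alternatives
    } else {
      throw std::invalid_argument("unary minus is not defined for this type");
    }
  });
}

}  // namespace sketch
```

The visitor body is still instantiated once per alternative, but the dispatch itself costs a single function template rather than a chain of recursive instantiations.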
Operators now dispatch via variant index at compile time, eliminating the recursive ForAllTypes template machinery.
2. fast_apply (Task 6)
Replace std::apply with a custom fast_apply that skips noexcept specification:
Result: 14-17% build time reduction (Clang: 8m 14s → 7m 03s, GCC: 12m 28s → 10m 19s)
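One way such a helper could look is sketched below. This is an assumption about its shape, not the PR's actual code: `sketch::fast_apply` unpacks a tuple with an index sequence and declares no `noexcept` specification, so the compiler never has to evaluate a conditional `noexcept(...)` expression for each instantiation.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

namespace sketch {

// Helper: expands the tuple elements into the call via an index pack.
template <typename F, typename Tuple, std::size_t... Is>
constexpr decltype(auto) fast_apply_impl(
    F&& f, Tuple&& t, std::index_sequence<Is...>) {
  return std::forward<F>(f)(std::get<Is>(std::forward<Tuple>(t))...);
}

// Like std::apply, but deliberately without a noexcept specification.
template <typename F, typename Tuple>
constexpr decltype(auto) fast_apply(F&& f, Tuple&& t) {
  constexpr std::size_t n = std::tuple_size_v<std::remove_reference_t<Tuple>>;
  return fast_apply_impl(
      std::forward<F>(f), std::forward<Tuple>(t), std::make_index_sequence<n>{});
}

}  // namespace sketch
```

The trade-off is that callers can no longer query whether the call is non-throwing through the wrapper, which only matters for code that inspects `noexcept(fast_apply(...))`.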
3. C++20 Fold Expressions + Requires (Tasks 7-9)
Replace tuple-based type iteration with direct fold expression expansion:
Result: 96% reduction in DynamicType compile time (34,113s → 1,332s → 824s)
Visibility Fix (Task 3)
Template-template parameters (Containers) break visibility attribute propagation. Fixed by compiling polymorphic_value.cpp with explicit visibility override:
This fixes symbol visibility issues with template-template parameters.
Key Files Modified
- lib/dynamic_type/src/dynamic_type/type_traits.h
- lib/dynamic_type/src/dynamic_type/decl.h
- lib/dynamic_type/src/dynamic_type/impl.h
- csrc/polymorphic_value.h
- csrc/polymorphic_value.cpp
- csrc/type.h: getDataType/castToDtype implementations
- csrc/type.cpp: getDataType/castToDtype implementations
- CMakeLists.txt
Test Status
Current Bottleneck Analysis
After optimization, remaining compile time bottlenecks are:
The remaining DynamicType cost (~824s) is inherent to using a 10-type std::variant — this is fundamental to the polymorphic type design, not inefficient code. Further optimization would require architectural changes (type erasure, virtual dispatch).
Commits
a16dfcd4f6362a1ed8[fast_apply]40b0d2babea40a920e518198f874e50e4a4bd2fdb387f89ae03c0f321164986d9ab3518c1c4484a2743668285036a887906fecdd77b3473ae6b2e
Recommendations
For Development
Future Optimization Opportunities
python_bindings.cpp
Branch Structure
Last updated: 2026-01-11
Final measurements: Task 10 Report, Task 10b Clang Baselines