Implement generate_vbe_metadata cpu #3715

Open
spcyppt wants to merge 3 commits into main from export-D69162870
Conversation

spcyppt (Contributor) commented Feb 19, 2025

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/796

This diff implements `generate_vbe_metadata` for CPU, such that the function returns the same output for CPU, CUDA, and MTIA.

To support VBE on CPU with the existing fixed-batch-size CPU kernel, we need to recompute offsets, which was previously done in Python. This diff implements the offsets recomputation in C++, such that all manipulations are done in C++.
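
For intuition, here is a minimal Python sketch of what such a recomputation can look like. The helper name and the offsets layout are illustrative assumptions, not FBGEMM's actual C++ implementation: each table's offsets are padded out to the max batch size by repeating the table's last end offset, so the padded bags are empty.

```python
import torch

# Illustrative sketch only -- not FBGEMM's actual implementation.
# Assumes `offsets` is the concatenated per-table VBE offsets tensor of
# length sum(batch_sizes) + 1, where bag k reads
# indices[offsets[k] : offsets[k + 1]].
def recompute_offsets_fixed_batch(
    offsets: torch.Tensor, batch_sizes: list, max_B: int
) -> torch.Tensor:
    out, start = [], 0
    for B in batch_sizes:
        out.append(offsets[start : start + B])  # this table's B bag starts
        # Pad to max_B bags; repeating the end offset makes the extra bags empty.
        out.append(offsets[start + B].expand(max_B - B))
        start += B
    out.append(offsets[-1:])  # closing sentinel offset
    return torch.cat(out)  # length: len(batch_sizes) * max_B + 1

# e.g. two tables with batch sizes [2, 3] and max_B = 3:
# recompute_offsets_fixed_batch(torch.tensor([0, 2, 5, 5, 7, 9]), [2, 3], 3)
# -> tensor([0, 2, 5, 5, 5, 7, 9])
```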

Note that reshaping offsets and grad_input to work with the existing fixed-batch-size CPU kernels is done in Autograd instead of the wrapper, to avoid computing the reshape multiple times.
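
A toy sketch of this "reshape inside Autograd" pattern is below. It is a hypothetical stand-in, not the actual TBE operator: the point is that the reshape to the fixed-batch layout runs once in forward and the inverse slicing runs once in backward, instead of being recomputed by a Python wrapper.

```python
import torch

class FixedBatchAdapter(torch.autograd.Function):
    """Toy stand-in for the pattern described above (not the real TBE op):
    doing the VBE <-> fixed-batch reshape inside the autograd Function means
    it runs once in forward and once in backward, rather than being redone
    by a Python wrapper around the op."""

    @staticmethod
    def forward(ctx, vbe_out: torch.Tensor, batch_sizes: list, max_B: int):
        ctx.batch_sizes, ctx.max_B = batch_sizes, max_B
        rows, start = [], 0
        for B in batch_sizes:
            chunk = vbe_out[start : start + B]
            pad = chunk.new_zeros((max_B - B,) + chunk.shape[1:])
            rows += [chunk, pad]  # pad each table up to max_B rows
            start += B
        return torch.cat(rows)  # fixed [num_tables * max_B, D] layout

    @staticmethod
    def backward(ctx, grad_fixed: torch.Tensor):
        grads, start = [], 0
        for B in ctx.batch_sizes:
            grads.append(grad_fixed[start : start + B])  # drop padded rows
            start += ctx.max_B
        return torch.cat(grads), None, None
```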

VBE CPU tests are in the next diff.

Reviewed By: sryap

Differential Revision: D69162870

@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D69162870

netlify bot commented Feb 19, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Name                  Link
🔨 Latest commit      3a59542
🔍 Latest deploy log  https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67be996da593b4000864c698
😎 Deploy Preview     https://deploy-preview-3715--pytorch-fbgemm-docs.netlify.app

spcyppt added a commit to spcyppt/FBGEMM that referenced this pull request Feb 21, 2025
Summary:
X-link: pytorch/torchrec#2751


X-link: facebookresearch/FBGEMM#793

**Backend**: D68054868

---


As the number of arguments in TBE keeps growing, some of the optimizers run into the argument-count limit (64) during PyTorch operator registration.

**For long-term growth and maintenance, we hence redesign the TBE API by packing some of the arguments into lists. Note that not all arguments are packed.**

We pack the arguments into a list per type.
For **common** arguments, we pack
- weights and arguments of type `Momentum` into a TensorList
- other tensors and optional tensors into a list of optional tensors, `aux_tensor`
- `int` arguments into `aux_int`
- `float` arguments into `aux_float`
- `bool` arguments into `aux_bool`.

Similarly, for **optimizer-specific** arguments, we pack
- arguments of type `Momentum` that are *__not__ optional* into a TensorList
- *optional* tensors into a list of optional tensors, `optim_tensor`
- `int` arguments into `optim_int`
- `float` arguments into `optim_float`
- `bool` arguments into `optim_bool`.

Packing SymInt arguments ran into PyTorch registration issues across the Python/C++ boundary, so we unroll SymInt arguments and pass them individually.

**This significantly reduces the number of arguments.** For example, `split_embedding_codegen_lookup_rowwise_adagrad_with_counter_function`, which currently has 61 arguments, has only 26 arguments with this API design.
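
For illustration, a packed lookup signature under this scheme could look roughly like the sketch below. The function name and parameter ordering are hypothetical; the real generated signatures are in the design doc linked underneath.

```python
from typing import List, Optional
import torch

# Hypothetical shape of a packed lookup signature under this scheme; the
# actual generated operators are documented in the design doc linked below.
def split_embedding_codegen_lookup_example_function(
    # common arguments
    weights: List[torch.Tensor],               # weights + Momentum-typed tensors
    aux_tensor: List[Optional[torch.Tensor]],  # other (optional) tensors
    aux_int: List[int],
    aux_float: List[float],
    aux_bool: List[bool],
    # optimizer-specific arguments
    optim_tensor: List[Optional[torch.Tensor]],
    optim_int: List[int],
    optim_float: List[float],
    optim_bool: List[bool],
    # SymInt arguments stay unrolled due to the registration issues above
    max_B: torch.SymInt,
    total_D: torch.SymInt,
) -> torch.Tensor:
    ...
```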

Please refer to the design doc for exactly which arguments are packed and for the signatures:
https://docs.google.com/document/d/1dCBg7dcf7Yq9FHVrvXsAmFtBxkDi9o6u0r-Ptd4UDPE/edit?tab=t.0#heading=h.6bip5pwqq8xb

The full signature for each optimizer lookup function will be provided shortly.

Reviewed By: sryap

Differential Revision: D68055168
spcyppt added a commit to spcyppt/FBGEMM that referenced this pull request Feb 26, 2025
@spcyppt force-pushed the export-D69162870 branch 2 times, most recently from 90727e6 to 3a59542, on February 26, 2025 at 04:32