
Modify Parallelization Strategy to Make it More General #1988

Conversation


@zhenglongjiepheonix zhenglongjiepheonix commented Aug 14, 2024

As per the title, this PR tries a more general approach rather than relying purely on human heuristics. It uses the following steps to search for a possible parallelization strategy for a transformer model:

  • Use dynamo for graph tracing so that we get a graph to operate on
  • Decompose and functionalize the traced graph so that we get a smaller op set to work with
  • Apply parallel axis analysis and do a constrained backtracking search on the whole graph to get a possible solution (not necessarily optimal)
  • Replace ops in the original traced graph with their parallelized versions (Linear -> ColumnLinear/RowLinear)

As for the API design, we drop support for passing custom modules and focus only on models in transformers, because supporting custom models is not the priority for now.
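
To make the last step concrete, here is a minimal standalone sketch (illustrative only, not this PR's implementation) of why a Linear layer can be swapped for a column-parallel or row-parallel variant: sharding the weight along the output dimension and concatenating the partial results, or along the input dimension and summing them, reproduces the original output.

import torch
import torch.nn as nn

full = nn.Linear(8, 8, bias=False)
x = torch.randn(2, 8)

# Column parallel: shard the output dimension, concatenate partial outputs.
w0, w1 = full.weight.chunk(2, dim=0)
col_out = torch.cat([x @ w0.T, x @ w1.T], dim=-1)

# Row parallel: shard the input dimension, sum partial outputs.
v0, v1 = full.weight.chunk(2, dim=1)
x0, x1 = x.chunk(2, dim=-1)
row_out = x0 @ v0.T + x1 @ v1.T

assert torch.allclose(col_out, full(x), atol=1e-6)
assert torch.allclose(row_out, full(x), atol=1e-6)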

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

optimum/fx/parallelization/api.py
"""
API for automatic model parallelism through Pytorch FX.

Args:
model (Union[torch.nn.Module, str]):
Model to parallelize, could either be a module or a model id on the Huggingface Hub.
model (str):
Member

Suggested change
model (str):
model (`str`):

Comment on lines 63 to 67
class DecompTracer(GraphAppendingTracer):
    def __init__(self, graph: Graph):
        super().__init__(graph)
        self.tensor_tracker = WeakTensorKeyDictionary()
        self.symnode_tracker = _SymNodeDict()
Member

Maybe a docstring explaining what it does.

optimum/fx/parallelization/decomp.py
Comment on lines 74 to 79
that certain primitive layers(like `nn.Linear`, `nn.Embedding`, and activation layers) are preserved because we have specific
heuristic based parallelization strategy for them so that we can conveniently replace them into their parallelized counterparts
in the orignal graph module.

Note that the traced graph is a low-level equivalent representation of the original graph module, and is only used for
parallel axis propagation and analysis, the original graph module is still used for real execution.
Member

Can you group notes as follows:

Notes:
1. Certain primitive layers ....
2. The traced graph is a low-level equivalent...

leaf_function_targets: List[Callable] = [F.scaled_dot_product_attention],
) -> Callable:
"""
API to decompose and funcitonalize a high-level graph module.
Member

Suggested change
API to decompose and funcitonalize a high-level graph module.
API to decompose and functionalize a high-level graph module.

Comment on lines 196 to 200
graph_module (GraphModule):
The high-level graph module to be decomposed and functionalized.
decomposition_table (Dict[torch._ops.OperatorBase, Callable], defaults to `core_aten_decompostions()`):
The lookup table which maps high-level torch op to their equivalent low-level implementation.
leaf_function_targets (List[Callable], defaults to `[F.scaled_dot_product_attention]`):
Member

Suggested change
graph_module (GraphModule):
The high-level graph module to be decomposed and functionalized.
decomposition_table (Dict[torch._ops.OperatorBase, Callable], defaults to `core_aten_decompostions()`):
The lookup table which maps high-level torch op to their equivalent low-level implementation.
leaf_function_targets (List[Callable], defaults to `[F.scaled_dot_product_attention]`):
graph_module (`GraphModule`):
The high-level graph module to be decomposed and functionalized.
decomposition_table (`Dict[torch._ops.OperatorBase, Callable]`, defaults to `core_aten_decompostions()`):
The lookup table which maps high-level torch op to their equivalent low-level implementation.
leaf_function_targets (`List[Callable]`, defaults to `[F.scaled_dot_product_attention]`):
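
As an aside, a sketch of what decomposing to a smaller aten op set looks like using stock PyTorch utilities (make_fx plus core_aten_decompositions); this is not the PR's code, which implements its own pass in optimum/fx/parallelization/decomp.py, but it illustrates the same idea.

import torch
import torch.nn.functional as F
from torch._decomp import core_aten_decompositions
from torch.fx.experimental.proxy_tensor import make_fx

def fn(x, w, b):
    # F.linear is a high-level op; the decomposition table lowers it to aten
    # primitives such as aten.t / aten.addmm.
    return F.relu(F.linear(x, w, b))

gm = make_fx(fn, decomposition_table=core_aten_decompositions())(
    torch.randn(2, 4), torch.randn(3, 4), torch.randn(3)
)
print(gm.graph)  # the printed graph now contains low-level aten ops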

Comment on lines 25 to 41
class Registry:
    def __init__(self) -> None:
        self.mapping = {}

    def register(self, op_types):
        def wrapper(cls):
            if isinstance(op_types, (list, tuple)):
                for op_type in op_types:
                    self.mapping[op_type] = cls
            else:
                self.mapping[op_types] = cls
            return cls

        return wrapper

    def is_supported(self, op_type) -> bool:
        return op_type in self.mapping
Member

What is this registry used for?

Contributor Author

This is for registering the parallel axis policies of different aten ops.

Member

Can you add a docstring to explain that please?
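
For context, a hypothetical usage sketch (the handler class and the op choice are made up for illustration, assuming the Registry class shown above): the decorator maps one or several aten ops to the handler class that knows how to propagate their parallel axis.

import torch

REGISTRY = Registry()

@REGISTRY.register(torch.ops.aten.relu.default)
class ElementwiseParallelAxisHandler:
    """Elementwise ops keep the parallel axis of their input unchanged."""
    ...

assert REGISTRY.is_supported(torch.ops.aten.relu.default)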

Comment on lines 127 to 130
def propagate(self) -> bool:
    arg = self.node.all_input_nodes[0]
    axis = self.extract_axis(arg)
    return [axis]
Member

Nit: it's not returning a bool.
If I understand properly, it returns the axis that is supposed to be parallel?

Contributor Author

@zhenglongjiepheonix zhenglongjiepheonix Aug 20, 2024

Yes, you are right. It looks up the axes of all the inputs and tries to infer the axis of the output, returning an empty list if there is no valid axis on which the output can be parallelized.
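
For illustration, a minimal standalone sketch (the names are assumed, not taken from the PR) of that contract for a binary elementwise op: the handler returns the list of axes along which the output could be parallelized given the axes of its inputs, and an empty list when there is no valid choice.

from typing import List, Optional

def propagate_binary_elementwise(lhs_axis: Optional[int], rhs_axis: Optional[int]) -> List[Optional[int]]:
    # A replicated input (None) can follow the sharding of the other input;
    # two sharded inputs must agree on the axis, otherwise there is no valid
    # axis on which the output can be parallelized.
    if lhs_axis is None:
        return [rhs_axis]
    if rhs_axis is None:
        return [lhs_axis]
    return [lhs_axis] if lhs_axis == rhs_axis else []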

Comment on lines +199 to +201
def search(idx: int):
    if idx == len(nodes):
        return True
Member

Suggested change
def search(idx: int):
    if idx == len(nodes):
        return True
def search(idx: int) -> bool:
    return idx == len(nodes)

Contributor Author

The `search` here is actually a backtracking search function with more logic following this snippet, so we can only return True once we have reached the very last op.
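
To illustrate the pattern being described, here is a generic standalone sketch (names and structure are illustrative, not the PR's actual search over fx nodes): a constrained backtracking search assigns a candidate parallel axis to each node in order, undoes the choice when the partial assignment becomes infeasible, and succeeds exactly when the last node has been assigned.

from typing import Callable, Dict, List

def backtrack(nodes: List[str],
              candidates: Dict[str, List[int]],
              feasible: Callable[[Dict[str, int]], bool],
              assignment: Dict[str, int],
              idx: int = 0) -> bool:
    if idx == len(nodes):
        return True  # every node received a consistent parallel axis
    node = nodes[idx]
    for axis in candidates[node]:
        assignment[node] = axis
        if feasible(assignment) and backtrack(nodes, candidates, feasible, assignment, idx + 1):
            return True
        del assignment[node]  # undo and try the next candidate axis
    return False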

Member

@michaelbenayoun michaelbenayoun left a comment

LGTM!

Comment on lines 27 to 28
Registry class handles registration of parallel axis propagation handlers of different aten ops, to support a new
aten op, you need to register the corresponding handler class by decorating it with `register` function.
Member

nit:

Suggested change
Registry class handles registration of parallel axis propagation handlers of different aten ops, to support a new
aten op, you need to register the corresponding handler class by decorating it with `register` function.
Registry class handles registration of parallel axis propagation handlers of different aten ops.
To support a new aten op, you need to register the corresponding handler class by decorating
it with the `register` function.

@zhenglongjiepheonix
Contributor Author

Merging this; the failures are irrelevant to this PR.

@zhenglongjiepheonix zhenglongjiepheonix merged commit ad98dc9 into huggingface:main Sep 2, 2024
43 of 46 checks passed