[SYNC] Sync CentML -> hidet-org #455

Merged
merged 98 commits into from
Jul 23, 2024

Conversation

vadiklyutiy
Collaborator

Regular sync CentML -> hidet-org

jacklee1792 and others added 30 commits July 22, 2024 23:28
Instead of mocking a ctypes type like `c_pointer_compatible`, these changes
make the transformation between Python values and ctypes values more explicit
with direct function calls inside `CompiledFunction`.
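
For illustration, a minimal sketch of the idea (not the actual `CompiledFunction` code; helper names like `_to_ctypes_arg` are made up here):

```python
import ctypes

def _to_ctypes_arg(value):
    """Explicitly convert one Python argument into the ctypes value passed to the shared library."""
    if isinstance(value, bool):          # check bool before int: bool is a subclass of int
        return ctypes.c_bool(value)
    if isinstance(value, int):
        return ctypes.c_int64(value)
    if isinstance(value, float):
        return ctypes.c_double(value)
    # tensors would be converted to ctypes.c_void_p(storage_address) here
    raise TypeError(f'unsupported argument type: {type(value)}')

def call_compiled(func, *args):
    """Call a compiled function with explicitly converted arguments."""
    return func(*[_to_ctypes_arg(a) for a in args])
```
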
Define complete UNet, with forward pass broken into down, mid, and up
sections. Useful diagrams
[here](http://jalammar.github.io/illustrated-stable-diffusion/)

Uses blocks defined in #97. A heavily reduced version from diffusers,
containing only the features necessary for stable diffusion v2-1.

Towards #57.

---------

Co-authored-by: vadiklyutiy <[email protected]>
Stable diffusion uses fundamentally the same positional embeddings, but
since timesteps change, a cache is not possible.

There are also small changes in tensor layouts and calculation parameters
between the diffusers version and the one from Llama, so I've recreated
it here for now. An abstract version that combines both versions is TODO.

Towards #57.
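
For context, a NumPy sketch of the sinusoidal timestep embedding scheme stable diffusion relies on (the real Hidet version differs in tensor layout and calculation parameters, as noted above):

```python
import numpy as np

def timestep_embedding(timesteps: np.ndarray, dim: int, max_period: float = 10000.0) -> np.ndarray:
    """Map integer timesteps to sin/cos embeddings of size `dim` (assumes dim is even)."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half, dtype=np.float32) / half)  # (half,)
    args = timesteps.astype(np.float32)[:, None] * freqs[None, :]                   # (batch, half)
    return np.concatenate([np.cos(args), np.sin(args)], axis=-1)                    # (batch, dim)

print(timestep_embedding(np.array([0, 500, 999]), dim=8).shape)  # (3, 8)
```
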
With CentML/hidet#69 there will be a lot more
C++ code introduced into the runtime, so I think it's a good idea to have
some standardization. For now this only does formatting (no linting,
which takes more work to set up and is more opinionated about right vs.
wrong).

Summary of changes:
- Update `format.sh` to support formatting just Python, C++, or both
- Add `clang-format` to the existing lint/format workflow
- Apply `clang-format` changes to existing code; I've set up the
configuration to try to minimize the number of changes and have excluded
the float16/bfloat16 code

Example workflow failure @ 4cc430c:
<img width="1155" alt="image"
src="https://github.com/CentML/hidet/assets/43303581/9566e9dd-bd01-4638-b556-11afaf7e6e52">
Add UNet Down, Up, and Mid block definitions and attention transformer
utility layer.

Modules are designed so that the kwargs passed to constructors are all the
same config from Hugging Face with minimal changes - lots of shared
values and too many parameters to list individually. The same kwargs are
passed to nested objects. Open to other suggestions, although this is a
single-use-case problem.

Towards #57.
Adds support for LLaMA, GPT-2, and OPT tokenizers using the Hugging Face configuration
Infrastructure for compiled stable diffusion app.

Towards #57
**Context:**
I made these changes to help with debugging Gemma: the dump produces
many operators, and this makes it easier to find, for example, which
operators involve the input IDs / position IDs / KV-cache.

**Summary of changes:**
- Add missing dump_op parameter to ctx.debug()
- Dump input indices (e.g. @23) in operator dump
- Prevent dump_op and dump_outputs from overriding each other in the
single-output case

This is an example `41_Concat_def.txt` taken from my Gemma
implementation, which corresponds to concatenating past keys in the
KV-cache with the current keys. The `Inputs` field shows the indices of
the operator inputs, which might be another operator output `@n` or some
graph input `@in:n`.

```
Operator:
Concat(0: float32(bs, 1, past_seq_len, 256), 1: float32(bs, 1, seq_len, 256), axis=2)
Inputs:
	0 <- @in:2
	1 <- @40
Task:
Task(
  name: concat
  parameters: 
    x0: tensor(float32, [bs, 1, past_seq_len, 256])
    x1: tensor(float32, [bs, 1, seq_len, 256])
    out: tensor(float32, [bs, 1, (past_seq_len + seq_len), 256])
  inputs: [x0, x1]
  outputs: [out]
  computations: 
    out: float32[bs, 1, (past_seq_len + seq_len), 256] where out[v, v_1, v_2, v_3] = ((v_2 < past_seq_len) ? x0[v, v_1, v_2, v_3] : x1[v, v_1, (v_2 - past_seq_len), v_3])
  attributes: {}
)
```
…ion(`implement*`) (#127)

This PR parallelizes: 
 - `apply_prolog_epilog` (fusion)
 - IR generation (`implement*`)

Right now this is implemented for the host only (no offload to the compilation server).

resnet50 compilation speed on g5.16xlarge 
`time python tests/benchmarks/bench_vision.py resnet50 --params 1x3x224x224 --dtype float16`
Before: 14m 45s
After: 12m 51s
Speedup: 14.8%

matmul compilation speed on g5.16xlarge 
`time python tests/benchmarks/bench_op.py batch_matmul --params 1x4096x4096,1x4096x4096 --dtype float16`
Before: 5m 54s
After: 5m 31s
Speedup: 6.9%
Fix `__shfl_xor_sync`. I don't know why `__shfl_xor_sync` was an alias of
`__shfl_down_sync`. Is this intentional?

Co-authored-by: xiaocenxiaocen <[email protected]>
Closes #450.

Output of the example code provided in the issue:
```
/home/jack/dev/hidet/venv/bin/python3.8 /home/jack/.config/JetBrains/RemoteDev-PY/_home_jack_dev_hidet/scratches/scratch_2.py 
Compiling cpu task tan(x=float32(2, 2), y=float32(2, 2))...
Tensor(shape=(2, 2), dtype='float32', device='cpu')
[[  0.2568644  -1.0825194]
 [-32.35311    -1.5977247]]
```
The previous implementation is incorrect when dealing with a pair of
dimensions that are both symbolic. Minimal example:

```python
import hidet

if __name__ == "__main__":
    x = hidet.symbol(["n"])
    y = hidet.symbol(["m"])
    z = x + y
    print(x.shape, y.shape, z.shape)  # before: (n,) (m,) (m,)
```

**Overview** 
Specialize function `Constant._binary()` for compilation speedup

**Compilation time improvement results** 
matmul_f16 with `max_parallel_jobs=1`
Before: 2m 11.2s
After: 2m 4.4s
Speedup: 5.5%

**Additional test**
matmul_f16 has 177 candidates. I checked that all of them remained the same (no functional changes).
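
As an illustration of the idea (a hedged sketch, not Hidet's actual `Constant` class): fold two constants of the same dtype directly with the Python operator instead of going through the generic expression machinery.

```python
import operator

_PY_OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul, '/': operator.truediv}

class Constant:
    def __init__(self, value, dtype: str):
        self.value = value
        self.dtype = dtype

    def _binary(self, op: str, other: 'Constant') -> 'Constant':
        # fast path: two constants of the same dtype are folded immediately
        if isinstance(other, Constant) and other.dtype == self.dtype and op in _PY_OPS:
            return Constant(_PY_OPS[op](self.value, other.value), self.dtype)
        raise NotImplementedError('fall back to the generic expression path')

print(Constant(2.0, 'float32')._binary('*', Constant(3.0, 'float32')).value)  # 6.0
```
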
- The attention scaling should be based on the head dimension (see the sketch below).
- The option name `tokens.for_huggingface` is incorrect; see the
following:
https://github.com/CentML/hidet/blob/eefc9d81afe687e9173c65c68fc3c7eb4e3019a7/python/hidet/option.py#L299-L304

With these changes the LLM app runs correctly before tracing into
FlowGraph. Those changes will come later; I'm isolating these minor
changes into their own PR here.
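
For reference, standard scaled dot-product attention scales scores by the square root of the head dimension; a NumPy sketch of that behaviour (illustrative only, not the app's code):

```python
import numpy as np

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """q, k, v: (..., seq, head_dim); scores are scaled by 1/sqrt(head_dim)."""
    head_dim = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(head_dim)         # (..., seq_q, seq_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ v
```
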
Allow access to cluster attributes inside Hidet kernels. Launch kernels
with distributed shared memory.

See docs:


https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#distributed-shared-memory

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-block-clusters

API:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cluster-group-cg

Towards supporting #102 by adding cluster rank primitive in Hidet.

See `test_cluster.py` for example usage. To run the tests on Hopper machines,
use `pytest --hopper`.
Gemma + torch.compile fixes:
 - process `_enter_autocast` and `_exit_autocast` as no-ops
 - support `truediv(float, Tensor)`
 - add support for eager mode to `tests/benchmarks`
The current exit hook is a no-op
Removes kwargs from stable diffusion app components.

Adds documentation and sample code.
Support the transpose operator with rank == 2 only

---------

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Max Hu <[email protected]>
Revive dynamic shape support with `torch.compile`.
It was broken due to changes in the PyTorch interface.
Adds ResNet and image classifier pipeline functionality.

Includes changes from #428

See huggingface implementation for original API inspiration.

Resolves CentML/hidet#60
…d` (#175)

1. Add `torch.Tensor.sin` and `torch.Tensor.cos` to `register_method`.
Gemma passes after that.

2. Add `torch._C._nn.pad`.
The Test Workflow works with torch 2.3.0 after that.
Co-authored-by: zita <[email protected]>
Co-authored-by: Kevin Tong <[email protected]>
Co-authored-by: xiaocenxiaocen <[email protected]>
Introduces `SyncLLM` and `AsyncLLM` interfaces to interact with the
LLM; closes #164.

### SyncLLM.generate

Takes in 1 or a list of n prompts, and 0, 1, or a list of n sampling
parameters.
- If no sampling parameter is provided, greedy sampling is used.
- If 1 prompt and 1 sampling parameter are provided, the return is a
single `SequenceOutput`.
- If a list of n prompts and 1 sampling parameter are provided, the
sampling parameter is applied to all prompts and the return is a list of
`SequenceOutput`.
- If a list of n prompts and a list of n sampling parameters are
provided, the sampling parameters are applied respectively to each
prompt.
- Any other configuration is invalid.

### AsyncLLM.generate

Takes in 1 prompt and 0 or 1 sampling parameters. The same default from
the synchronous version applies if no sampling parameters are provided.
_Without blocking_, returns an async iterator over `SequenceOutput`,
which is updated with every token generated.

### Usage

Here's an example script to demonstrate the API.

```py
import asyncio
import random

from hidet.apps.llm import create_llm
from hidet.apps.llm.sampler import SamplingParams


async def _demo_async():
    llm = create_llm("meta-llama/Llama-2-7b-chat-hf", is_async=True)
    prompts = [
        "Hello, how are you?",
        "How do you feel about the current political climate?",
        "What is your favorite food?",
        "What is your favorite color?",
        "What is your favorite movie?",
        "What is your favorite book?",
        "What is your favorite song?",
        "What is your favorite animal?",
        "What is your favorite hobby?",
        "When is your birthday?",
    ]

    coros = []
    for prompt in prompts:
        async def f(prompt):
            await asyncio.sleep(random.randint(1, 60))
            print("Incoming request: ", prompt)
            params = SamplingParams(temperature=0.0, max_tokens=random.randint(10, 100))
            stream = llm.generate(prompt, sampling_params=params)
            final = None
            async for output in stream:
                # print(output.tokens)
                final = output
            print("=====")
            print("Completed request: ", prompt)
            print("Output: ", final.output_text)
            print("=====")
        coros.append(f(prompt))

    await asyncio.gather(*coros)


def demo_async():
    asyncio.run(_demo_async())


def demo_sync():
    llm = create_llm("meta-llama/Llama-2-7b-chat-hf", is_async=False)
    prompts = [
        "Hello, how are you?",
        "How do you feel about the current political climate?",
        "What is your favorite food?",
        "What is your favorite color?",
        "What is your favorite movie?",
        "What is your favorite book?",
        "What is your favorite song?",
        "What is your favorite animal?",
        "What is your favorite hobby?",
        "When is your birthday?",
    ]
    for output in llm.generate(prompts):
        print("=====")
        print("Completed request: ", output.prompt)
        print("Output: ", output.output_text)
        print("=====")


if __name__ == "__main__":
    demo_sync()
    # demo_async()
```

---------

Co-authored-by: Yaoyao Ding <[email protected]>
- Shuffle the workload (candidates) to avoid imbalance in compilation time
- Modify workload grouping so the number of jobs matches the number of CPUs (see the sketch below)

Co-authored-by: Ubuntu <[email protected]>
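
A small sketch of the grouping idea described above (names are illustrative, not the exact Hidet code):

```python
import os
import random

def make_job_groups(candidates, seed: int = 0):
    """Shuffle candidates so slow ones don't cluster, then form one group per CPU."""
    shuffled = list(candidates)
    random.Random(seed).shuffle(shuffled)                  # balance compilation time across jobs
    num_jobs = max(1, min(os.cpu_count() or 1, len(shuffled)))
    return [shuffled[i::num_jobs] for i in range(num_jobs)]

print(make_job_groups(range(10)))
```
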
I noticed that we spend significant time on process creation in `parallel_imap`.

Add `chunksize` arg to `pool.imap` to decrease the overhead. 

**Results.**
`time python bench_op.py matmul_f16 --params 1x4096x4096,1x4096x4096
--dtype float16`
`time python bench_op.py batch_matmul --params 1x4096x4096,1x4096x4096
--dtype float16`

| Test | Before | After | Improvement |
|--------|--------|--------|--------|
| matmul_f16 | 42.768 s | 42.138 s | 1.5% |
| batch_matmul | 34m 29.1s | 34m 10.1s | 0.9% |
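
A hedged sketch of the change's idea: pass `chunksize` to `Pool.imap` so jobs are dispatched to workers in batches, reducing per-task overhead (this is not the exact `parallel_imap` implementation):

```python
import multiprocessing

def parallel_imap(func, jobs, num_workers: int = 8):
    # hand out a few chunks per worker instead of one job at a time
    chunksize = max(1, len(jobs) // (num_workers * 4))
    with multiprocessing.Pool(processes=num_workers) as pool:
        return list(pool.imap(func, jobs, chunksize=chunksize))
```
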
BolinSNLHM and others added 19 commits July 22, 2024 23:28
After disallowing functions unsupported by Hidet as in #317, the
compilation of the model `vision_maskrcnn` (which previously failed on the
unsupported `topk` method, as in #267) failed with a TypeError and the
following traceback message:

> File
"/home/bolin/Desktop/hidet/python/hidet/graph/graph_utils/functors.py",
line 75, in visit
> ret = self.visit_Operator(obj) # pylint: disable=assignment-from-none
>           ^^^^^^^^^^^^^^^^^^^^^^^^
> File
"/home/bolin/Desktop/hidet/python/hidet/graph/graph_utils/functors.py",
line 126, in visit_Operator
>     updated_outputs = op.reforward(inputs)
>                       ^^^^^^^^^^^^^^^^^^^^
> File "/home/bolin/Desktop/hidet/python/hidet/graph/operator.py", line
185, in reforward
>     return cls(*inputs, **attributes).outputs
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^
> torch._dynamo.exc.BackendCompilerFailed: backend='hidet' raised:
> TypeError: ClampOp.__init__() missing 2 required positional arguments:
'min_value' and 'max_value'


The cause is that, inside the [`reforward`
function](https://github.com/CentML/hidet/blob/da56e48148c5b075f1fba6d1d878a82889c9f731/python/hidet/graph/operator.py#L180-L185),
during the call to `cls(*inputs, **attributes)`, where `cls` is
`ClampOp`, `inputs` only consists of the input tensor and `attributes`
is an empty dictionary, so `min_value` and `max_value` cannot be
passed to the initializer. This is because we did not populate the
`attributes` dictionary with the values of these two parameters
[while initializing
`ClampOp`](https://github.com/CentML/hidet/blob/da56e48148c5b075f1fba6d1d878a82889c9f731/python/hidet/graph/ops/arithmetic.py#L586-L595).
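
A minimal sketch of what the fix amounts to (class names mirror Hidet's, but this is simplified code, not the actual implementation): the operator must record its scalar parameters in `attributes` so that `reforward` can rebuild it via `cls(*inputs, **attributes)`.

```python
class Operator:
    def __init__(self, inputs, attributes):
        self.inputs = inputs
        self.attributes = attributes

    def reforward(self, inputs):
        cls = self.__class__
        return cls(*inputs, **self.attributes)   # fails if attributes were never recorded

class ClampOp(Operator):
    def __init__(self, x, min_value, max_value):
        # the fix: store min_value/max_value in attributes instead of leaving them out
        super().__init__(inputs=[x], attributes={'min_value': min_value, 'max_value': max_value})
```
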
Currently the `Tensor` class does not have a `__ge__` method, which
leads to an error during model compilation:

> torch._dynamo.exc.BackendCompilerFailed: backend='hidet' raised:
> RuntimeError: '>=' not supported between instances of 'Tensor' and
'float', occurred when interpreting operator.ge with
>   ge(tensor(...), 0.001)
> ge is defined at
> File
"/home/bolin/Desktop/hidet/python/hidet/graph/frontend/torch/register_functions.py",
line 1123

This can be solved by adding the `__ge__` special method to the `Tensor`
class.
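
The gist of the fix, sketched on a stand-in `Tensor` class (Hidet's real `Tensor` delegates to its registered elementwise `ge` op):

```python
class Tensor:
    def __init__(self, data):
        self.data = list(data)

    def __ge__(self, other):
        # elementwise >= against a scalar or another Tensor
        rhs = other.data if isinstance(other, Tensor) else [other] * len(self.data)
        return Tensor([a >= b for a, b in zip(self.data, rhs)])

print((Tensor([0.1, 0.0001]) >= 0.001).data)  # [True, False]
```
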
Inherit `options` from `torch.compile(..., options={}, ...)`
Continue with option cleaning. 

Remove `dynamo_config['search_space']`
This PR disallows functions in the fxgraph that are unsupported (not registered) in Hidet.

The fxgraph contains functions, methods (methods of `torch.Tensor`), and modules (`torch.nn`). These changes concern functions only.

Notes.
1. Works with torch version >= 2.2.0.
2. There are a number of functions that are allowed and appear in the fxgraph at the Dynamo level, but Dynamo resolves them before passing the fxgraph to the compiler. If we simply disallow them, we get an additional graph break. As a workaround, these funcs are registered but their implementation just raises an exception (see the sketch below).
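
A hedged sketch of the workaround in note 2 (the registry below is illustrative, not Hidet's actual one): the function stays registered so Dynamo does not introduce an extra graph break, but calling it immediately raises.

```python
_REGISTERED_FUNCTIONS = {}

def register_function(torch_func):
    def decorator(impl):
        _REGISTERED_FUNCTIONS[torch_func] = impl
        return impl
    return decorator

def register_as_unsupported(torch_func):
    @register_function(torch_func)
    def _unsupported(*args, **kwargs):
        raise NotImplementedError(f'{torch_func} is registered but not supported by hidet')
    return _unsupported
```
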
In #342 `search_space=2` was accidentally disabled for the `bench_op.py`
regression script. Fixed it.
…332)

[Edit: The issue was encountered while attempting to compile the model
`yolov3`]

Currently the
[`setitem`](https://github.com/CentML/hidet/blob/566f0fe55f441326c3034b7eed44b3fa0b03f38d/python/hidet/graph/frontend/torch/register_functions.py#L280)
function in Hidet will fail on two special scenarios when `setvalue` is
a tensor:

1. When `setvalue` and `x` are of different dtypes, currently there will
be an error that looks like:
> RuntimeError: If-then-else operand 1 and 2 have different types
(hidet.float16 vs hidet.float32) ((((v < 0) || (2 <= v)) ? false :
(((v_1 < 0) || (3 <= v_1)) ? false : (((v_2 < 0) || (3 <= v_2)) ? false
: true))) ? setvalue[v_2, v_1, v] : data[v_2, v_1, v]), occurred when
interpreting operator.setitem with
>   setitem(tensor(...), (Ellipsis, slice(None, 2, None)), tensor(...))

Whereas in PyTorch `setvalue` appears to be cast to the same datatype
as `x` if possible (see the sketch after this list).

2. When `setvalue` and `x` are on different devices, currently this will
result in an error:

> RuntimeError: All inputs of an operator must be on the same device,
occurred when interpreting operator.setitem with
>   setitem(tensor(...), (Ellipsis, slice(None, 2, None)), tensor(...))

Whereas in PyTorch the `setvalue` is moved to the same device as `x`.
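
A sketch of the intended behaviour, using torch tensors for illustration (the actual fix lives in Hidet's `setitem` registration): cast `setvalue` to x's dtype and move it to x's device before assigning, matching PyTorch.

```python
import torch

def normalize_setvalue(x: torch.Tensor, setvalue: torch.Tensor) -> torch.Tensor:
    if setvalue.dtype != x.dtype:
        setvalue = setvalue.to(dtype=x.dtype)      # match dtype, as PyTorch does
    if setvalue.device != x.device:
        setvalue = setvalue.to(device=x.device)    # match device, as PyTorch does
    return setvalue
```
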
Previously, an error was encountered during a model compilation attempt:

> torch._dynamo.exc.BackendCompilerFailed: backend='hidet' raised:
> RuntimeError: Can not interpreting max given arguments:
>   max(tensor(...))
> Possible candidates are:
> torch_max_v3(x: hidet.Tensor, dim: Union[int, hidet.ir.expr.Expr],
keepdim: bool = False, *, out: Union[hidet.Tensor, Tuple[hidet.Tensor,
...], List[hidet.Tensor]] = None) -> Tuple[hidet.Tensor, hidet.Tensor]
> File
"/home/bolin/Desktop/hidet/python/hidet/graph/frontend/torch/register_functions.py",
line 1067

This happens even though we do have a
[function](https://github.com/CentML/hidet/blob/13a806608d40de2de1fcc682adeea8d204189f3c/python/hidet/graph/frontend/torch/register_functions.py#L1056-L1060)
that can interpret `torch.Tensor.max` with the described
arguments.
… for conv-bert-base model (#351)

Added support for `torch.multiply` and `torch.nn.functional.unfold`.
These ops are needed in `conv-bert-base` models.

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
- Fixed CUDA declaration and definition dtype mismatch

- Added 3 more LLMs: mpt-7b, codellama-7b and mixtral-8x7b. The first two
are tested and working fine.

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
Promote the NVIDIA Docker container to version 24.4, which brings PyTorch 2.3.

Regression passed 
https://github.com/CentML/hidet/actions/runs/9964867474
Introduce `add_hint_pass`. It adds `__builtin_assume(...)` to the generated .cu
code, which helps nvcc understand the bounds of `threadIdx` and `blockIdx` and
optimize the code better.

**Performance improvements.** 
Models
| model | latency | prev_latency | improvement (%) |
|--------|--------|--------|--------|
| bert-base-uncased | 19.8138 | 20.2316 | 2.109 |
| densenet121 | 35.1161 | 36.7627 | 4.689 |
| efficientnet_b0 | 18.9451 | 19.278 | 1.757 |
| mobilenet_v2 | 11.5944 | 11.8764 | 2.432 |
| resnet50 | 29.4878 | 29.9935 | 1.715 |
| vit_b_16 | 125.787 | 123.672 | -1.681 |

Operators
| operator | latency | prev_latency | improvement (%) |
|--------|--------|--------|--------|
| attn | 1.50402 | 1.50131 | -0.18 |
| attn | 0.219707 | 0.227568 | 3.578 |
| attn_mask_add | 1.5892 | 1.62516 | 2.263 |
| attn_mask_add | 0.226317 | 0.226507 | 0.084 |
| batch_matmul | 5.2399 | 5.11547 | -2.375 |
| batch_matmul | 0.0216016 | 0.0223425 | 3.43 |
| conv2d | 0.0347093 | 0.0341758 | -1.537 |
| conv2d | 0.310521 | 0.308458 | -0.664 |
| conv2d_gemm_f16 | 0.142542 | 0.146412 | 2.715 |
| conv2d_gemm_f16 | 2.0421 | 2.07043 | 1.387 |
| matmul_f16 | 2.22432 | 2.30458 | 3.608 |
| matmul_f16 | 0.00888628 | 0.00892615 | 0.449 |
| reduce | 0.01375 | 0.0138618 | 0.813 |
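
For intuition, a hedged sketch of the kind of hints the pass emits (string generation for illustration only; the real pass operates on Hidet IR):

```python
def emit_bound_hints(block_dim: int, grid_dim: int) -> str:
    """Emit __builtin_assume hints for a 1-D launch with known block/grid sizes."""
    return '\n'.join([
        f'__builtin_assume(threadIdx.x < {block_dim});',
        f'__builtin_assume(blockIdx.x < {grid_dim});',
    ])

print(emit_bound_hints(block_dim=256, grid_dim=1024))
```
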
…ents are supported by Hidet (#347)

Currently Hidet cannot compile the `doctr_reco_predictor` model due to the
unsupported `torch.Tensor.min`, even though we have already registered the
functionally equivalent `torch.min` function.

This PR registers all the missing `torch.Tensor` methods whose PyTorch
function equivalents are already registered.
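
A hedged sketch of the idea (the mapping and `register_method` below are illustrative; the real registration lives in Hidet's torch frontend): point each `torch.Tensor` method at its already-registered function equivalent.

```python
import torch

REGISTERED_METHODS = {}

def register_method(method, func):
    """Route a Tensor method to the interpreter of its registered function equivalent."""
    REGISTERED_METHODS[method] = func

# illustrative subset, not the full list from this PR
for method, func in [(torch.Tensor.min, torch.min), (torch.Tensor.max, torch.max)]:
    register_method(method, func)
```
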
When we used `__builtin_unreachable()` for hints, the information about
bounds was lost after some code. A workaround was introduced that added
additional hints after loops.

After switching to `__builtin_assume()` the issue disappeared. 
This PR removes the workaround. 

No performance changes. 
http://10.24.10.108:8868/Build_History 
66fd65c after
3f955de before
Recently the regression has been failing frequently because
`start_instance` fails with "Insufficient capacity".

Retry starting instances up to 300 times with a 60-second sleep between attempts.

Tested here
https://github.com/CentML/hidet/actions/runs/10000711025/job/27664169588
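
A sketch of the retry policy in Python (the actual logic lives in the CI workflow; `start_instance` here is a stand-in):

```python
import time

def start_instance_with_retries(start_instance, max_attempts: int = 300, sleep_seconds: int = 60):
    for attempt in range(1, max_attempts + 1):
        try:
            return start_instance()
        except RuntimeError:                 # e.g. "Insufficient capacity"
            if attempt == max_attempts:
                raise
            time.sleep(sleep_seconds)
```
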
@yaoyaoding
Member

Seems the CI failed on self-hosted runners.

@vadiklyutiy
Collaborator Author

Yes, @c-fteixeira is looking into it.

@yaoyaoding
Member

A kind reminder that we need to use "merge" instead of "squash and merge" for this PR.

@vadiklyutiy
Collaborator Author

Sure, I already asked Shang and he enabled the merge and rebase options on this repo.

@vadiklyutiy
Collaborator Author

@yaoyaoding @wangshangsam @hjjq
Tests passed.
Please take a look and approve if everything is OK.

Member

@yaoyaoding left a comment


LGTM, thanks @vadiklyutiy !

@vadiklyutiy merged commit cf5cadd into main Jul 23, 2024
19 checks passed
@vadiklyutiy deleted the sync-0-4-0 branch July 27, 2024 20:20