
Commit 1311191

Merge branch 'main' into speculative_inference
2 parents: a59e38a + 22afba6

20 files changed, +269 -92 lines

README.md

Lines changed: 37 additions & 18 deletions
@@ -8,14 +8,14 @@
 <br>
 </p>
 
-Generate text with distributed **Llama 2** (70B), **Falcon** (40B+), **BLOOM** (176B) (or their derivatives), and fine‑tune them for your own tasks &mdash; right from your desktop computer or Google Colab:
+Generate text with distributed **Llama 3.1** (up to 405B), **Mixtral** (8x22B), **Falcon** (40B+) or **BLOOM** (176B) and fine‑tune them for your own tasks &mdash; right from your desktop computer or Google Colab:
 
 ```python
 from transformers import AutoTokenizer
 from petals import AutoDistributedModelForCausalLM
 
 # Choose any model available at https://health.petals.dev
-model_name = "petals-team/StableBeluga2"  # This one is fine-tuned Llama 2 (70B)
+model_name = "meta-llama/Meta-Llama-3.1-405B-Instruct"
 
 # Connect to a distributed network hosting model layers
 tokenizer = AutoTokenizer.from_pretrained(model_name)
@@ -31,22 +31,26 @@ print(tokenizer.decode(outputs[0])) # A cat sat on a mat...
 🚀 &nbsp;<b><a href="https://colab.research.google.com/drive/1uCphNY7gfAUkdDrTx21dZZwCOUDCMPw8?usp=sharing">Try now in Colab</a></b>
 </p>
 
-🔏 **Privacy.** Your data will be processed with the help of other people in the public swarm. Learn more about privacy [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety). For sensitive data, you can set up a [private swarm](https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm) among people you trust.
+🦙 **Want to run Llama?** [Request access](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) to its weights, then run `huggingface-cli login` in the terminal before loading the model. Or just try it in our [chatbot app](https://chat.petals.dev).
 
-🦙 **Want to run Llama 2?** Request access to its weights at the ♾️ [Meta AI website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and 🤗 [Model Hub](https://huggingface.co/meta-llama/Llama-2-70b-hf), then run `huggingface-cli login` in the terminal before loading the model. Or just try it in our [chatbot app](https://chat.petals.dev).
+🔏 **Privacy.** Your data will be processed with the help of other people in the public swarm. Learn more about privacy [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety). For sensitive data, you can set up a [private swarm](https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm) among people you trust.
 
 💬 **Any questions?** Ping us in [our Discord](https://discord.gg/KdThf2bWVU)!
 
 ## Connect your GPU and increase Petals capacity
 
-Petals is a community-run system &mdash; we rely on people sharing their GPUs. You can check out [available models](https://health.petals.dev) and help serving one of them! As an example, here is how to host a part of [Stable Beluga 2](https://huggingface.co/stabilityai/StableBeluga2) on your GPU:
+Petals is a community-run system &mdash; we rely on people sharing their GPUs. You can help serving one of the [available models](https://health.petals.dev) or host a new model from 🤗 [Model Hub](https://huggingface.co/models)!
+
+As an example, here is how to host a part of [Llama 3.1 (405B) Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) on your GPU:
+
+🦙 **Want to host Llama?** [Request access](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) to its weights, then run `huggingface-cli login` in the terminal before loading the model.
 
 🐧 **Linux + Anaconda.** Run these commands for NVIDIA GPUs (or follow [this](https://github.com/bigscience-workshop/petals/wiki/Running-on-AMD-GPU) for AMD):
 
 ```bash
 conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
 pip install git+https://github.com/bigscience-workshop/petals
-python -m petals.cli.run_server petals-team/StableBeluga2
+python -m petals.cli.run_server meta-llama/Meta-Llama-3.1-405B-Instruct
 ```
 
 🪟 **Windows + WSL.** Follow [this guide](https://github.com/bigscience-workshop/petals/wiki/Run-Petals-server-on-Windows) on our Wiki.
@@ -56,27 +60,25 @@ python -m petals.cli.run_server petals-team/StableBeluga2
 ```bash
 sudo docker run -p 31330:31330 --ipc host --gpus all --volume petals-cache:/cache --rm \
     learningathome/petals:main \
-    python -m petals.cli.run_server --port 31330 petals-team/StableBeluga2
+    python -m petals.cli.run_server --port 31330 meta-llama/Meta-Llama-3.1-405B-Instruct
 ```
 
 🍏 **macOS + Apple M1/M2 GPU.** Install [Homebrew](https://brew.sh/), then run these commands:
 
 ```bash
 brew install python
 python3 -m pip install git+https://github.com/bigscience-workshop/petals
-python3 -m petals.cli.run_server petals-team/StableBeluga2
+python3 -m petals.cli.run_server meta-llama/Meta-Llama-3.1-405B-Instruct
 ```
 
 <p align="center">
 📚 &nbsp;<b><a href="https://github.com/bigscience-workshop/petals/wiki/FAQ:-Frequently-asked-questions#running-a-server">Learn more</a></b> (how to use multiple GPUs, start the server on boot, etc.)
 </p>
 
-💬 **Any questions?** Ping us in [our Discord](https://discord.gg/X7DgtxgMhc)!
-
-🦙 **Want to host Llama 2?** Request access to its weights at the ♾️ [Meta AI website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and 🤗 [Model Hub](https://huggingface.co/meta-llama/Llama-2-70b-hf), generate an 🔑 [access token](https://huggingface.co/settings/tokens), then add `--token YOUR_TOKEN_HERE` to the `python -m petals.cli.run_server` command.
-
 🔒 **Security.** Hosting a server does not allow others to run custom code on your computer. Learn more [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety).
 
+💬 **Any questions?** Ping us in [our Discord](https://discord.gg/X7DgtxgMhc)!
+
 🏆 **Thank you!** Once you load and host 10+ blocks, we can show your name or link on the [swarm monitor](https://health.petals.dev) as a way to say thanks. You can specify them with `--public_name YOUR_NAME`.
 
 ## How does it work?
@@ -120,22 +122,39 @@ Please see **Section 3.3** of our [paper](https://arxiv.org/pdf/2209.01188.pdf).
 
 Please see our [FAQ](https://github.com/bigscience-workshop/petals/wiki/FAQ:-Frequently-asked-questions#contributing) on contributing.
 
-### 📜 Citation
+### 📜 Citations
 
 Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, and Colin Raffel.
 [Petals: Collaborative Inference and Fine-tuning of Large Models.](https://arxiv.org/abs/2209.01188)
-_arXiv preprint arXiv:2209.01188,_ 2022.
+_Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)._ 2023.
 
 ```bibtex
-@article{borzunov2022petals,
+@inproceedings{borzunov2023petals,
   title = {Petals: Collaborative Inference and Fine-tuning of Large Models},
-  author = {Borzunov, Alexander and Baranchuk, Dmitry and Dettmers, Tim and Ryabinin, Max and Belkada, Younes and Chumachenko, Artem and Samygin, Pavel and Raffel, Colin},
-  journal = {arXiv preprint arXiv:2209.01188},
-  year = {2022},
+  author = {Borzunov, Alexander and Baranchuk, Dmitry and Dettmers, Tim and Riabinin, Maksim and Belkada, Younes and Chumachenko, Artem and Samygin, Pavel and Raffel, Colin},
+  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)},
+  pages = {558--568},
+  year = {2023},
   url = {https://arxiv.org/abs/2209.01188}
 }
 ```
 
+Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, and Colin Raffel.
+[Distributed inference and fine-tuning of large language models over the Internet.](https://arxiv.org/abs/2312.08361)
+_Advances in Neural Information Processing Systems_ 36 (2023).
+
+```bibtex
+@inproceedings{borzunov2023distributed,
+  title = {Distributed inference and fine-tuning of large language models over the {I}nternet},
+  author = {Borzunov, Alexander and Ryabinin, Max and Chumachenko, Artem and Baranchuk, Dmitry and Dettmers, Tim and Belkada, Younes and Samygin, Pavel and Raffel, Colin},
+  booktitle = {Advances in Neural Information Processing Systems},
+  volume = {36},
+  pages = {12312--12331},
+  year = {2023},
+  url = {https://arxiv.org/abs/2312.08361}
+}
+```
+
 --------------------------------------------------------------------------------
 
 <p align="center">
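
For reference, the README snippet touched by the first two hunks assembles into the following minimal example. The lines between the hunks (loading the model and calling `generate`) are not part of this diff; they are reconstructed from the hunk context (`print(tokenizer.decode(outputs[0])) # A cat sat on a mat...`) and should be read as a sketch, not as changes made by this commit:

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Choose any model available at https://health.petals.dev
model_name = "meta-llama/Meta-Llama-3.1-405B-Instruct"

# Connect to a distributed network hosting model layers (unchanged by this commit)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Run inference as if the model were local; the prompt and decoding details
# below are assumptions based on the hunk context, not shown in the diff
inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))  # A cat sat on a mat...
```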

setup.cfg

Lines changed: 4 additions & 5 deletions
@@ -32,22 +32,21 @@ package_dir =
 packages = find:
 python_requires = >=3.8
 install_requires =
-    torch>=1.12,<2.3.0
+    torch>=1.12
     bitsandbytes==0.41.1
     accelerate>=0.27.2
     huggingface-hub>=0.11.1,<1.0.0
     tokenizers>=0.13.3
-    transformers==4.41.2  # if you change this, please also change version assert in petals/__init__.py
+    transformers==4.43.1  # if you change this, please also change version assert in petals/__init__.py
     speedtest-cli==2.1.3
-    pydantic>=1.10,<2.0  # 2.0 is incompatible with hivemind yet
-    hivemind==1.1.10.post2
+    hivemind @ git+https://github.com/learning-at-home/hivemind.git@213bff98a62accb91f254e2afdccbf1d69ebdea9
     tensor_parallel==1.0.23
     humanfriendly
     async-timeout>=4.0.2
     cpufeature>=0.2.0; platform_machine == "x86_64"
     packaging>=20.9
     sentencepiece>=0.1.99
-    peft==0.5.0
+    peft==0.8.2
     safetensors>=0.3.1
     Dijkstar>=2.6.0
     numpy<2

src/petals/__init__.py

Lines changed: 2 additions & 2 deletions
@@ -22,8 +22,8 @@
 
 if not os.getenv("PETALS_IGNORE_DEPENDENCY_VERSION"):
     assert (
-        version.parse("4.41.2") <= version.parse(transformers.__version__) < version.parse("4.42.0")
-    ), "Please install a proper transformers version: pip install transformers>=4.41.2,<4.42.0"
+        version.parse("4.43.1") <= version.parse(transformers.__version__) < version.parse("4.44.0")
+    ), "Please install a proper transformers version: pip install transformers>=4.43.1,<4.44.0"
 
 
 def _override_bfloat16_mode_default():
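
The guard above only runs when `PETALS_IGNORE_DEPENDENCY_VERSION` is unset, so the new pin can be satisfied either by installing the range named in the assert message or, for local experiments, by setting that variable before importing the package. A minimal sketch of the second option (at your own risk, since other transformers versions are untested):

```python
import os

# Any non-empty value disables the transformers version assert above
os.environ["PETALS_IGNORE_DEPENDENCY_VERSION"] = "1"

import petals  # imports without checking transformers.__version__
```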

src/petals/client/inference_session.py

Lines changed: 1 addition & 1 deletion
@@ -336,7 +336,7 @@ def step(
     self._update_sequence(server_idx, block_idx, attempt_no)
 
     server_session = self._server_sessions[server_idx]
-    assert server_session.position == self.position
+    assert server_session.position == self.position, f"Position mismatch: {server_session.position} and {self.position}"
     inputs = server_session.step(
         inputs,
         prompts[server_session.span.start : server_session.span.end],

src/petals/data_structures.py

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 from enum import Enum
 from typing import Any, Dict, Optional, Sequence, Tuple
 
-import pydantic
+import pydantic.v1 as pydantic
 from hivemind import PeerID
 from hivemind.moe.expert_uid import ExpertUID
 
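
The import switch works because pydantic 2.x ships the old API under the `pydantic.v1` namespace, so existing v1-style models keep working once hivemind moves to a pydantic-2-compatible commit. A rough sketch of the pattern (the model below is illustrative, not the actual Petals data structure):

```python
import pydantic.v1 as pydantic  # pydantic 2.x re-exports the legacy v1 API here


class ExampleServerInfo(pydantic.BaseModel):
    # v1-style fields, validators, and .dict() keep working unchanged
    peer_id: str
    throughput: float = 0.0

    @pydantic.validator("throughput")
    def _non_negative(cls, value):
        assert value >= 0, "throughput must be non-negative"
        return value


info = ExampleServerInfo(peer_id="12D3KooWExample", throughput=1.5)
print(info.dict())
```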

src/petals/models/bloom/block.py

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 
 import torch
 from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask
-from transformers.models.bloom.modeling_bloom import BloomBlock, BloomModel, build_alibi_tensor
+from transformers.models.bloom.modeling_bloom import BloomBlock, build_alibi_tensor
 
 from petals.utils.misc import is_dummy
 

src/petals/models/bloom/config.py

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ class DistributedBloomConfig(BloomConfig, ClientConfig, PTuneConfig, LMHeadConfi
     def from_pretrained(
         cls, model_name_or_path: Union[str, os.PathLike, None], *args, dht_prefix: Optional[str] = None, **kwargs
     ):
-        logger.info("Make sure you follow the BLOOM's terms of use: https://bit.ly/bloom-license")
+        logger.info("Make sure you follow the BLOOM terms of use: https://bit.ly/bloom-license")
 
         loading_from_repo = model_name_or_path is not None and not os.path.isdir(model_name_or_path)
         if loading_from_repo and dht_prefix is None:

src/petals/models/llama/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -5,11 +5,13 @@
     DistributedLlamaForSequenceClassification,
     DistributedLlamaModel,
 )
+from petals.models.llama.speculative_model import DistributedLlamaForSpeculativeGeneration
 from petals.utils.auto_config import register_model_classes
 
 register_model_classes(
     config=DistributedLlamaConfig,
     model=DistributedLlamaModel,
     model_for_causal_lm=DistributedLlamaForCausalLM,
+    model_for_speculative=DistributedLlamaForSpeculativeGeneration,
     model_for_sequence_classification=DistributedLlamaForSequenceClassification,
 )
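
With this registration, the speculative-generation class becomes importable alongside the other distributed Llama classes. A hypothetical usage sketch follows; the diff only shows the import and registration, so the `from_pretrained` call below is an assumption based on how the sibling `Distributed*` classes are used, and the arguments for configuring the draft model are not shown here:

```python
from petals.models.llama import DistributedLlamaForSpeculativeGeneration

# Hypothetical: assumes the class follows the same from_pretrained interface
# as DistributedLlamaForCausalLM; not confirmed by this diff.
model = DistributedLlamaForSpeculativeGeneration.from_pretrained(
    "meta-llama/Meta-Llama-3.1-405B-Instruct"
)
```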

src/petals/models/llama/block.py

Lines changed: 2 additions & 2 deletions
@@ -15,7 +15,6 @@
     LlamaConfig,
     LlamaDecoderLayer,
     LlamaMLP,
-    LlamaModel,
     LlamaRMSNorm,
     repeat_kv,
     rotate_half,
@@ -132,7 +131,8 @@ class OptimizedLlamaDecoderLayer(LlamaDecoderLayer):
     def __init__(self, config: LlamaConfig):
         nn.Module.__init__(self)
         self.hidden_size = config.hidden_size
-        self.self_attn = OptimizedLlamaAttention(config=config)
+        self.self_attn = OptimizedLlamaAttention(config=config, layer_idx=0)
+        # layer_idx only matters for KV caching, and we re-implement it in Petals
         self.mlp = LlamaMLP(config)
         self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
         self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
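
For context on the new `layer_idx` argument: recent transformers attention modules use it to tell the shared `Cache` object which per-layer slot to update during KV caching, and Petals manages its own attention cache, so a fixed index is harmless here. A small sketch of the upstream behavior the comment refers to (tensor shapes are illustrative):

```python
import torch
from transformers import DynamicCache

cache = DynamicCache()
key = torch.zeros(1, 8, 4, 64)    # (batch, kv_heads, seq_len, head_dim)
value = torch.zeros(1, 8, 4, 64)

# layer_idx selects the per-layer slot this key/value pair is appended to
cache.update(key, value, layer_idx=0)
print(cache.get_seq_length(layer_idx=0))  # 4
```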

src/petals/models/llama/config.py

Lines changed: 2 additions & 2 deletions
@@ -27,8 +27,8 @@ def from_pretrained(
         cls, model_name_or_path: Union[str, os.PathLike, None], *args, dht_prefix: Optional[str] = None, **kwargs
     ):
         logger.info(
-            "Make sure you follow the LLaMA's terms of use: "
-            "https://bit.ly/llama2-license for LLaMA 2, https://bit.ly/llama-license for LLaMA 1"
+            "Make sure you follow the Llama terms of use: "
+            "https://llama.meta.com/llama3/license, https://llama.meta.com/llama2/license"
         )
 
         loading_from_repo = model_name_or_path is not None and not os.path.isdir(model_name_or_path)
