Commit f5f0bfe

update: readme + docs
1 parent 03d0041 commit f5f0bfe

File tree: 3 files changed (+63 -67 lines)

- README.md
- docs/assets/evals.gif
- docs/index.md

README.md (+31 -32)
@@ -1,5 +1,7 @@
 # Hemm: Holistic Evaluation of Multi-modal Generative Models
 
+[![](https://img.shields.io/badge/Hemm-docs-blue)](https://wandb.github.io/Hemm/)
+
 Hemm is a library for performing comprehensive benchmarks of text-to-image diffusion models on image quality and prompt comprehension, integrated with [Weights & Biases](https://wandb.ai/site) and [Weave](https://wandb.github.io/weave/).
 
 Hemm is highly inspired by the following projects:
@@ -8,78 +10,75 @@ Hemm is highly inspired by the following projects:
 - [T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation](https://karine-h.github.io/T2I-CompBench-new/)
 - [GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment](https://arxiv.org/abs/2310.11513)
 
-> [!WARNING]
-> Hemm is still in early development, the API is subject to change, expect things to break. If you are interested in contributing, please feel free to open an issue and/or raise a pull request.
+| ![](./docs/assets/evals.gif) |
+|:--:|
+| The evaluation pipeline will take each example, pass it through your application, and score the output on multiple custom scoring functions using [Weave Evaluation](https://wandb.github.io/weave/guides/core-types/evaluations). By doing this, you'll have a view of the performance of your model, and a rich UI to drill into individual outputs and scores. |
+
+## Leaderboards
+
+| Leaderboard | Weave Evals |
+|---|---|
+| [Rendering prompts with Complex Actions](https://wandb.ai/hemm-eval/mllm-eval-action/reports/Leaderboard-Rendering-prompts-with-Complex-Actions--Vmlldzo5Mjg2Nzky) | [Weave Evals](https://wandb.ai/hemm-eval/mllm-eval-action/weave/evaluations) |
 
 ## Installation
 
+First, we recommend you install PyTorch by visiting [pytorch.org/get-started/locally](https://pytorch.org/get-started/locally/).
+
 ```shell
-git clone https://github.com/soumik12345/Hemm
+git clone https://github.com/wandb/Hemm
 cd Hemm
 pip install -e ".[core]"
 ```
 
 ## Quickstart
 
-First let's publish a small subset of the MSCOCO validation set as a [Weave Dataset](https://wandb.github.io/weave/guides/core-types/datasets/).
-
-```python
-import weave
-from hemm.utils import publish_dataset_to_weave
-
-weave.init(project_name="t2i_eval")
-
-dataset_reference = publish_dataset_to_weave(
-    dataset_path="HuggingFaceM4/COCO",
-    prompt_column="sentences",
-    ground_truth_image_column="image",
-    split="validation",
-    dataset_transforms=[
-        lambda item: {**item, "sentences": item["sentences"]["raw"]}
-    ],
-    data_limit=5,
-)
-```
+First, you need to publish your evaluation dataset to Weave. Check out [this tutorial](https://weave-docs.wandb.ai/guides/core-types/datasets), which shows you how to publish a dataset to your project.
 
-| ![](./docs/assets/weave_dataset.gif) |
-|:--:|
-| [Weave Datasets](https://wandb.github.io/weave/guides/core-types/datasets/) enable you to collect examples for evaluation and automatically track versions for accurate comparisons. Easily update datasets with the UI and download the latest version locally with a simple API. |
-
-Next, you can evaluate Stable Diffusion 1.4 on image quality metrics as shown in the following code snippet:
+Once you have a dataset in your Weave project, you can evaluate a text-to-image generation model on the supported metrics, as shown in the following snippet.
 
 ```python
 import wandb
 import weave
 
+
 from hemm.eval_pipelines import BaseDiffusionModel, EvaluationPipeline
 from hemm.metrics.image_quality import LPIPSMetric, PSNRMetric, SSIMMetric
 
+
 # Initialize Weave and WandB
 wandb.init(project="image-quality-leaderboard", job_type="evaluation")
 weave.init(project_name="image-quality-leaderboard")
 
+
 # Initialize the diffusion model to be evaluated as a `weave.Model` using `BaseDiffusionModel`
+# The `BaseDiffusionModel` class uses a `diffusers.DiffusionPipeline` under the hood.
+# You can write your own `weave.Model` if your model is not diffusers-compatible.
 model = BaseDiffusionModel(diffusion_model_name_or_path="CompVis/stable-diffusion-v1-4")
 
+
 # Add the model to the evaluation pipeline
 evaluation_pipeline = EvaluationPipeline(model=model)
 
+
 # Add PSNR Metric to the evaluation pipeline
 psnr_metric = PSNRMetric(image_size=evaluation_pipeline.image_size)
 evaluation_pipeline.add_metric(psnr_metric)
 
+
 # Add SSIM Metric to the evaluation pipeline
 ssim_metric = SSIMMetric(image_size=evaluation_pipeline.image_size)
 evaluation_pipeline.add_metric(ssim_metric)
 
+
 # Add LPIPS Metric to the evaluation pipeline
 lpips_metric = LPIPSMetric(image_size=evaluation_pipeline.image_size)
 evaluation_pipeline.add_metric(lpips_metric)
 
+
+# Get the Weave dataset reference
+dataset = weave.ref("COCO:v0").get()
+
+
 # Evaluate!
-evaluation_pipeline(dataset="COCO:v0")
+evaluation_pipeline(dataset=dataset)
 ```
-
-| ![](./docs/assets/weave_leaderboard.gif) |
-|:--:|
-| The evaluation pipeline will take each example, pass it through your application and score the output on multiple custom scoring functions using [Weave Evaluation](https://wandb.github.io/weave/guides/core-types/evaluations). By doing this, you'll have a view of the performance of your model, and a rich UI to drill into individual ouputs and scores. |
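The updated Quickstart assumes that an evaluation dataset such as `COCO:v0` has already been published to the Weave project. As a minimal sketch of that step (not part of this commit; the dataset name, rows, and column names below are illustrative assumptions), publishing a small dataset with the Weave client could look like this:

```python
import weave

# Use the same Weave project that the evaluation pipeline initializes.
weave.init(project_name="image-quality-leaderboard")

# Hypothetical evaluation examples; the exact column names Hemm expects
# are not specified in this diff.
rows = [
    {"prompt": "a photo of a red bicycle leaning against a brick wall"},
    {"prompt": "two dogs playing with a frisbee on a sunny beach"},
]

# Publish the rows as a versioned Weave Dataset. The published version can then
# be fetched with a versioned reference such as `weave.ref("COCO:v0").get()`,
# as in the snippet above.
dataset = weave.Dataset(name="COCO", rows=rows)
weave.publish(dataset)
```

See the linked dataset tutorial for the authoritative walkthrough.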

docs/assets/evals.gif (28.6 MB)

docs/index.md (+32 -35)
@@ -12,78 +12,75 @@ Hemm is highly inspired by the following projects:
 
 - [GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment](https://arxiv.org/abs/2310.11513)
 
-!!! warning
-    Hemm is still in early development, the API is subject to change, expect things to break. If you are interested in contributing, please feel free to open an issue and/or raise a pull request.
+| ![](./assets/evals.gif) |
+|:--:|
+| The evaluation pipeline will take each example, pass it through your application, and score the output on multiple custom scoring functions using [Weave Evaluation](https://wandb.github.io/weave/guides/core-types/evaluations). By doing this, you'll have a view of the performance of your model, and a rich UI to drill into individual outputs and scores. |
+
+## Leaderboards
+
+| Leaderboard | Weave Evals |
+|---|---|
+| [Rendering prompts with Complex Actions](https://wandb.ai/hemm-eval/mllm-eval-action/reports/Leaderboard-Rendering-prompts-with-Complex-Actions--Vmlldzo5Mjg2Nzky) | [Weave Evals](https://wandb.ai/hemm-eval/mllm-eval-action/weave/evaluations) |
 
 ## Installation
 
+First, we recommend you install PyTorch by visiting [pytorch.org/get-started/locally](https://pytorch.org/get-started/locally/).
+
 ```shell
-git clone https://github.com/soumik12345/Hemm
+git clone https://github.com/wandb/Hemm
 cd Hemm
 pip install -e ".[core]"
 ```
 
 ## Quickstart
 
-First let's publish a small subset of the MSCOCO validation set as a [Weave Dataset](https://wandb.github.io/weave/guides/core-types/datasets/).
+First, you need to publish your evaluation dataset to Weave. Check out [this tutorial](https://weave-docs.wandb.ai/guides/core-types/datasets), which shows you how to publish a dataset to your project.
 
-```python
-import weave
-from hemm.utils import publish_dataset_to_weave
-
-weave.init(project_name="t2i_eval")
-
-dataset_reference = publish_dataset_to_weave(
-    dataset_path="HuggingFaceM4/COCO",
-    prompt_column="sentences",
-    ground_truth_image_column="image",
-    split="validation",
-    dataset_transforms=[
-        lambda item: {**item, "sentences": item["sentences"]["raw"]}
-    ],
-    data_limit=5,
-)
-```
-
-| ![](./assets/weave_dataset.gif) |
-|:--:|
-| [Weave Datasets](https://wandb.github.io/weave/guides/core-types/datasets/) enable you to collect examples for evaluation and automatically track versions for accurate comparisons. Easily update datasets with the UI and download the latest version locally with a simple API. |
-
-Next, you can evaluate Stable Diffusion 1.4 on image quality metrics as shown in the following code snippet:
+Once you have a dataset in your Weave project, you can evaluate a text-to-image generation model on the supported metrics, as shown in the following snippet.
 
 ```python
 import wandb
 import weave
 
-from hemm.eval_pipelines import BaseWeaveModel, EvaluationPipeline
-from hemm.metrics.image_quality import LPIPSMetric, PSNRMetric, SSIMMetric
+
+from hemm.eval_pipelines import BaseDiffusionModel, EvaluationPipeline
+from hemm.metrics.image_quality import LPIPSMetric, PSNRMetric, SSIMMetric
+
 
 # Initialize Weave and WandB
 wandb.init(project="image-quality-leaderboard", job_type="evaluation")
 weave.init(project_name="image-quality-leaderboard")
 
+
 # Initialize the diffusion model to be evaluated as a `weave.Model` using `BaseDiffusionModel`
-model = BaseWeaveModel(diffusion_model_name_or_path="CompVis/stable-diffusion-v1-4")
+# The `BaseDiffusionModel` class uses a `diffusers.DiffusionPipeline` under the hood.
+# You can write your own `weave.Model` if your model is not diffusers-compatible.
+model = BaseDiffusionModel(diffusion_model_name_or_path="CompVis/stable-diffusion-v1-4")
+
 
 # Add the model to the evaluation pipeline
 evaluation_pipeline = EvaluationPipeline(model=model)
 
+
 # Add PSNR Metric to the evaluation pipeline
 psnr_metric = PSNRMetric(image_size=evaluation_pipeline.image_size)
 evaluation_pipeline.add_metric(psnr_metric)
 
+
 # Add SSIM Metric to the evaluation pipeline
 ssim_metric = SSIMMetric(image_size=evaluation_pipeline.image_size)
 evaluation_pipeline.add_metric(ssim_metric)
 
+
 # Add LPIPS Metric to the evaluation pipeline
 lpips_metric = LPIPSMetric(image_size=evaluation_pipeline.image_size)
 evaluation_pipeline.add_metric(lpips_metric)
 
+
+# Get the Weave dataset reference
+dataset = weave.ref("COCO:v0").get()
+
+
 # Evaluate!
-evaluation_pipeline(dataset="COCO:v0")
+evaluation_pipeline(dataset=dataset)
 ```
-
-| ![](./assets/weave_leaderboard.gif) |
-|:--:|
-| The evaluation pipeline will take each example, pass it through your application and score the output on multiple custom scoring functions using [Weave Evaluation](https://wandb.github.io/weave/guides/core-types/evaluations). By doing this, you'll have a view of the performance of your model, and a rich UI to drill into individual ouputs and scores. |
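Both files now note that you can write your own `weave.Model` when your generator is not diffusers-compatible. Below is a minimal sketch of such a wrapper, assuming (this is not confirmed by the diff) that the pipeline only needs a `predict` op mapping a prompt to a generated image; check `BaseDiffusionModel` in the Hemm source for the exact interface `EvaluationPipeline` expects.

```python
import weave
from PIL import Image


class CustomImageGenerationModel(weave.Model):
    # Configuration is declared as typed fields so Weave can track and version it.
    model_name_or_path: str
    image_size: int = 512

    @weave.op()
    def predict(self, prompt: str) -> dict:
        # Replace this placeholder with a call to your own inference stack
        # (an API client, a custom PyTorch pipeline, etc.).
        image = Image.new("RGB", (self.image_size, self.image_size), color="white")
        return {"image": image}


# Hypothetical usage mirroring the Quickstart:
# evaluation_pipeline = EvaluationPipeline(model=CustomImageGenerationModel(model_name_or_path="my-model"))
```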
