Add Prodigy OpenAI project #180

Closed · wants to merge 33 commits
33 commits
a4f9f92
Initial project setup
ljvmiranda921 Feb 21, 2023
8733fc3
Add .gitignore
ljvmiranda921 Feb 21, 2023
9cac816
Add dependencies
ljvmiranda921 Feb 21, 2023
ab5538d
Include spacy-transformers in dependencies
ljvmiranda921 Feb 21, 2023
9e8cd45
Add training configuration
ljvmiranda921 Feb 21, 2023
6fe164e
Update project.yml and add commands for training
ljvmiranda921 Feb 21, 2023
f36bb9c
Add script for converting to JSONL
ljvmiranda921 Feb 21, 2023
b3a1c1a
Add Prodigy OpenAI related scripts
ljvmiranda921 Feb 21, 2023
2157149
Add initial setup for evaluate gpt script
ljvmiranda921 Feb 21, 2023
950e609
Require spaCy to be <3.5.0
ljvmiranda921 Feb 21, 2023
35cc2df
Include the actual labels in the prompt
ljvmiranda921 Feb 21, 2023
8463a0d
Remove get-dataset dependency on assets
ljvmiranda921 Feb 22, 2023
3f75a92
Ship zero-shot predictions from OpenAI
ljvmiranda921 Feb 22, 2023
1a2bfa7
Update the project.yml
ljvmiranda921 Feb 22, 2023
f680695
Fix test set name in project.yml
ljvmiranda921 Feb 22, 2023
76b1834
Update evaluation script to check on spans
ljvmiranda921 Feb 22, 2023
8c984fc
Include span information when converting to JSONL
ljvmiranda921 Feb 23, 2023
c16aa00
Add train-curve command
ljvmiranda921 Feb 23, 2023
e011d2d
Update train-curve command and add a clean command
ljvmiranda921 Feb 24, 2023
95ccb15
Add plotext to deps for train-curve
ljvmiranda921 Feb 24, 2023
edb2591
Sync recipes based on v1.12 PR
ljvmiranda921 Feb 27, 2023
6127f89
Add initial description
ljvmiranda921 Feb 27, 2023
cc7a1df
Add command for ner.openai.correct
ljvmiranda921 Mar 1, 2023
02d9486
Add NER manual command
ljvmiranda921 Mar 1, 2023
f83dc40
Make cmd description explicit
ljvmiranda921 Mar 7, 2023
a7d1345
Update label names so they map properly in the UI
ljvmiranda921 Mar 7, 2023
ebe0c7f
Create LABELS file to easily pass them in prodigy
ljvmiranda921 Mar 20, 2023
0aa592e
Make evaluate_gpt more generalisable
ljvmiranda921 Mar 22, 2023
c361b56
Fix incorrect entity label
ljvmiranda921 Mar 22, 2023
c041ffc
Make assert condition less strict
ljvmiranda921 Mar 22, 2023
295fedf
Add filter process for evaluation
ljvmiranda921 Mar 22, 2023
1a95234
Generalize the evaluation script
ljvmiranda921 Apr 18, 2023
3d7b919
Accept multiple inputs for get_batches
ljvmiranda921 Apr 18, 2023
5 changes: 5 additions & 0 deletions integrations/prodigy_openai/.gitignore
@@ -0,0 +1,5 @@
assets
corpus
metrics
training
data
49 changes: 49 additions & 0 deletions integrations/prodigy_openai/README.md
@@ -0,0 +1,49 @@
<!-- SPACY PROJECT: AUTO-GENERATED DOCS START (do not remove) -->

# 🪐 spaCy Project: Benchmarking OpenAI datasets

## 📋 project.yml

The [`project.yml`](project.yml) defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
[spaCy projects documentation](https://spacy.io/usage/projects).

### ⏯ Commands

The following commands are defined by the project. They
can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run).
Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `get-dataset` | Preprocess the AnEM dataset |
| `train` | Train a NER model from the AnEM corpus |
| `evaluate` | Evaluate results for the NER model |
| `openai-preprocess` | Convert from spaCy format into JSONL |
| `openai-predict` | Fetch zero-shot NER results using Prodigy's GPT-3 integration |
| `openai-evaluate` | Evaluate zero-shot GPT-3 predictions |
| `train-curve` | Train a model at 25%, 50%, and 75% of the training data |
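Individual commands are run from the project root; a usage sketch (the command names come from the table above, and `--force` re-runs a command even when its inputs are unchanged):

```shell
# Fetch and preprocess the AnEM dataset, then train the NER pipeline
spacy project run get-dataset
spacy project run train

# Commands are skipped when inputs are unchanged; force a re-run with -F/--force
spacy project run train --force
```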

### ⏭ Workflows

The following workflows are defined by the project. They
can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run)
and will run the specified commands in order. Commands are only re-run if their
inputs have changed.

| Workflow | Steps |
| --- | --- |
| `ner` | `get-dataset` &rarr; `train` &rarr; `evaluate` |
| `gpt` | `openai-preprocess` &rarr; `openai-predict` &rarr; `openai-evaluate` |

### 🗂 Assets

The following assets are defined by the project. They can
be fetched by running [`spacy project assets`](https://spacy.io/api/cli#project-assets)
in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/span-labeling-datasets` | Git | The span-labeling-datasets repository that contains loaders for AnEM |
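Fetching the assets and then running a full workflow might look like this (a sketch; the workflow names come from the tables above):

```shell
# Clone the span-labeling-datasets repository into assets/
spacy project assets

# Run the supervised NER baseline end to end
spacy project run ner

# Run the zero-shot GPT-3 pipeline for comparison
spacy project run gpt
```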

<!-- SPACY PROJECT: AUTO-GENERATED DOCS END (do not remove) -->
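The `openai-preprocess` command converts corpora from spaCy's binary format into JSONL for the OpenAI recipes. A minimal sketch of that kind of conversion, using only the standard library and hypothetical `(text, spans)` records rather than the project's actual script or the real AnEM labels:

```python
import json

# Hypothetical examples: (text, list of (start, end, label) character spans)
records = [
    ("The patient's liver was enlarged.", [(14, 19, "ORGAN")]),
    ("A biopsy of the lymph node was taken.", [(16, 26, "ANATOMICAL_SYSTEM")]),
]

def to_jsonl(records):
    """Render records as JSONL, one dict per line with a Prodigy-style spans field."""
    lines = []
    for text, spans in records:
        lines.append(json.dumps({
            "text": text,
            "spans": [{"start": s, "end": e, "label": l} for s, e, l in spans],
        }))
    return "\n".join(lines)

print(to_jsonl(records))
```

Each output line is an independent JSON object, which is what JSONL-based tools such as Prodigy expect to stream.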
143 changes: 143 additions & 0 deletions integrations/prodigy_openai/configs/ner.cfg
@@ -0,0 +1,143 @@
[paths]
train = null
dev = null
vectors = "en_core_web_lg"
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
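The `train` command in project.yml normally fills in the null `[paths]` values; invoking the config directly would look something like this (a sketch with assumed corpus paths, matching the directories in the `.gitignore` above):

```shell
python -m spacy train configs/ner.cfg \
  --output training/ \
  --paths.train corpus/train.spacy \
  --paths.dev corpus/dev.spacy
```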
145 changes: 145 additions & 0 deletions integrations/prodigy_openai/configs/ner_trf.cfg
@@ -0,0 +1,145 @@
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = false

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
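Since this transformer variant sets `gpu_allocator = "pytorch"` and uses `roberta-base`, it is meant to run on GPU; an invocation sketch (assumed paths and GPU device 0):

```shell
python -m spacy train configs/ner_trf.cfg \
  --output training/trf/ \
  --paths.train corpus/train.spacy \
  --paths.dev corpus/dev.spacy \
  --gpu-id 0
```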