-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add enjim dataset(s) #9
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely top notch work, thank you! Just a couple minor points to address.
from toolbox.core.models import Episode, Turn | ||
from toolbox.datasets.enjim import EnjimDataset, EnjimAgent, setup_sqlite | ||
from toolbox.modules import BaseModule | ||
from toolbox.modules.registry import ModuleRegistry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you were going to implement the registry pattern for the modules but backed away from doing that in this PR?
This import causes a crash since the file doesn't exist, but removing it and the reference below fixes it since it's not used elsewhere.
else self.summarize(spoken, self.settings['max_scenario_chars'], None, None), | ||
speaker=speaker, human_speaker=speaker != bot_name) for speaker, spoken in | ||
posts] | ||
participant_personas = {ag.name: self.summarize_char(ag, self.settings['max_persona_chars'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple minor issues with the personas:
- Empty personas are generated sometimes, resulting in processed text that looks like:
Which is a little wasteful token-wise and might be confusing to the model when training.
A's Persona: B's Persona: (actual persona text here) Scenario: (proper scenario here) ...
- They seem to be unparsed query results instead of cleaned text, so instead of generating
They look like:
A's Persona: A is like this and that
This is also happening onA's Persona: ("A is like this and that", )
world_scenario
.
To run/set up:
- download sqlite from https://www.sqlite.org/download.html and install
- install/update dependencies however you're managing the .toml requirements locally
- update catbox paths in config.yaml
- run normally
To sanity-check the output (basically checks for bizarrely-formatted garbage):
grep -E "<[^* '\n3A-Z=<\.]+>|\[[^A-Z* 0-9r\.]{0,10}\]|&[^ \n]{1,10};|:[^ \n0-9]{1,10}:" rev-020c8d0-args-d93e21a.jsonl -c