encoder
takes in sentence and transforms it into a feature tensor/vector
takes into account words around it
good for
masked language models
sequence classification
models
bert
roberta
albert
decoder
converts input into a feature tensor/vector
takes into account only the word and the previous words
models
gpt2
good for
causal language modeling
sequence to sequence
encoder provides meaning
decoder produces text
good for
translation
summarization
can be a mix of encoder and decoder models
models
mt5
bart
prophetNet
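The difference between "takes into account words around it" (encoder) and "only the word and the previous words" (decoder) can be shown as which positions each word is allowed to look at. This is a toy sketch in plain Python, not how any of the listed models is implemented; the function name visible_positions is made up for illustration.

```python
def visible_positions(seq_len, causal):
    """Return, for each position, the list of positions it may look at."""
    rows = []
    for i in range(seq_len):
        if causal:
            # decoder-style: only the current word and the ones before it
            rows.append(list(range(i + 1)))
        else:
            # encoder-style: every word in the sentence, on both sides
            rows.append(list(range(seq_len)))
    return rows


tokens = ["the", "cat", "sat"]
encoder_view = visible_positions(len(tokens), causal=False)
decoder_view = visible_positions(len(tokens), causal=True)

for tok, enc, dec in zip(tokens, encoder_view, decoder_view):
    print(tok,
          "| encoder sees", [tokens[j] for j in enc],
          "| decoder sees", [tokens[j] for j in dec])
```

This is also why a decoder suits causal language modeling: at each position it can only use what has already been written.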
timeline
RNN model using scrappy techniques (published 2021)
https://towardsdatascience.com/create-your-own-artificial-shakespeare-in-10-minutes-with-natural-language-processing-1fde5edc8f28
developed a conversations mining bot to extract conversations in a discord server
could not use the bot for DMs so had to use a risky open source software for it
https://github.com/Tyrrrz/DiscordChatExporter
read up on what task chatting may belong to
https://cobusgreyling.medium.com/how-to-use-huggingface-in-your-chatbot-216ecc9a2170
came across an article for deploying on aws
https://towardsdatascience.com/deploy-chatbots-in-web-sites-using-hugging-face-dlcs-be59a86fd7ba
found the task: conversational
found an article handling the same challenge
https://www.freecodecamp.org/news/discord-ai-chatbot/
cleaned the content from any unfamiliar text
took a bit longer than expected
the original reddit dataset did not contain any urls, emojis, or special patterns/symbols
https://colab.research.google.com/drive/15wa925dj7jvdvrz8_z3vU7btqAFQLVlG#scrollTo=CjZaN5ilgd-z
provided a guide for pretraining the model
Seq2Seq pipeline was not suitable for the Dialo model
Could not build the bot with only huggingface, so had to turn to the following guide
https://colab.research.google.com/drive/15wa925dj7jvdvrz8_z3vU7btqAFQLVlG#scrollTo=CjZaN5ilgd-z
After hours of training on colab I had only a few minutes remaining before my runtime disconnected and lost all my progress
lesson: make sure your model is saved throughout the process
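A minimal sketch of that lesson, assuming a training loop that can be resumed from a saved step counter. The JSON file stands in for a real model checkpoint, and train_with_checkpoints, save_every and ckpt_path are hypothetical names, not part of the guide or of the bot's code.

```python
import json
import os
import tempfile


def train_with_checkpoints(total_steps, save_every, ckpt_path):
    """Toy loop: resume from ckpt_path if it exists, save every `save_every` steps."""
    state = {"step": 0}
    if os.path.exists(ckpt_path):
        # pick up where the previous run stopped
        with open(ckpt_path) as f:
            state = json.load(f)
    while state["step"] < total_steps:
        state["step"] += 1  # stand-in for one real training step
        if state["step"] % save_every == 0 or state["step"] == total_steps:
            with open(ckpt_path, "w") as f:
                json.dump(state, f)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
final_state = train_with_checkpoints(total_steps=10, save_every=3, ckpt_path=ckpt)
print(final_state)
```

On colab the checkpoint path would have to live somewhere that survives a runtime disconnect, otherwise the saved file is lost along with the session.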
The way the data was split into train test was dynamic and not static
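If a static split is wanted, one option is to shuffle with a fixed seed so the same rows land in train and test on every run. This is a sketch under that assumption; static_split, test_fraction and seed are hypothetical names and not what the guide uses.

```python
import random


def static_split(rows, test_fraction=0.1, seed=42):
    """Split rows into (train, test); the same seed always gives the same split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # keep at least one test row so evaluation never runs on an empty set
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))
train_rows, test_rows = static_split(rows)
print(len(train_rows), len(test_rows))
```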
created a manual pipeline that flows without issues
still need to improve and standardize the inputs before automating it
the .ipynb files display all the output data on git, therefore I have to restart git
learned to develop cli with config to then implement
need to restructure code to work as a singular bot
begin by directing user to create config file
discord bot API key
discord guild name
discord user to mimic
huggingface API key
model name
amt of context
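A rough sketch of what such a config file could look like with the standard configparser module. The notes only mention a [general] section; the [discord] and [huggingface] section names and every key name here are assumptions for illustration, and the values are left blank on purpose.

```python
import configparser
import io

config = configparser.ConfigParser()
# [general] is named in the notes; the key name is hypothetical
config["general"] = {"context_length": "2"}
# hypothetical sections and keys for the remaining fields
config["discord"] = {"api_key": "", "guild": "", "target_user": ""}
config["huggingface"] = {"api_key": "", "model_name": ""}

# write to an in-memory buffer instead of a real file
buffer = io.StringIO()
config.write(buffer)
config_text = buffer.getvalue()
print(config_text)
```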
create tests and raise errors for mine_data
adding more init tests
create mine tests and raise errors
need to deal with huggingface api key
use official cli
set up the default init path to guide to config path
if default init is config then simply add it to [general]
need to add a way to extrapolate data by randomizing context
added data extrapolation option
finished
added an option to extrapolate data
separated mine and preprocess
added preprocess tests & updated mine tests
implemented a new form of passing a path to the cli which can reference the config
at some point i need to flip the guild and session directories to read "../session/guild"
test to make sure it works on CPU
ensure saving of models happens in data_path
upload the following issue at some point
i cannot delete folders as that will require permissions, so use_temp_dir won't happen
https://stackoverflow.com/questions/34716996/cant-remove-a-file-which-created-by-tempfile-mkstemp-on-windows
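The linked question is about tempfile.mkstemp on Windows: mkstemp returns an open file descriptor, and Windows refuses to delete a file that is still open. Whether that is the same cause as the folder permission problem above is not established here; this sketch only shows the mkstemp case.

```python
import os
import tempfile

# mkstemp returns an OPEN descriptor plus the path
fd, path = tempfile.mkstemp(suffix=".txt")
try:
    # os.fdopen takes ownership of fd and closes it when the block exits
    with os.fdopen(fd, "w") as handle:
        handle.write("scratch data")
finally:
    # the descriptor is closed by now, so removal also works on Windows
    os.remove(path)

removed = not os.path.exists(path)
print(removed)
```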
upload the following issue at some point https://github.com/huggingface/hub-docs/issues/new?assignees=&labels=new-task&template=adding-a-new-task-.md&title=Tracking+integration+for+%5BTASK%5D
When searching through docs, the "source documentation" is sometimes not suggested.
example: https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.HfApi.create_repo
steps to scripting train
get config file
eliminate unnecessary portions or at least clean them
pass api keys all throughout the app
updated cli init
clean data
train
clean out excess outputs
fine tuning from
"Fine tuning model from: https://huggingface.co/{MODEL_FROM}\n"
initialize the repo
"Huggingface repo initialized at:\n{}\n"
or
"Huggingface repo already exists at:\n{}\n"
begins with a benchmark test result
"Initial benchmark test:\n{}\n"
Begin training
"Training started at:\n{}\n"
Iteration
"Iteration #{}:"
New benchmark
"Benchmark after iteration #{}:\n{}\n"
Upload model begin
"Uploading model to:\n{}\n"
Upload model finished
"Uploading finished, view it at:\n{}\n"
End training
"Training ended at:\n{}\n"
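One way to keep those messages in a single place is a lookup table of templates. The template strings are copied from the list above; the MESSAGES dict, the key names, the render helper and the example model id are hypothetical.

```python
MESSAGES = {
    "fine_tuning": "Fine tuning model from: https://huggingface.co/{MODEL_FROM}\n",
    "repo_initialized": "Huggingface repo initialized at:\n{}\n",
    "repo_exists": "Huggingface repo already exists at:\n{}\n",
    "initial_benchmark": "Initial benchmark test:\n{}\n",
    "training_started": "Training started at:\n{}\n",
    "iteration": "Iteration #{}:",
    "benchmark": "Benchmark after iteration #{}:\n{}\n",
    "uploading": "Uploading model to:\n{}\n",
    "upload_finished": "Uploading finished, view it at:\n{}\n",
    "training_ended": "Training ended at:\n{}\n",
}


def render(key, *args, **kwargs):
    """Fill the template stored under `key` with the given values."""
    return MESSAGES[key].format(*args, **kwargs)


# placeholder model id, not a real repo
print(render("fine_tuning", MODEL_FROM="some-user/some-model"))
print(render("iteration", 1))
```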
left off logging at "def main"
ensure continued training works
run tests
the following error could have easily been overcome by hinting "dirs_exist_ok" in shutil
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\Users\\1seba\\AppData\\Roaming\\mimicbot\\data\\colab'
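For reference, shutil.copytree accepts dirs_exist_ok=True (Python 3.8+), which lets it copy into a destination directory that already exists instead of raising FileExistsError. The sketch below uses throwaway temp directories, not the real mimicbot data paths.

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
src = os.path.join(root, "src")
dst = os.path.join(root, "dst")
os.makedirs(src)
os.makedirs(dst)  # destination already exists, as in the error above

with open(os.path.join(src, "a.txt"), "w") as f:
    f.write("x")

# without dirs_exist_ok=True this call raises FileExistsError
shutil.copytree(src, dst, dirs_exist_ok=True)
print(os.listdir(dst))
```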
it seems as if huggingface restricts each user to a limited number of simultaneous requests
I was uploading a model while fetching a response from their inference api and I got an internal server error
CONTINUE IMPLEMENTING "ACTIVATE". LEFT OFF AFTER SETTING UP CLEAN_DF
had to reorganize the context_window and context_length so that there would always be a context length used for the same purpose (determining context length)
will need to add more ModelSave to make up for the environment variables
I was using the deprecated VSC extension Python-autopep8 and it replaced entire files
emojis in channels cause breaks
initialize mimicbot
create cli
have cli run discord.bot
discord bot should call from users api
create end to end cli command
create cli
have cli run mimicbot
create test
deal with out of memory issues
create a mimicbot-deploy repo
create github ci/cd
after navigating the errors my commit history has become a mess
create train test
create activate test
create e2e test
maybe modify num train epochs
configure github actions
[{"url": "https://huggingface.co/SebastianS/mimicbot-MetalSebastian", "context_length": 2, "data_path": "C:\\Projects\\MimicBot\\data\\The Cocky Crew\\MetalSebastian"}, {"url": "https://huggingface.co/SebastianS/mimicbot-MetalSebastian", "context_length": 2, "data_path": "C:\\Users\\1seba\\AppData\\Roaming\\mimicbot\\data\\The Cocky Crew\\MetalSebastian"}, {"url": "https://huggingface.co/SebastianS/mimicbot-MetalTaj", "context_length": 2, "data_path": "C:\\Users\\1seba\\AppData\\Roaming\\mimicbot\\data\\The Cocky Crew\\MetalTaj"}, {"url": "https://huggingface.co/SebastianS/mimicbot-MetalJessie", "context_length": 2, "data_path": "C:\\Users\\1seba\\AppData\\Roaming\\mimicbot\\data\\The Cocky Crew\\MetalJessie"}, {"url": "https://huggingface.co/SebastianS/mimicbot-MetalSam_v2", "context_length": 2, "data_path": "C:\\Users\\1seba\\AppData\\Roaming\\mimicbot\\data\\The Cocky Crew\\MetalSam-v2"}]
tutorial
cli guide not read
intents error
discord.errors.PrivilegedIntentsRequired: Shard ID None is requesting privileged intents that have not been explicitly enabled in the developer portal. It is recommended to go to https://discord.com/developers/applications/ and explicitly enable the privileged intents within your application's page.
fix low test data error
Traceback (most recent call last):
  File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Projects\MimicBot\mimicbotwrapper\mimicbot\__main__.py", line 7, in <module>
    main()
  File "C:\Projects\MimicBot\mimicbotwrapper\mimicbot\__main__.py", line 4, in main
    cli.app(prog_name=__app_name__)
  File "C:\Projects\MimicBot\mimicbotwrapper\env\lib\site-packages\typer\main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "C:\Projects\MimicBot\mimicbotwrapper\env\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Projects\MimicBot\mimicbotwrapper\env\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Projects\MimicBot\mimicbotwrapper\env\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Projects\MimicBot\mimicbotwrapper\env\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Projects\MimicBot\mimicbotwrapper\env\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "C:\Projects\MimicBot\mimicbotwrapper\env\lib\site-packages\typer\main.py", line 500, in wrapper
    return callback(**use_params) # type: ignore
  File "C:\Projects\MimicBot\mimicbotwrapper\mimicbot\cli.py", line 318, in train_model
    res, error = train.train(session_path)
  File "C:\Projects\MimicBot\mimicbotwrapper\mimicbot\train.py", line 729, in train
    main(trn_df, val_df, args, True)
  File "C:\Projects\MimicBot\mimicbotwrapper\mimicbot\train.py", line 716, in main
    result = evaluate(args, model, tokenizer,
  File "C:\Projects\MimicBot\mimicbotwrapper\mimicbot\train.py", line 587, in evaluate
    eval_loss = eval_loss / nb_eval_steps
ZeroDivisionError: float division by zero
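The division fails when nb_eval_steps is 0, which presumably happens when the validation data is too small to produce a single evaluation batch (an assumption based on the "low test data" heading, not verified against train.py). A possible guard, with mean_eval_loss as a hypothetical helper name, is to fail early with a readable message:

```python
def mean_eval_loss(total_loss, nb_eval_steps):
    """Average the accumulated loss, refusing to divide by zero steps."""
    if nb_eval_steps == 0:
        raise ValueError(
            "no evaluation batches were run; "
            "the validation set is probably too small"
        )
    return total_loss / nb_eval_steps


print(mean_eval_loss(6.0, 3))
```

Raising a clear error keeps the cli from surfacing a bare ZeroDivisionError to the user.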
requirements section
discord-independent-mimic
create a command for converting data into proper format
proper format
contains only content and author_id
in order from most recent to oldest (top to bottom)
will also contain a members file if it does not exist
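A sketch of the conversion described above, assuming each raw message is a dict that carries a sortable timestamp next to content and author_id. The timestamp field and the to_proper_format name are hypothetical; the members file is not covered here.

```python
def to_proper_format(raw_messages):
    """Keep only content and author_id, ordered from most recent to oldest."""
    ordered = sorted(raw_messages, key=lambda m: m["timestamp"], reverse=True)
    return [
        {"content": m["content"], "author_id": m["author_id"]}
        for m in ordered
    ]


raw = [
    {"content": "hello", "author_id": 1, "timestamp": 100, "channel": "a"},
    {"content": "hi back", "author_id": 2, "timestamp": 200, "channel": "a"},
]
proper = to_proper_format(raw)
print(proper)
```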
adapt the preprocess command and the train command to use the proper format
option at init for whether it is an independent bot
True = path to messages file
False = continue to discord settings
step mine data will be skipped if True
mine will convert raw_data to raw_template_messages
introduced possibility of context to be pulled from different channel at channel splits
will need to change init dramatically
will change all discord inputs into a callback inside the custom training data
should i make a new init command
need to eliminate dependence on guild for path
need to change target user to be "general" not "discord"
the messages context may be fed in reverse
connect the files for mimicbot and mimicbot-deploy
to have a single source of truth
fix enter name for member with id by having the default be the id
create a pip package for mimicbot-chat
update mimicbot-deploy
allow bots to mention
create a pip package for mimicbot
fix tests on github actions