notes.txt

encoder
  takes in a sentence and transforms it into a feature tensor/vector
  takes into account the words on both sides of each word
  good for
    masked language models
    sequence classification
  models
    bert
    roberta
    albert
decoder
  converts input into a feature tensor/vector
  takes into account only the word and the previous words
  models
    gpt2
  good for
    causal language modeling
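The encoder/decoder difference above comes down to which positions each word is allowed to look at. A minimal sketch using plain Python lists, not any real model's code; the function names are hypothetical:

```python
def bidirectional_mask(n):
    """Encoder-style: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Each row is one word; a 1 marks a word it is allowed to see.
    for row in causal_mask(4):
        print(row)
```

A 1 in row i, column j means word i can use word j. The causal mask is lower triangular, which is why a decoder only sees the word itself and the previous words.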
sequence to sequence
  encoder provides meaning
  decoder produces text
  good for
    translation
    summarization
  can be a mix of encoder and decoder models
  models
    mt5
    bart
    prophetNet

timeline
  RNN model using scrappy techniques (published 2021)
    https://towardsdatascience.com/create-your-own-artificial-shakespeare-in-10-minutes-with-natural-language-processing-1fde5edc8f28
  developed a conversations mining bot to extract conversations in a discord server
  could not use the bot for DMs so had to use a risky open source software for it
    https://github.com/Tyrrrz/DiscordChatExporter
  read up on which task chatting may belong to
    https://cobusgreyling.medium.com/how-to-use-huggingface-in-your-chatbot-216ecc9a2170
  came across an article for deploying on aws
    https://towardsdatascience.com/deploy-chatbots-in-web-sites-using-hugging-face-dlcs-be59a86fd7ba
  found the task: conversational
  found an article handling the same challenge
    https://www.freecodecamp.org/news/discord-ai-chatbot/
  cleaned the content from any unfamiliar text
    took a bit longer than expected
    the original reddit dataset did not contain any urls, emojis, or special patterns/symbols
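A rough sketch of the kind of cleaning described above, using only the standard library. The patterns are assumptions about what counts as "unfamiliar text" (urls, Discord-style emoji/mention tags, and non-ASCII characters), and clean_message is a hypothetical name, not the function used in the project:

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Discord markup: custom emojis such as <:name:id> or <a:name:id>,
# and mentions such as <@id>, <@!id>, <@&id>, <#id>.
DISCORD_TAG_RE = re.compile(r"<a?:\w+:\d+>|<[@#][!&]?\d+>")
# Crude: removes Unicode emojis, but also any accented letters.
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")


def clean_message(text):
    """Strip urls, Discord tags and non-ASCII characters, then tidy whitespace."""
    text = URL_RE.sub("", text)
    text = DISCORD_TAG_RE.sub("", text)
    text = NON_ASCII_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

Dropping all non-ASCII characters is the bluntest option; a server that chats in a language with accents would need a narrower emoji pattern instead.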
  https://colab.research.google.com/drive/15wa925dj7jvdvrz8_z3vU7btqAFQLVlG#scrollTo=CjZaN5ilgd-z
    provided a guide for pretraining the model
  Seq2Seq pipeline was not suitable for the DialoGPT model
  Could not build the bot with only huggingface; had to turn to the following guide
    https://colab.research.google.com/drive/15wa925dj7jvdvrz8_z3vU7btqAFQLVlG#scrollTo=CjZaN5ilgd-z
  After hours of training on colab I had only a few minutes remaining before my runtime disconnected and lost all my progress
    lesson: make sure your model is saved throughout the process
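One way to act on that lesson is to checkpoint every few steps and resume from the last checkpoint. This is a sketch with a placeholder loop and a JSON state file, not the actual training code; train, save_checkpoint and the state layout are all hypothetical. On colab the path would need to live somewhere that survives the runtime (for example a mounted drive), which is an assumption about the setup:

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write to a temp file first, then swap it in, so a crash never leaves half a file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def train(total_steps, save_every, path):
    """Placeholder loop that resumes from the last checkpoint if one exists."""
    state = {"step": 0}
    if os.path.exists(path):
        with open(path) as handle:
            state = json.load(handle)
    for step in range(state["step"] + 1, total_steps + 1):
        state["step"] = step  # a real loop would run one training step here
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(state, path)
    return state
```

With a real model the JSON dump would be replaced by the library's own save call; the part worth keeping is saving on a schedule and being able to pick up where the last run stopped.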
  The way the data was split into train test was dynamic and not static
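Assuming "dynamic" means the rows landed in different sets on each run, a fixed seed is one simple way to make the split static. A sketch with the standard library only; split_rows and its defaults are hypothetical:

```python
import random


def split_rows(rows, test_fraction=0.1, seed=42):
    """Return (train, test); the same seed always yields the same split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

Using a private random.Random(seed) instead of the module-level functions keeps the split stable even if other code draws random numbers first.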
  created a manual pipeline that flows without issues
    still need to improve and standardize the inputs before automating it
  the .ipynb files display all the output data on git, therefore I have to restart git
  learned how to develop a cli with a config file, to then implement one
  need to restructure code to work as a singular bot
    begin by directing user to create config file
      discord bot API key
      discord guild name
      discord user to mimic
      huggingface API key
      model name
      amt of context
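A sketch of what writing that config file could look like with configparser. Only the [general] section name appears in these notes; the [discord] and [huggingface] sections and every key name below are assumptions for illustration:

```python
import configparser
import io


def build_config(values):
    """Group the six init answers into INI sections."""
    # interpolation=None so a '%' inside an API key is stored literally.
    config = configparser.ConfigParser(interpolation=None)
    config["general"] = {"context_length": str(values["context_length"])}
    config["discord"] = {
        "api_key": values["discord_api_key"],
        "guild": values["guild"],
        "target_user": values["target_user"],
    }
    config["huggingface"] = {
        "api_key": values["huggingface_api_key"],
        "model_name": values["model_name"],
    }
    return config


def config_to_text(config):
    """Render the config as the text that would be written to disk."""
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()
```

The cli would prompt for each value, call build_config, and write the text to the config path.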
  create tests and raise errors for mine_data
    adding more init tests
    create mine tests and raise errors
  need to deal with huggingface api key
    use official cli
  set up the default init path to guide to config path
    if default init is config then simply add it to [general]
  need to add a way to extrapolate data by randomizing context
    added data extrapolation option
    finished
  added an option to extrapolate data
  separated mine and preprocess
  added preprocess tests & updated mine tests
  implemented a new form of passing a path to the cli which can reference the config
  at some point i need to flip the guild and session directories to read "../session/guild"
  test to make sure it works on CPU
  ensure saving of models happens in data_path
  upload the following issue at some point
    i cannot delete folders as that will require permissions, so use_temp_dir won't happen
    https://stackoverflow.com/questions/34716996/cant-remove-a-file-which-created-by-tempfile-mkstemp-on-windows
  upload the following issue at some point https://github.com/huggingface/hub-docs/issues/new?assignees=&labels=new-task&template=adding-a-new-task-.md&title=Tracking+integration+for+%5BTASK%5D
    When searching through docs, the "source documentation" is sometimes not suggested.
      example: https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.HfApi.create_repo
  steps to scripting train
    get config file
    eliminate unnecessary portions or at least clean them
    pass api keys all throughout the app
  updated cli init
  clean data
  train
    clean out excess outputs
      fine tuning from
        "Fine tuning model from: https://huggingface.co/{MODEL_FROM}\n"
      initialize the repo
        "Huggingface repo initialized at:\n{}\n"
        or
        "Huggingface repo already exists at:\n{}\n"
      begins with a benchmark test result
        "Initial benchmark test:\n{}\n"
      Begin training
        "Training started at:\n{}\n"
      Iteration
        "Iteration #{}:"
      New benchmark
        "Benchmark after iteration #{}:\n{}\n"
      Upload model begin
        "Uploading model to:\n{}\n"
      Upload model finished
        "Uploading finished, view it at:\n{}\n"
      End training
        "Training ended at:\n{}\n"

    left off logging at "def main"
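The message templates listed above could be kept in one table and emitted through the logging module. A sketch only: the dictionary keys and log_event are hypothetical names, while the template strings are copied from the list:

```python
import logging

LOG_MESSAGES = {
    "fine_tuning_from": "Fine tuning model from: https://huggingface.co/{MODEL_FROM}\n",
    "repo_initialized": "Huggingface repo initialized at:\n{}\n",
    "repo_exists": "Huggingface repo already exists at:\n{}\n",
    "initial_benchmark": "Initial benchmark test:\n{}\n",
    "training_started": "Training started at:\n{}\n",
    "iteration": "Iteration #{}:",
    "benchmark": "Benchmark after iteration #{}:\n{}\n",
    "upload_started": "Uploading model to:\n{}\n",
    "upload_finished": "Uploading finished, view it at:\n{}\n",
    "training_ended": "Training ended at:\n{}\n",
}


def log_event(logger, key, *args, **kwargs):
    """Fill in the template for `key`, log it at INFO level, and return it."""
    message = LOG_MESSAGES[key].format(*args, **kwargs)
    logger.info(message)
    return message
```

Keeping the templates in one place makes it easier to silence everything else the training loop prints and leave only these lines.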

    ensure continued training works
    
    run tests
  
  the following error could have easily been overcome by hinting "dirs_exist_ok" in shutil
    FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\Users\\1seba\\AppData\\Roaming\\mimicbot\\data\\colab'
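For reference, shutil.copytree accepts dirs_exist_ok=True from Python 3.8 onward, which merges into an existing destination instead of raising FileExistsError. A small sketch; copy_into is a hypothetical wrapper and the demo uses throwaway temp directories rather than the real data path:

```python
import shutil
import tempfile
from pathlib import Path


def copy_into(src, dst):
    """Copy src into dst, merging with dst if it already exists (Python 3.8+)."""
    return shutil.copytree(src, dst, dirs_exist_ok=True)


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    src, dst = root / "src", root / "dst"
    src.mkdir()
    dst.mkdir()  # destination already exists, as in the error above
    (src / "model.txt").write_text("weights")
    copy_into(src, dst)
    print(sorted(p.name for p in dst.iterdir()))
```

Files already in the destination are kept unless the source has a file with the same name, in which case the source version overwrites it.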
  it seems as if huggingface restricts each user to a limited number of simultaneous requests
    I was uploading a model while fetching a response from their inference api and I got an internal server error


  CONTINUE IMPLEMENTING "ACTIVATE". LEFT OFF AFTER SETTING UP CLEAN_DF
    had to reorganize the context_window and context_length so that there would always be a context length used for the same purpose (determining context length)
    will need to add more ModelSave to make up for the environment variables
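To pin down what context_length means here, this is a sketch of turning a message list into training rows, where each row is a response followed by its context_length preceding messages. The row layout (response first, then context from newest to oldest) is an assumption, and build_context_rows is a hypothetical name:

```python
def build_context_rows(messages, context_length):
    """messages are ordered oldest first.

    Each row is [response, newest context message, ..., oldest context message].
    """
    rows = []
    for i in range(context_length, len(messages)):
        window = messages[i - context_length:i + 1]
        rows.append(list(reversed(window)))
    return rows
```

With context_length 2 every row has three entries, and a channel with two messages or fewer produces no rows at all.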

  I was using the deprecated VSC extension Python-autopep8 and it replaced entire files

  emojis in channels cause breaks

  initialize mimicbot
    create cli
    have cli run discord.bot
    discord bot should call from users api
    
  create end to end cli command
    create cli
    have cli run mimicbot
    create test

  deal with out of memory issues
  create a mimicbot-deploy repo
  create github ci/cd
    after navigating the errors my commit history has become a mess
  
  create train test
  create activate test
  create e2e test


  maybe modify num train epochs
  configure github actions
[{"url": "https://huggingface.co/SebastianS/mimicbot-MetalSebastian", "context_length": 2, "data_path": "C:\\Projects\\MimicBot\\data\\The Cocky Crew\\MetalSebastian"}, {"url": "https://huggingface.co/SebastianS/mimicbot-MetalSebastian", "context_length": 2, "data_path": "C:\\Users\\1seba\\AppData\\Roaming\\mimicbot\\data\\The Cocky Crew\\MetalSebastian"}, {"url": "https://huggingface.co/SebastianS/mimicbot-MetalTaj", "context_length": 2, "data_path": "C:\\Users\\1seba\\AppData\\Roaming\\mimicbot\\data\\The Cocky Crew\\MetalTaj"}, {"url": "https://huggingface.co/SebastianS/mimicbot-MetalJessie", "context_length": 2, "data_path": "C:\\Users\\1seba\\AppData\\Roaming\\mimicbot\\data\\The Cocky Crew\\MetalJessie"}, {"url": "https://huggingface.co/SebastianS/mimicbot-MetalSam_v2", "context_length": 2, "data_path": "C:\\Users\\1seba\\AppData\\Roaming\\mimicbot\\data\\The Cocky Crew\\MetalSam-v2"}]
tutorial
  cli guide not read
  intents error
    discord.errors.PrivilegedIntentsRequired: Shard ID None is requesting privileged intents that have not been explicitly enabled in the developer portal. It is recommended to go to https://discord.com/developers/applications/ and explicitly enable the privileged intents within your application's page.
  fix low test data error
    Traceback (most recent call last):
    File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "C:\Python310\lib\runpy.py", line 86, in _run_code
      exec(code, run_globals)
    File "C:\Projects\MimicBot\mimicbotwrapper\mimicbot\__main__.py", line 7, in <module>
      main()
    File "C:\Projects\MimicBot\mimicbotwrapper\mimicbot\__main__.py", line 4, in main
      cli.app(prog_name=__app_name__)
    File "C:\Projects\MimicBot\mimicbotwrapper\env\lib\site-packages\typer\main.py", line 214, in __call__
      return get_command(self)(*args, **kwargs)
    File "C:\Projects\MimicBot\mimicbotwrapper\env\lib\site-packages\click\core.py", line 1130, in __call__
      return self.main(*args, **kwargs)
    File "C:\Projects\MimicBot\mimicbotwrapper\env\lib\site-packages\click\core.py", line 1055, in main
      rv = self.invoke(ctx)
    File "C:\Projects\MimicBot\mimicbotwrapper\env\lib\site-packages\click\core.py", line 1657, in invoke
      return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "C:\Projects\MimicBot\mimicbotwrapper\env\lib\site-packages\click\core.py", line 1404, in invoke
      return ctx.invoke(self.callback, **ctx.params)
    File "C:\Projects\MimicBot\mimicbotwrapper\env\lib\site-packages\click\core.py", line 760, in invoke
      return __callback(*args, **kwargs)
    File "C:\Projects\MimicBot\mimicbotwrapper\env\lib\site-packages\typer\main.py", line 500, in wrapper
      return callback(**use_params)  # type: ignore
    File "C:\Projects\MimicBot\mimicbotwrapper\mimicbot\cli.py", line 318, in train_model
      res, error = train.train(session_path)
    File "C:\Projects\MimicBot\mimicbotwrapper\mimicbot\train.py", line 729, in train
      main(trn_df, val_df, args, True)
    File "C:\Projects\MimicBot\mimicbotwrapper\mimicbot\train.py", line 716, in main
      result = evaluate(args, model, tokenizer,
    File "C:\Projects\MimicBot\mimicbotwrapper\mimicbot\train.py", line 587, in evaluate
      eval_loss = eval_loss / nb_eval_steps
    ZeroDivisionError: float division by zero
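The traceback ends at a division by nb_eval_steps, which is zero when the evaluation loop never runs a batch, presumably because there is too little test data. A sketch of a guard that fails with a readable message instead; mean_eval_loss is a hypothetical helper, not the code in train.py:

```python
def mean_eval_loss(total_loss, nb_eval_steps):
    """Average the evaluation loss, with a clear error when nothing was evaluated."""
    if nb_eval_steps == 0:
        raise ValueError(
            "No evaluation batches were run; the test set is probably too small. "
            "Add more messages or lower the batch size."
        )
    return total_loss / nb_eval_steps
```

Checking the size of the test set before training starts would surface the problem even earlier.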
  requirements section
discord-independent-mimic
  create a command for converting data into proper format
    proper format
      contains only content and author_id
      in order from most recent to oldest (top to bottom)
    will also create a members file if one does not exist
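A sketch of that conversion, assuming the raw rows carry a sortable timestamp field and that the output is CSV; both are assumptions, as are the function names. Only the content and author_id columns and the newest-first order come from the notes above:

```python
import csv
import io


def to_template_rows(raw_rows):
    """Keep only content and author_id, ordered from most recent to oldest."""
    # Assumes "timestamp" values sort chronologically (e.g. ISO 8601 strings).
    ordered = sorted(raw_rows, key=lambda row: row["timestamp"], reverse=True)
    return [{"content": r["content"], "author_id": r["author_id"]} for r in ordered]


def rows_to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["content", "author_id"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Any extra columns in the raw data (channel, attachments, and so on) are dropped, so the same preprocess and train steps can run on data that never came from discord.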
  adapt the preprocess command and the train command to use the proper format
  option at init for whether it is an independent bot
    True = path to messages file
    False = continue to discord settings
  step mine data will be skipped if True
  mine will convert raw_data to raw_template_messages

  introduced possibility of context to be pulled from different channel at channel splits 

  will need to change init dramatically
    will change all discord inputs into a callback inside the custom training data
    should i make a new init command
  need to eliminate dependence on guild for path
  need to change target user to be "general" not "discord"
  the messages context may be fed in reverse
  connect the files for mimicbot and mimicbot-deploy to have a single source of truth

  fix enter name for member with id by having the default be the id
  create a pip package for mimicbot-chat
    update mimicbot-deploy
  allow bots to mention
  create a pip package for mimicbot
  fix tests on github actions