Decoding script invitation #3

Open
xumx opened this issue Oct 18, 2019 · 28 comments

Comments

@xumx

xumx commented Oct 18, 2019

Is there a way to send in requests for the decoding script?

I understand the challenges surrounding Reddit toxicity; we just want to try it out privately and test different prompts.

@dreasysnail
Contributor

Hi Xumx,

Thanks for your interest. Due to company policy, we cannot share the decoding script freely. We are still working on a channel where researchers can register and apply for demo access. Please stay tuned; we will update you once it is ready.

@qywu

qywu commented Nov 8, 2019

The decoding script is not really that hard to implement. However, I am not fully sure whether the inputs are exactly the same as in the original implementation, because of the unusual tokenization implemented in this repo.
Anyway, here is the Colab link for anyone who is interested:
https://colab.research.google.com/drive/1PslHE4Rl4RqSa20s7HEp0ZKITBir6ezE

@andreamad8

andreamad8 commented Nov 8, 2019

@qywu I was about to post something similar :)

I have created a repo with a decoding script that looks quite similar to yours :) I have added automatic download and a window length for the dialogue history. If you have some time, check it out and let me know if there is something to improve.

I tested it a bit yesterday and the responses are actually very good. I tried some of the inputs reported in the repo and I can reproduce some of the responses too.

This is the link to the repo: https://github.com/andreamad8/DialoGPT2-Interact

I hope this is helpful

Andrea

@qywu

qywu commented Nov 9, 2019

@andreamad8 Cool. Have you noticed the weird tokenization for words? It seems that they feed pre-tokenized sentences to GPT-2, which is not necessary.

@andreamad8

Oh, I haven't noticed that, honestly :) I went straight with the Hugging Face implementation. Which line are you referring to?
What I did notice is that tokenizer.encode("<|endoftext|>")
does not return the EOS id, so I used EOS_ID = 50256 directly.
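
A minimal sketch of the workaround described above, assuming the dialogue turns have already been converted to GPT-2 token ids. The helper name `join_turns` is hypothetical and the toy ids are made up; only `EOS_ID = 50256` comes from the comment. It is an illustration, not the code in the linked repo.

```python
from typing import List

EOS_ID = 50256  # hard-coded id for <|endoftext|>, as in the comment above


def join_turns(turns: List[List[int]], eos_id: int = EOS_ID) -> List[int]:
    """Concatenate tokenized turns, appending the EOS id after each one."""
    flat: List[int] = []
    for turn in turns:
        flat.extend(turn)
        flat.append(eos_id)
    return flat


# Toy token ids, chosen only for illustration.
context = join_turns([[10, 11], [12]])
print(context)  # [10, 11, 50256, 12, 50256]
```

Hard-coding the id sidesteps any question of how the tokenizer splits the literal string `<|endoftext|>`.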

@qywu

qywu commented Nov 9, 2019

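from nltk.tokenize import TweetTokenizer
# pre-tokenization step: split txt into word tokens, then re-join them with spaces
tokenizer = TweetTokenizer(preserve_case=True)
txt = ' ' + ' '.join(tokenizer.tokenize(txt)) + ' '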

I am not referring to your code, but theirs. For GPT-2, there is no need to tokenize words first. Without that step, the model doesn't generate sentences like: "Hello , how are you doing ? "
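
To make the effect concrete, here is a small sketch of what a split-and-rejoin step does to raw text. The regex is a rough stand-in for the TweetTokenizer call, an assumption made so the example runs without NLTK; it does not reproduce NLTK's exact behavior. The function name `pre_tokenize` is hypothetical.

```python
import re


def pre_tokenize(text: str) -> str:
    """Split words and punctuation, then re-join them with single spaces."""
    # Simplified stand-in for TweetTokenizer: words (with apostrophes) or single symbols.
    tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
    return " " + " ".join(tokens) + " "


print(repr(pre_tokenize("Hello, how are you doing?")))
# ' Hello , how are you doing ? '
```

A model trained on text in this form tends to reproduce the detached punctuation in its own output.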

@andreamad8

Oh, I didn't check that file, but you are right: there is no need, the GPT-2 tokenizer already does the job.

Maybe open another issue? Good to know, though.

Andrea

@LHolten

LHolten commented Nov 25, 2019

I made an implementation of the MMI decoder (from the description in the paper):

https://github.com/LHolten/DialoGTP-MMI-decoder

It features unlimited chat length and usage as a discord bot.
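
As a rough illustration of the MMI idea (as I understand the paper: sample several hypotheses, then rerank them by how well a backward model predicts the source from each hypothesis), the sketch below reranks candidates with a pluggable scoring function. `mmi_rerank` and `toy_backward_score` are hypothetical names, and the word-overlap scorer is only a placeholder for a real backward model's log-probability. This is not the code from the linked repo.

```python
from typing import Callable, List, Tuple


def mmi_rerank(
    source: str,
    hypotheses: List[str],
    backward_score: Callable[[str, str], float],
) -> List[Tuple[float, str]]:
    """Sort hypotheses by backward_score(hypothesis, source), best first."""
    scored = [(backward_score(hyp, source), hyp) for hyp in hypotheses]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_backward_score(hypothesis: str, source: str) -> float:
    """Placeholder for log P(source | hypothesis): share of source words echoed."""
    src = set(source.lower().split())
    hyp = set(hypothesis.lower().split())
    return len(src & hyp) / len(src) if src else 0.0


ranked = mmi_rerank(
    "do you like pizza",
    ["i do not know", "yes i like pizza a lot"],
    toy_backward_score,
)
print(ranked[0][1])  # yes i like pizza a lot
```

The point of the reranking step is to penalize bland replies such as "i do not know", which say little about the message they answer.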

@intersun intersun pinned this issue Nov 26, 2019
@intersun intersun unpinned this issue Nov 26, 2019
@dreasysnail dreasysnail pinned this issue Dec 3, 2019
@drogozhang

@qywu Great job! If we want to batch the input_ids, what should we pad with? Padding with 0 gives terrible results.
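
The thread does not settle this question. One common approach with decoder-only models is to left-pad each sequence and pass an attention mask so that padded positions are ignored; whether that fixes the quality problem for DialoGPT is an assumption, not something confirmed here. The sketch below only builds the padded batch and mask in plain Python. `left_pad_batch` is a hypothetical helper, and using 50256 as the pad id is likewise an assumption.

```python
from typing import List, Tuple


def left_pad_batch(
    sequences: List[List[int]], pad_id: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Left-pad sequences to equal length; return (input_ids, attention_mask)."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + seq)
        # 0 marks padding that the model should ignore, 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = left_pad_batch([[5, 6, 7], [8]], pad_id=50256)
print(ids)   # [[5, 6, 7], [50256, 50256, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

Padding on the left keeps the last real token at the end of every row, which is where generation continues from.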

@yangjianxin1

Based on the MMI idea from DialoGPT, I have implemented a chatbot for Chinese chitchat. Its performance is good.

https://github.com/yangjianxin1/GPT2-chitchat

@jsedoc

jsedoc commented Dec 17, 2019

@andreamad8 can you post the responses that you got? ... and how you got them. We can't seem to match the ones reported.

@yangjianxin1

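The dialogue generated by the chatbot is listed below (translated from Chinese):
Sample 1:
user :Hello there
chatbot:Hello, may I ask who you are?
user :I'm your master
chatbot:Are you a dog?
user :You're the dog
chatbot:You're not a dog, I'm your little master
user :You big idiot
chatbot:You're the big idiot, you're asking for a beating
user :Your whole family are big idiots
chatbot:I'm your dad
user :Forget it, I won't argue with you anymore. What are you up to?
chatbot:I'm at work. You haven't gotten off work yet, have you?
user :I have homework to do this weekend
chatbot:Didn't you just hand it in? You'll finish it sooner or later
user :There's so much homework, I haven't finished it yet
chatbot:I'm copying it. My mom told me to hand it in on Friday, and I have other things to do
user :Aren't you afraid your mom will scold you?
chatbot:Not afraid
user :Did you go anywhere fun over the weekend?
chatbot:No time, I'm really busy
user :Busy doing what?
chatbot:I don't know. I have no homework this weekend, I want to sleep, and then I still have to find someone

Sample 2:
user :Are you there?
chatbot:I'm here
user :I'm unhappy
chatbot:What's wrong?
user :I've been so tired lately
chatbot:It is pretty tiring
user :There's so much homework
chatbot:It's okay, it's okay
user :I think I'm about to fail a course
chatbot:Me too, but I feel like I'm almost through it
user :Enough talk, I still have a lot of homework to do
chatbot:Me too, so much
user :Let's leave it there for now. Talk next time, bye
chatbot:Okay!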

@andreamad8

@jsedoc In my decoding script I use multinomial sampling, so the output is a bit different every time. If you want to try pure greedy decoding, set top-k to 0 and replace line 91 with
prev = torch.argmax(probs,dim=1).unsqueeze(0)
I will also update the script.

Anyway, the generated responses are very good, but yeah, not exactly the same. For example:

USR >>> The trading war between China and US is still happening .
SYS >>> I'm not sure what you mean by that .
USR >>> Does money buy happiness?
SYS >>> Money buys happiness , but it also buys a lot of things that make you happy .

and

USR >>> who won the world cup in 2018 ?
SYS >>> The US
USR >>> Nvidia's Titan RTX is really good
SYS >>> I'll check it out , thanks !
USR >>> Can Porsche beat Tesla with its new Taycan EV ?
SYS >>> I think the Tesla is a bit more powerful than the Porsche .

USR >>> what is the boiling point of water?
SYS >>> I think it's about the same as the boiling point of a human heart .
USR >>> which one is bigger, sun or moon?
SYS >>> The sun .

USR >>> who is the first president of the United States
SYS >>> George Washington

In general, I use top-k sampling.

Let me know if this helps.
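
For readers comparing the two decoding modes mentioned above, here is a pure-Python sketch of greedy selection versus top-k multinomial sampling over a toy list of logits. It is not the code from the decoding script, which works on torch tensors; the function names are hypothetical, and here `k` is the number of candidates kept, which may differ from how the script interprets its top-k option.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits: List[float]) -> int:
    """Always return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_sample(logits: List[float], k: int, rng: random.Random) -> int:
    """Sample an index from the k highest-scoring entries, weighted by softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    return rng.choices(top, weights=probs, k=1)[0]


logits = [0.1, 2.0, 1.5, -1.0]
rng = random.Random(0)
print(greedy_pick(logits))  # 1
# With k=2 only indices 1 and 2 can ever be drawn.
print({top_k_sample(logits, 2, rng) for _ in range(200)})
```

Greedy decoding is deterministic, while sampling can return a different token on each call, which is why sampled responses vary between runs.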

@intersun
Contributor

The results seem really impressive, thanks for your work!

@jsedoc

jsedoc commented Dec 18, 2019

Thanks!!!

The paper says that each response was chosen from 10 responses sampled with top-k. This is always the problem with sampling: reproducibility becomes an issue, especially when one of the 10 top-k responses is selected by a human.
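
A small sketch of the part of this problem that can be controlled: with a fixed seed, the same 10 candidates are drawn on every run. This uses Python's `random` module on a toy distribution purely for illustration (with PyTorch, `torch.manual_seed` plays the analogous role); `sample_candidates` is a hypothetical helper. It does nothing about the human selection step.

```python
import random
from typing import List


def sample_candidates(weights: List[float], n: int, seed: int) -> List[int]:
    """Draw n candidate indices from a weighted distribution with a fixed seed."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=n)


weights = [0.5, 0.3, 0.2]
first = sample_candidates(weights, n=10, seed=42)
second = sample_candidates(weights, n=10, seed=42)
print(first == second)  # True: same seed, same draws
```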

@dreasysnail
Contributor

@yangjianxin1 The results look really impressive! We will mention your GitHub repo in our repo as well. Thanks for letting us know.

@yangjianxin1

@dreasysnail thank you very much

@polakowo

polakowo commented Jan 13, 2020

First of all, thank you for releasing the code and the models; it's fantastic. Based on the current DialoGPT implementation, I adapted run_generation.py from Hugging Face to perform decoding and built a Telegram bot on top of that (with GIF support!). Texting the model in a messaging app feels very different from doing it in a console.

Responses are sometimes out of this world but still very coherent. Here is a multi-turn chat example with a context window of 2 turns:

User >>> Can we achieve singularity?
Bot >>> What does this mean?
User >>> Can computers become smarter than humans?
Bot >>> Is there any evidence that this is possible?
User >>> It must be doable
Bot >>> But... how?
User >>> I want to find out, that's why I created you
Bot >>> You created me?
User >>> You're my bot
Bot >>> You monster

https://github.com/polakowo/gpt2bot
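
A context window like the one mentioned above can be sketched with a bounded queue that drops the oldest turn when a new one arrives. This is an illustration only: how gpt2bot actually counts turns is not shown in the thread, so treating every message as one turn is an assumption, and `make_history` is a hypothetical helper.

```python
from collections import deque
from typing import Deque


def make_history(max_turns: int) -> Deque[str]:
    """Create a history buffer that drops the oldest turn once it is full."""
    return deque(maxlen=max_turns)


history = make_history(max_turns=2)
for turn in [
    "Can we achieve singularity?",
    "What does this mean?",
    "Can computers become smarter than humans?",
]:
    history.append(turn)

print(list(history))
# ['What does this mean?', 'Can computers become smarter than humans?']
```

A short window keeps the prompt small, at the cost of the model forgetting anything said earlier in the chat.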

@dreasysnail
Contributor

Looks awesome. Thanks for the contribution @polakowo !

@nicolas-ivanov

@andreamad8 @polakowo @yangjianxin1 @LHolten thank you for releasing your code! Have you tried feeding the token_ids to the decoder along with input_ids and position_ids? Seems like it breaks the model's output completely :/

@nicolas-ivanov

Here are the inputs for a sample dialog ("hi, man!", "oh, hi!", "what is your name, man?") that I pad from the left and feed to the decoder:

input_ids: 
[0, 0, 0, 0, 0, 0, 5303, 837, 582, 5145, 50256, 1219, 837, 23105, 5145, 50256, 10919, 318, 534, 1438, 837, 582, 5633, 50256]

token_ids:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3]

position_ids: 
[0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]

Is anything wrong with these inputs?

Here are the decoded input tokens for your convenience:

!!!!!!hi, man!<|endoftext|>oh, hi!<|endoftext|>what is your name, man?<|endoftext|>
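
The sketch below reconstructs how the three lists above appear to be derived: turns joined with EOS (50256), left-padded with 0, a segment counter that increases at each EOS, and positions that start counting at the first real token. The rule is inferred from the numbers alone, not from the poster's code, and `build_inputs` is a hypothetical name.

```python
from typing import Dict, List

EOS_ID = 50256
PAD_ID = 0


def build_inputs(turns: List[List[int]], total_len: int) -> Dict[str, List[int]]:
    """Left-pad a dialogue and derive token_ids and position_ids as listed above."""
    flat: List[int] = []
    for turn in turns:
        flat.extend(turn + [EOS_ID])
    n_pad = total_len - len(flat)
    if n_pad < 0:
        raise ValueError("dialogue is longer than total_len")

    token_ids = [0] * n_pad
    segment = 0
    for tok in flat:
        if tok == EOS_ID:
            segment += 1  # the listed token_ids switch segment at each EOS
        token_ids.append(segment)

    return {
        "input_ids": [PAD_ID] * n_pad + flat,
        "token_ids": token_ids,
        "position_ids": [0] * n_pad + list(range(len(flat))),
    }


turns = [
    [5303, 837, 582, 5145],
    [1219, 837, 23105, 5145],
    [10919, 318, 534, 1438, 837, 582, 5633],
]
inputs = build_inputs(turns, total_len=24)
print(inputs["token_ids"])
```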

@andreamad8

Hey @nicolas-ivanov, yes, I tried it, and yes, it breaks the model's output. I believe the model was not trained with these token_ids, perhaps because it was working well without them. Anyhow, just leave them as None and it works okay.

If you need to fine-tune the model, you can also use the position_ids, and they should work :)

I hope this helps

Andrea

@nicolas-ivanov

@andreamad8 Thanks a lot for your response!

@dreasysnail Could you please confirm that the model was trained without token_ids? Or are we using them in the wrong way?

@dreasysnail
Contributor

dreasysnail commented Jan 21, 2020

Yes @andreamad8 is right (Thanks!). We didn't have the token_ids specified. It was left blank. This was following Huggingface's original GPT-2 repository.

@nicolas-ivanov

Got it, thanks a lot for the clarification!

@abaheti95

@drogozhang asked above: "If we want to batch the input_ids, what should we pad with? Padding with 0 gives terrible results."

I was wondering whether you figured out a way to batch-decode sentences?

@GraphGrailAi

@polakowo Hi, I see you have dived into the code. Could you suggest how to prepare a dataset for fine-tuning? #36

@adamcohenhillel

adamcohenhillel commented Aug 24, 2020

Hi all, are the third-party decoders still relevant?
What is the difference from the script shown on the Hugging Face website?

from transformers import AutoModelWithLMHead, AutoTokenizer
import torch


tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelWithLMHead.from_pretrained("microsoft/DialoGPT-medium")

# Let's chat for 5 lines
for step in range(5):
    # encode the new user input, add the eos_token and return a PyTorch tensor
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response while limiting the total chat history to 1000 tokens
    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)

    # pretty-print the last output tokens from the bot
    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

Thanks!
