Adding GPT2 Tokenizer for WordTokenizers' Pretrained tokenizers #61

Open
wants to merge 7 commits into
base: master

shikhargoswami

Hello everyone,
This PR adds a GPT2 tokenizer to the pretrained tokenizers in WordTokenizers.jl. It may be useful in future for building an end-to-end pipeline on top of the GPT2 model in Julia.
I have added tests, but suggestions and corrections would be helpful :)

shikhargoswami changed the title from "Adding GPT2 Tokenizer for WordEmbeddings' Pretrained tokenizers" to "Adding GPT2 Tokenizer for WordTokenizers' Pretrained tokenizers" on Mar 17, 2021
Project.toml (review thread resolved)
Manifest.toml (outdated; review thread resolved)
tokens = tokenize("I love julia language", gpt2_tokenizer)
@test ids_from_tokens(tokens, gpt2_tokenizer) == [40, 1842, 474, 43640, 3303]
@test sentence_from_tokens_gpt2(tokens) == "I love julia language"
end
Member

Maybe test a few more edge cases, rather than just the base case?

"""
function load(path; unk_token="<unk>")
"""
function load_sp(path; unk_token="<unk>")
Member

Not sure about the rationale for this change. Why is it needed? @oxinabox or @Ayushk4 should take a look here.

Author

The load function in the GPT2 tokenizer was overriding this function, so I split it into separate functions that can be called by the main load method. I'm not sure whether this approach is optimal for performance.

Member

I think it is fine to use load_sp (better to call it load_spu, for SentencePiece unigram) and load_gpt2, so that we can call them from the main load.
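A minimal sketch of that arrangement, assuming hypothetical marker types (SentencePieceUnigram, GPT2BPE) and stub loaders that only record their arguments. None of these definitions are taken from the PR; the real loaders would parse the vocabulary files at the given path.

```julia
# Hypothetical marker types standing in for the package's model identifiers.
abstract type PretrainedTokenizerKind end
struct SentencePieceUnigram <: PretrainedTokenizerKind end
struct GPT2BPE <: PretrainedTokenizerKind end

# Stub loaders (assumption): they return a NamedTuple describing the request
# instead of reading any files.
load_spu(path; unk_token="<unk>") = (kind=:spu, path=path, unk_token=unk_token)
load_gpt2(path) = (kind=:gpt2, path=path)

# Single public entry point; multiple dispatch picks the specific loader,
# so neither method overrides the other.
load(::SentencePieceUnigram, path; kwargs...) = load_spu(path; kwargs...)
load(::GPT2BPE, path) = load_gpt2(path)

spu = load(SentencePieceUnigram(), "spm.vocab")
gpt2 = load(GPT2BPE(), "vocab.json")
```

Because the two methods differ in the type of their first argument, dispatch is resolved at compile time when the argument type is known, so the extra layer should not add a measurable runtime cost.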

@@ -17,7 +17,7 @@ export poormans_tokenize, punctuation_space_tokenize,
set_tokenizer, set_sentence_splitter,
rev_tokenize, rev_detokenize,
toktok_tokenize
-export ALBERT_V1, ALBERT_V2, load, tokenizer, sentence_from_tokens, ids_from_tokens
+export ALBERT_V1, ALBERT_V2, load, tokenizer, sentence_from_tokens, ids_from_tokens, GPT2, GPT2Tokenizer, tokenize, sentence_from_tokens_gpt2
Member

Are all these exports needed? In particular, I'm afraid that the name GPT2 here will clash with the actual model whenever it is implemented.

Author

Yeah, you're right. There might be a better alternative for this. It is consistent with ALBERT_V1 and works with load(ALBERT_V1), so I did it this way. I just realised there's no need to export GPT2Tokenizer either. I'll correct it.

Member

It would be better to have common APIs for all the statistical tokenizers.
For instance, sentence_from_tokens can be shared between ALBERT and GPT2.
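One way such a shared function could look, as a sketch only: the types SPMSketch and GPT2Sketch and the helper space_marker are hypothetical and not part of the PR. It relies on the fact that SentencePiece marks a preceding space with '▁' (U+2581) while GPT2's byte-level BPE marks it with 'Ġ' (U+0120). Full GPT2 decoding also needs the byte-to-unicode table, which is omitted here, so this handles plain ASCII tokens only.

```julia
# Hypothetical tokenizer types; the real ones would also hold vocabularies.
abstract type AbstractStatTokenizer end
struct SPMSketch <: AbstractStatTokenizer end
struct GPT2Sketch <: AbstractStatTokenizer end

# The only per-tokenizer detail: which character encodes a preceding space.
space_marker(::SPMSketch) = '▁'   # U+2581, used by SentencePiece
space_marker(::GPT2Sketch) = 'Ġ'  # U+0120, used by GPT2 byte-level BPE

# One shared detokenizer for every tokenizer that defines space_marker.
function sentence_from_tokens(tokens::Vector{String}, tk::AbstractStatTokenizer)
    text = replace(join(tokens), space_marker(tk) => ' ')
    return String(lstrip(text))
end

# Illustrative token sequences, not actual model output.
spm_sentence = sentence_from_tokens(["▁I", "▁love", "▁julia"], SPMSketch())
gpt2_sentence = sentence_from_tokens(["I", "Ġlove", "Ġjulia"], GPT2Sketch())
```

With this shape, a separate sentence_from_tokens_gpt2 export would not be needed; the tokenizer argument selects the behaviour.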

Author

@tejasvaidhyadev Can I give every API function the form f(tokens/text, tokenizer), where the tokenizer is spm or gpt2, instead of f(tokenizer, tokens/text)? I feel this might be more intuitive for users.

shikhargoswami
Author

I don't know why this build error occurs on julia_version=1.1. @aviks @Ayushk4 @oxinabox, help needed.

Testing WordTokenizers
 Resolving package versions...
ERROR: Unsatisfiable requirements detected for package MbedTLS [739be429]:
 MbedTLS [739be429] log:
 ├─possible versions are: [0.5.13-0.5.14, 0.6.0-0.6.8, 0.7.0, 1.0.0-1.0.3] or uninstalled
 ├─restricted to versions 1.0.3 by an explicit requirement, leaving only versions 1.0.3
 └─restricted by julia compatibility requirements to versions: [0.5.13-0.5.14, 0.6.0-0.6.8] or uninstalled — no versions left
Stacktrace:
 [1] #propagate_constraints!#61(::Bool, ::Function, ::Pkg.GraphType.Graph, ::Set{Int32}) at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\GraphType.jl:1007
 [2] propagate_constraints! at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\GraphType.jl:948 [inlined]
 [3] #simplify_graph!#121(::Bool, ::Function, ::Pkg.GraphType.Graph, ::Set{Int32}) at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\GraphType.jl:1462
 [4] simplify_graph! at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\GraphType.jl:1462 [inlined] (repeats 2 times)
 [5] resolve_versions!(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}, ::Nothing) at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\Operations.jl:371
 [6] resolve_versions! at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\Operations.jl:315 [inlined]
 [7] #add_or_develop#63(::Array{Base.UUID,1}, ::Symbol, ::Function, ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\Operations.jl:1172
 [8] add_or_develop at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\Operations.jl:1156 [inlined]
 [9] (::getfield(Pkg.Operations, Symbol("##40#44")){Bool,getfield(Pkg.Operations, Symbol("##68#70")){Pkg.Types.Context,getfield(Pkg.Operations, Symbol("##67#69")){Pkg.Types.Context,Cmd}},Pkg.Types.Context,Pkg.Types.PackageSpec,Pkg.Types.Context})(::String) at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\Operations.jl:874
 [10] mktempdir(::getfield(Pkg.Operations, Symbol("##40#44")){Bool,getfield(Pkg.Operations, Symbol("##68#70")){Pkg.Types.Context,getfield(Pkg.Operations, Symbol("##67#69")){Pkg.Types.Context,Cmd}},Pkg.Types.Context,Pkg.Types.PackageSpec,Pkg.Types.Context}, ::String) at .\file.jl:581
 [11] mktempdir at .\file.jl:579 [inlined]
 [12] #with_dependencies_loadable_at_toplevel#38(::Bool, ::Function, ::getfield(Pkg.Operations, Symbol("##68#70")){Pkg.Types.Context,getfield(Pkg.Operations, Symbol("##67#69")){Pkg.Types.Context,Cmd}}, ::Pkg.Types.Context, ::Pkg.Types.PackageSpec) at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\Operations.jl:853
 [13] #with_dependencies_loadable_at_toplevel at .\none:0 [inlined]
 [14] #test#66(::Bool, ::Function, ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\Operations.jl:1319
 [15] #test at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\API.jl:0 [inlined]
 [16] #test#46(::Bool, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\API.jl:198
 [17] #test at .\none:0 [inlined]
[18] #test#45 at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\API.jl:180 [inlined]
 [19] #test at .\none:0 [inlined]
 [20] #test#42 at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\API.jl:177 [inlined]
 [21] (::getfield(Pkg.API, Symbol("#kw##test")))(::NamedTuple{(:coverage,),Tuple{Bool}}, ::typeof(Pkg.API.test)) at .\none:0
 [22] top-level scope at none:0

3 participants