Parsing a large TextDict fails on recent python versions #539

Open
NeoLegends opened this issue Sep 3, 2024 · 0 comments · May be fixed by #540
Labels: bug (Something isn't working)

E.g. for LS960 you get:

OverflowError                             Traceback (most recent call last)
File /usr/local/lib/python3.10/dist-packages/sisyphus/task.py:188, in Task.run(self=<Task 'run' job=Job<alias/datasets/LibriSpeech/t...ext/convert/TextDictToTextLinesJob.xMTMuHiJ4xBa>>, task_id=1, resume_job=False, logging_thread=<LoggingThread(Thread-2, started daemon 140001375073856)>)
    186             logging.info("Starting subtask for arg id: %d args: %s" % (arg_id, str(args)))
    187             logging.info("-" * 60)
--> 188             f(*args)
        f = <bound method TextDictToTextLinesJob.run of Job<alias/datasets/LibriSpeech/train_other_960_corpus_text_lines work/i6_core/text/convert/TextDictToTextLinesJob.xMTMuHiJ4xBa>>
        args = []
    189 except sp.CalledProcessError as e:
    190     if e.returncode == 137:
    191         # TODO move this into engine class

File recipe/i6_core/text/convert.py:33, in TextDictToTextLinesJob.run(self=Job<alias/datasets/LibriSpeech/train_other_960_c...text/convert/TextDictToTextLinesJob.xMTMuHiJ4xBa>)
     30 def run(self):
     31     # nan/inf should not be needed, but avoids errors at this point and will print an error below,
     32     # that we don't expect an N-best list here.
---> 33     d = eval(uopen(self.text_dict, "rt").read(), {"nan": float("nan"), "inf": float("inf")})
        {"nan": float("nan"), "inf": float("inf")} = {'nan': nan, 'inf': inf}
        float("nan") = nan
        float("inf") = inf
        self = Job<alias/datasets/LibriSpeech/train_other_960_corpus_text_lines work/i6_core/text/convert/TextDictToTextLinesJob.xMTMuHiJ4xBa>
        self.text_dict = <Path work/i6_core/corpus/convert/CorpusToTextDictJob.JIQTGMdLEmbz/output/text_dictionary.py.gz>
     34     assert isinstance(d, dict)  # seq_tag -> text
     36     with uopen(self.out_text_lines, "wt") as out:

OverflowError: line number table is too long

Working on a fix.
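
For illustration, here is a minimal sketch of one way to avoid the error. The traceback shows `eval` being called on the whole file at once, so the entire dict is compiled as a single expression. The sketch parses one entry at a time instead. It assumes the text dict is written with one `key: value,` entry per line between a `{` line and a `}` line, which is an assumption about the file layout. The helper names `parse_text_dict_lines` and `load_text_dict` are hypothetical, `gzip.open` stands in for `uopen`, and this is not necessarily the approach taken in #540.

```python
import ast
import gzip
from typing import Dict, Iterable


def parse_text_dict_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Parse a text dict entry by entry (hypothetical helper, not part of i6_core).

    Assumes one "key: value," entry per line, with "{" and "}" on lines of their own.
    """
    result: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if line in ("", "{", "}"):
            continue
        # Wrap the single entry in braces so it becomes a tiny dict literal.
        # A trailing comma inside a dict literal is valid Python syntax.
        # Note: ast.literal_eval does not accept nan/inf, unlike the eval call above.
        entry = ast.literal_eval("{" + line + "}")
        result.update(entry)
    return result


def load_text_dict(path: str) -> Dict[str, str]:
    """Read a gzipped text dict file line by line (gzip.open stands in for uopen)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return parse_text_dict_lines(f)


# Usage with in-memory lines instead of a real file:
example = ["{\n", "'seq-0': 'hello world',\n", "'seq-1': 'foo, bar',\n", "}\n"]
d = parse_text_dict_lines(example)
assert isinstance(d, dict)  # seq_tag -> text
```

Each entry is parsed on its own, so no single parse has to cover the whole file. If the file layout differs (for example, entries spanning multiple lines), this sketch would need adjusting.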

NeoLegends added the bug label on Sep 3, 2024