
[Lab 1] The grammar FST is not stochastic #129

Open
lathanasiadis opened this issue Apr 20, 2024 · 4 comments

Comments

@lathanasiadis

lathanasiadis commented Apr 20, 2024

The fstisstochastic check in timit_format_data.sh fails for the FST we build in step 7 of question 4.2. Is this expected, or have we done something wrong?

The same problem also appears for the HCLG graphs for the bigram models.

@georgepar
Contributor

Do you happen to have a relevant log?

@lathanasiadis
Author

Relevant code:

lang=data/lang_bg
gunzip -c $LMDIR/lm_phone_bg.arpa.gz | \
arpa2fst --disambig-symbol=#0 \
    --read-symbol-table=$lang/words.txt - $lang/G.fst
fstisstochastic $lang/G.fst || echo "Fst is not stochastic"

Output:

arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang_bg/words.txt - data/lang_bg/G.fst 
LOG (arpa2fst[5.5.1126~1-8c451]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.1126~1-8c451]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
LOG (arpa2fst[5.5.1126~1-8c451]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
WARNING (arpa2fst[5.5.1126~1-8c451]:ConsumeNGram():arpa-lm-compiler.cc:313) line 52 [-2.82084	<s> <s>] skipped: n-gram has invalid BOS/EOS placement
LOG (arpa2fst[5.5.1126~1-8c451]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 42 to 42
fstisstochastic data/lang_bg/G.fst 
0.00142338 -0.0828663
Fst is not stochastic
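A small deviation like this is characteristic of backoff LMs compiled into an FST: at a history state, the backoff arc overlaps probability mass already covered by the explicit bigram arcs, so the outgoing weights can sum to slightly more than one, which shows up as the negative second number printed by fstisstochastic. A toy sketch of the effect (all numbers below are made up for illustration):

```python
# Sketch: outgoing probability mass at a bigram history state in G.fst.
# The explicit bigram arcs plus the backoff arc exceed 1, which is the
# kind of deviation fstisstochastic flags.
explicit_bigrams = [0.5, 0.4]   # p(w | h) for bigrams seen in training
backoff_weight = 0.15           # backoff arc to the unigram state

total = sum(explicit_bigrams) + backoff_weight
print(total > 1.0)              # True: this state is not stochastic
```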

The build-lm.sh and compile-lm logs contain no warnings.

build-lm.sh:

LOGFILE:/dev/stdout
BIS LOGFILE:/dev/stdout
Temporary directory stat_7807 does not exist
creating stat_7807
Extracting dictionary from training corpus
dict:
loaded 
Splitting dictionary into 3 lists
Dictionary 0: (thr: 15507 , 46521, 0 , 3)
Dictionary 1: (thr: 15334.5 , 30669 , 15852 , 2)
Dictionary 2: (thr: 14730 , 14730 , 15939 , 1)
Extracting n-gram statistics for each word list
Important: dictionary must be ordered according to order of appearance of words in data
used to generate n-gram blocks,  so that sub language model blocks results ordered too
Extracting n-gram statistics for dict.000
Extracting n-gram statistics for dict.001
[codesize 3]
Extracting n-gram statistics for dict.002
$bin/ngt -i="$inpfile" -n=$order -gooout=y -o="$gzip -c > $tmpdir/ngram.${sdict}.gz" -fd="$tmpdir/$sdict" $dictionary $additional_parameters >> $logfile 2>&1
[codesize 3]
[codesize 3]
dict:loaded 
load:prepare initial n-grams to make table consistent
starting to use OOV words [<s>]
dict:loaded 
load:prepare initial n-grams to make table consistent
starting to use OOV words [ay]
+2dicloaded 
load:prepare initial n-grams to make table consistent
starting to use OOV words [<s>]
adding some more n-grams to make table consistent

savetxt in Google format: nGrAm 2 296 ngram
+2
adding some more n-grams to make table consistent

savetxt in Google format: nGrAm 2 282 ngram
adding some more n-grams to make table consistent

savetxt in Google format: nGrAm 2 532 ngram


Estimating language models for each word list
Estimating language models for dict.000
Estimating language models for dict.001
Estimating language models for dict.002
$scr/build-sublm.pl $verbose $prune $prune_thr_str $smoothing "$additional_smoothing_parameters" --size $order --ngrams "$gunzip -c $tmpdir/ngram.${sdict}.gz" -sublm $tmpdir/lm.$sdict $additional_parameters >> $logfile 2>&1
Merging language models into data/local/nist_lm/bigram.ilm.gz
merge-sublm.pl --size 2 --sublm stat_7807/lm.dict --lm data/local/nist_lm/bigram.ilm.gz --backoff 0
Compute total sizes of n-grams
join files stat_7807/lm.dict.000.1gr.gz stat_7807/lm.dict.001.1gr.gz stat_7807/lm.dict.002.1gr.gz
implicitely add <unk> word to counters
n:1 size:43 unk:0
join files stat_7807/lm.dict.000.2gr.gz stat_7807/lm.dict.001.2gr.gz stat_7807/lm.dict.002.2gr.gz
Executing: /usr/bin/gunzip -c stat_7807/lm.dict.000.2gr.gz | grep -v '10000.000' | wc -l > wc7873
Executing: /usr/bin/gunzip -c stat_7807/lm.dict.001.2gr.gz | grep -v '10000.000' | wc -l > wc7873
Executing: /usr/bin/gunzip -c stat_7807/lm.dict.002.2gr.gz | grep -v '10000.000' | wc -l > wc7873
n:2 size:1109 unk:0
Merge all sub LMs
Write LM Header
Writing LM Tables
Level 1
input from: stat_7807/lm.dict.000.1gr.gz stat_7807/lm.dict.001.1gr.gz stat_7807/lm.dict.002.1gr.gz
Level 2
input from: stat_7807/lm.dict.000.2gr.gz stat_7807/lm.dict.001.2gr.gz stat_7807/lm.dict.002.2gr.gz
Executing: /usr/bin/gunzip -c stat_7807/lm.dict.000.2gr.gz | grep -v '10000.000' | gzip -c >> data/local/nist_lm/bigram.ilm.gz
Executing: /usr/bin/gunzip -c stat_7807/lm.dict.001.2gr.gz | grep -v '10000.000' | gzip -c >> data/local/nist_lm/bigram.ilm.gz
Executing: /usr/bin/gunzip -c stat_7807/lm.dict.002.2gr.gz | grep -v '10000.000' | gzip -c >> data/local/nist_lm/bigram.ilm.gz
Cleaning temporary directory stat_7807
Removing temporary directory stat_7807

compile-lm:

inpfile: data/local/nist_lm/bigram.ilm.gz
outfile: /dev/stdout
loading up to the LM level 1000 (if any)
dub: 10000000
OOV code is 42
OOV code is 42
Saving in txt format to /dev/stdout

@georgepar
Contributor

georgepar commented Apr 23, 2024

skipped: n-gram has invalid BOS/EOS placement

According to this, it seems you have not placed the `<s>`/`</s>` markers correctly in the text on which you train the LM.

Each line must have the following format:

<s> This is a line </s>
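A quick way to verify this format is to count the lines that are not wrapped in the markers (a sketch using only standard tools, shown on an inline sample; in the lab you would point it at data/train/lm_train.text instead):

```shell
# Sketch: count lines NOT matching "<s> ... </s>"; 0 means the file is
# well-formed.
cat > sample.txt <<'EOF'
<s> sil sh iy ih z th ih n </s>
sh iy ih z th ih n
EOF
grep -cvE '^<s> .* </s>$' sample.txt   # prints 1 (the second line)
```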

@lathanasiadis
Author

These are the lm_train.text files we use in build-lm.sh, right? We have them in the correct format:

$ cat data/train/lm_train.text | head -n 2
<s> sil sh iy ih z th ih n er dh ae n ay ae m sil </s>
<s> sil b r ay t s ah n sh ay n sh ih m er z aa n dh iy ow sh ah n sil </s>

In addition, we tried removing sil from these files, i.e.

$ cat data/train/lm_train.text | head -n 2
<s> sh iy ih z th ih n er dh ae n ay ae m </s>
<s> b r ay t s ah n sh ay n sh ih m er z aa n dh iy ow sh ah n </s>

The same problem appears, with one state fewer and slightly different numbers:

WARNING (arpa2fst[5.5.1126~1-8c451]:ConsumeNGram():arpa-lm-compiler.cc:313) line 51 [-2.82995	<s> <s>] skipped: n-gram has invalid BOS/EOS placement
LOG (arpa2fst[5.5.1126~1-8c451]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 41 to 41
fstisstochastic data/lang_bg/G.fst 
0.00103338 -0.0851021
Fst is not stochastic
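For reference, the skipped `<s> <s>` bigram can also be filtered out of the ARPA file before running arpa2fst. This is only a workaround sketch, not the course's official fix: arpa2fst already drops the line itself, so this mainly silences the warning while keeping the \data\ header consistent; the small fstisstochastic residue caused by backoff weights remains either way. Demonstrated on a tiny inline fragment (spaces instead of the tabs a real ARPA file uses):

```shell
# Sketch: remove the invalid "<s> <s>" bigram from an ARPA file and
# decrement the 2-gram count in the \data\ header accordingly.
cat > lm.arpa <<'EOF'
\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.5 <s> -0.3
-0.5 a -0.3
-0.5 </s>

\2-grams:
-2.8 <s> <s>
-0.2 <s> a

\end\
EOF
n=$(grep -c '<s> <s>' lm.arpa)
grep -v '<s> <s>' lm.arpa \
  | awk -v n="$n" '/^ngram 2=/ { split($0, p, "="); print "ngram 2=" p[2] - n; next }
                   { print }' \
  > lm.fixed.arpa
grep 'ngram 2=' lm.fixed.arpa   # prints: ngram 2=1
```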
