
[Lab 1] The grammar FST is not stochastic #129

Open
lathanasiadis opened this issue Apr 20, 2024 · 4 comments

Comments

@lathanasiadis

lathanasiadis commented Apr 20, 2024

The fstisstochastic check in timit_format_data.sh fails for the FST we build in step 7 of question 4.2. Is this expected, or have we done something wrong?

The same problem also appears for the HCLG graphs for the bigram models.

@georgepar
Contributor

Do you happen to have a relevant log?

@lathanasiadis
Author

Relevant code:

lang=data/lang_bg
gunzip -c $LMDIR/lm_phone_bg.arpa.gz | \
arpa2fst --disambig-symbol=#0 \
    --read-symbol-table=$lang/words.txt - $lang/G.fst
fstisstochastic $lang/G.fst || echo "Fst is not stochastic"

Output:

arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang_bg/words.txt - data/lang_bg/G.fst 
LOG (arpa2fst[5.5.1126~1-8c451]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.1126~1-8c451]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
LOG (arpa2fst[5.5.1126~1-8c451]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
WARNING (arpa2fst[5.5.1126~1-8c451]:ConsumeNGram():arpa-lm-compiler.cc:313) line 52 [-2.82084	<s> <s>] skipped: n-gram has invalid BOS/EOS placement
LOG (arpa2fst[5.5.1126~1-8c451]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 42 to 42
fstisstochastic data/lang_bg/G.fst 
0.00142338 -0.0828663
Fst is not stochastic
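A small deviation like this is characteristic of backoff LMs compiled into an FST: at a history state, the backoff arc overlaps probability mass already covered by the explicit bigram arcs, so the outgoing weights can sum to slightly more than one, which shows up as the negative second number printed by fstisstochastic. A toy sketch of the effect (all numbers below are made up for illustration):

```python
# Sketch: outgoing probability mass at a bigram history state in G.fst.
# The explicit bigram arcs plus the backoff arc exceed 1, which is the
# kind of deviation fstisstochastic flags.
explicit_bigrams = [0.5, 0.4]   # p(w | h) for bigrams seen in training
backoff_weight = 0.15           # backoff arc to the unigram state

total = sum(explicit_bigrams) + backoff_weight
print(total > 1.0)              # True: this state is not stochastic
```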

The build-lm.sh and compile-lm logs contain no warnings.

build-lm.sh:

LOGFILE:/dev/stdout
BIS LOGFILE:/dev/stdout
Temporary directory stat_7807 does not exist
creating stat_7807
Extracting dictionary from training corpus
dict:
loaded 
Splitting dictionary into 3 lists
Dictionary 0: (thr: 15507 , 46521, 0 , 3)
Dictionary 1: (thr: 15334.5 , 30669 , 15852 , 2)
Dictionary 2: (thr: 14730 , 14730 , 15939 , 1)
Extracting n-gram statistics for each word list
Important: dictionary must be ordered according to order of appearance of words in data
used to generate n-gram blocks,  so that sub language model blocks results ordered too
Extracting n-gram statistics for dict.000
Extracting n-gram statistics for dict.001
[codesize 3]
Extracting n-gram statistics for dict.002
$bin/ngt -i="$inpfile" -n=$order -gooout=y -o="$gzip -c > $tmpdir/ngram.${sdict}.gz" -fd="$tmpdir/$sdict" $dictionary $additional_parameters >> $logfile 2>&1
[codesize 3]
[codesize 3]
dict:loaded 
load:prepare initial n-grams to make table consistent
starting to use OOV words [<s>]
dict:loaded 
load:prepare initial n-grams to make table consistent
starting to use OOV words [ay]
+2dicloaded 
load:prepare initial n-grams to make table consistent
starting to use OOV words [<s>]
adding some more n-grams to make table consistent

savetxt in Google format: nGrAm 2 296 ngram
+2
adding some more n-grams to make table consistent

savetxt in Google format: nGrAm 2 282 ngram
adding some more n-grams to make table consistent

savetxt in Google format: nGrAm 2 532 ngram


Estimating language models for each word list
Estimating language models for dict.000
Estimating language models for dict.001
Estimating language models for dict.002
$scr/build-sublm.pl $verbose $prune $prune_thr_str $smoothing "$additional_smoothing_parameters" --size $order --ngrams "$gunzip -c $tmpdir/ngram.${sdict}.gz" -sublm $tmpdir/lm.$sdict $additional_parameters >> $logfile 2>&1
Merging language models into data/local/nist_lm/bigram.ilm.gz
merge-sublm.pl --size 2 --sublm stat_7807/lm.dict --lm data/local/nist_lm/bigram.ilm.gz --backoff 0
Compute total sizes of n-grams
join files stat_7807/lm.dict.000.1gr.gz stat_7807/lm.dict.001.1gr.gz stat_7807/lm.dict.002.1gr.gz
implicitely add <unk> word to counters
n:1 size:43 unk:0
join files stat_7807/lm.dict.000.2gr.gz stat_7807/lm.dict.001.2gr.gz stat_7807/lm.dict.002.2gr.gz
Executing: /usr/bin/gunzip -c stat_7807/lm.dict.000.2gr.gz | grep -v '10000.000' | wc -l > wc7873
Executing: /usr/bin/gunzip -c stat_7807/lm.dict.001.2gr.gz | grep -v '10000.000' | wc -l > wc7873
Executing: /usr/bin/gunzip -c stat_7807/lm.dict.002.2gr.gz | grep -v '10000.000' | wc -l > wc7873
n:2 size:1109 unk:0
Merge all sub LMs
Write LM Header
Writing LM Tables
Level 1
input from: stat_7807/lm.dict.000.1gr.gz stat_7807/lm.dict.001.1gr.gz stat_7807/lm.dict.002.1gr.gz
Level 2
input from: stat_7807/lm.dict.000.2gr.gz stat_7807/lm.dict.001.2gr.gz stat_7807/lm.dict.002.2gr.gz
Executing: /usr/bin/gunzip -c stat_7807/lm.dict.000.2gr.gz | grep -v '10000.000' | gzip -c >> data/local/nist_lm/bigram.ilm.gz
Executing: /usr/bin/gunzip -c stat_7807/lm.dict.001.2gr.gz | grep -v '10000.000' | gzip -c >> data/local/nist_lm/bigram.ilm.gz
Executing: /usr/bin/gunzip -c stat_7807/lm.dict.002.2gr.gz | grep -v '10000.000' | gzip -c >> data/local/nist_lm/bigram.ilm.gz
Cleaning temporary directory stat_7807
Removing temporary directory stat_7807

compile-lm:

inpfile: data/local/nist_lm/bigram.ilm.gz
outfile: /dev/stdout
loading up to the LM level 1000 (if any)
dub: 10000000
OOV code is 42
OOV code is 42
Saving in txt format to /dev/stdout

@georgepar
Contributor

georgepar commented Apr 23, 2024

skipped: n-gram has invalid BOS/EOS placement

According to this, it seems you have not placed the `<s>`/`</s>` markers correctly in the text on which you train the LM.

Each line must have the following format:

<s> This is a line </s>
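A quick way to verify this format is to count the lines that are not wrapped in the markers (a sketch using only standard tools, shown on an inline sample; in the lab you would point it at data/train/lm_train.text instead):

```shell
# Sketch: count lines NOT matching "<s> ... </s>"; 0 means the file is
# well-formed.
cat > sample.txt <<'EOF'
<s> sil sh iy ih z th ih n </s>
sh iy ih z th ih n
EOF
grep -cvE '^<s> .* </s>$' sample.txt   # prints 1 (the second line)
```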

@lathanasiadis
Author

These are the lm_train.text files we use in build-lm.sh, right? We have them in the correct format:

$ cat data/train/lm_train.text | head -n 2
<s> sil sh iy ih z th ih n er dh ae n ay ae m sil </s>
<s> sil b r ay t s ah n sh ay n sh ih m er z aa n dh iy ow sh ah n sil </s>

In addition, we tried removing sil from these files, i.e.

$ cat data/train/lm_train.text | head -n 2
<s> sh iy ih z th ih n er dh ae n ay ae m </s>
<s> b r ay t s ah n sh ay n sh ih m er z aa n dh iy ow sh ah n </s>

The same problem appears, with one state fewer and slightly different numbers:

WARNING (arpa2fst[5.5.1126~1-8c451]:ConsumeNGram():arpa-lm-compiler.cc:313) line 51 [-2.82995	<s> <s>] skipped: n-gram has invalid BOS/EOS placement
LOG (arpa2fst[5.5.1126~1-8c451]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 41 to 41
fstisstochastic data/lang_bg/G.fst 
0.00103338 -0.0851021
Fst is not stochastic
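For reference, the skipped `<s> <s>` bigram can also be filtered out of the ARPA file before running arpa2fst. This is only a workaround sketch, not the course's official fix: arpa2fst already drops the line itself, so this mainly silences the warning while keeping the \data\ header consistent; the small fstisstochastic residue caused by backoff weights remains either way. Demonstrated on a tiny inline fragment (spaces instead of the tabs a real ARPA file uses):

```shell
# Sketch: remove the invalid "<s> <s>" bigram from an ARPA file and
# decrement the 2-gram count in the \data\ header accordingly.
cat > lm.arpa <<'EOF'
\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.5 <s> -0.3
-0.5 a -0.3
-0.5 </s>

\2-grams:
-2.8 <s> <s>
-0.2 <s> a

\end\
EOF
n=$(grep -c '<s> <s>' lm.arpa)
grep -v '<s> <s>' lm.arpa \
  | awk -v n="$n" '/^ngram 2=/ { split($0, p, "="); print "ngram 2=" p[2] - n; next }
                   { print }' \
  > lm.fixed.arpa
grep 'ngram 2=' lm.fixed.arpa   # prints: ngram 2=1
```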
