
Loading pretrained model error #5

Open
EyjafjalIa opened this issue Jan 27, 2024 · 7 comments

EyjafjalIa commented Jan 27, 2024

Hey, it's me again! In Step 3 (Train SLTUnet Model), I moved the required files into the two folders referenced in train.sh and ran train.sh. When the code reaches the point of loading the pretrained model, I get the warnings below:

INFO:tensorflow:Trying restore pretrained parameters
WARNING:tensorflow:No Existing Model detected
INFO:tensorflow:Trying restore existing parameters
WARNING:tensorflow:No Existing Model detected

How can I load the pretrained model? Is the pretrained model the one trained in Step 2? Thanks!
This is my train.sh:

data=preprocessed-corpus/
feature=smkd-sign-features/

python3 run.py --mode train --parameters=\
hidden_size=256,embed_size=256,filter_size=4096,\
sep_layer=0,num_encoder_layer=6,num_decoder_layer=6,\
ctc_enable=True,ctc_alpha=0.3,ctc_repeated=True,\
src_bpe_dropout=0.2,tgt_bpe_dropout=0.2,bpe_dropout_stochastic_rate=0.6,\
initializer="uniform_unit_scaling",initializer_gain=0.5,\
dropout=0.3,label_smooth=0.1,attention_dropout=0.3,relu_dropout=0.5,residual_dropout=0.4,\
max_len=256,max_img_len=512,batch_size=80,eval_batch_size=32,\
token_size=1600,batch_or_token='token',beam_size=8,remove_bpe=True,decode_alpha=1.0,\
scope_name="transformer",buffer_size=50000,data_leak_ratio=0.1,\
img_feature_size=1024,img_aug_size=11,\
clip_grad_norm=0.0,\
num_heads=4,\
process_num=2,\
lrate=1.0,\
estop_patience=100,\
warmup_steps=4000,\
epoches=5000,\
update_cycle=16,\
gpus=[0],\
disp_freq=1,\
eval_freq=500,\
sample_freq=100,\
checkpoints=5,\
best_checkpoints=10,\
max_training_steps=30000,\
nthreads=8,\
beta1=0.9,\
beta2=0.998,\
random_seed=1234,\
src_codes="$data/ende.bpe",tgt_codes="$data/ende.bpe",\
src_vocab_file="$data/vocab.zero.drop",\
tgt_vocab_file="$data/vocab.zero.drop",\
img_train_file="$feature/train.h5",\
src_train_file="$data/train.bpe.en.shuf",\
tgt_train_file="$data/train.bpe.de.shuf",\
img_dev_file="$feature/dev.h5",\
src_dev_file="$data/dev.bpe.en",\
tgt_dev_file="$data/dev.bpe.de",\
img_test_file="$feature/test.h5",\
src_test_file="$data/test.bpe.en",\
tgt_test_file="$data/test.bpe.de",\
output_dir="train",\
test_output="",\
shared_source_target_embedding=True,\
bzhangGo (Owner) commented:

Hey, the logging here is a little confusing.

The "pretrained model" in these messages doesn't refer to the pretrained sign embeddings but to a pretrained SLT model, so this is normal and not a problem. More details below:

INFO:tensorflow:Trying restore pretrained parameters
WARNING:tensorflow:No Existing Model detected

This tries to restore a separately pretrained SLT model (e.g. pretrained encoders or decoders), which we never used.

INFO:tensorflow:Trying restore existing parameters
WARNING:tensorflow:No Existing Model detected

This tries to restore parameters from an existing working directory. If your job crashes or gets interrupted, training should recover from that working directory, i.e. output_dir.
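
If you want to double-check whether a recoverable checkpoint exists, something like the following works (a generic TensorFlow check, not part of run.py; 'train' here is just the output_dir from your train.sh):

import tensorflow as tf

# Returns None when no checkpoint has been written yet, which is the
# expected state on a fresh run and matches the warning above.
print(tf.train.latest_checkpoint('train'))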


EyjafjalIa commented Jan 28, 2024

Oh, I'm sorry, loading the pretrained model may not be the real problem. The original error seems to be related to the h5 file:

Traceback (most recent call last):
  File "/data1/wanjiarui/anaconda3/envs/slt/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/data1/wanjiarui/sltunet/utils/queuer.py", line 125, in run
    for data_chunk in self._data_chunk_iterable:
  File "/data1/wanjiarui/sltunet/data.py", line 201, in batcher
    for data in _handle_buffer(buffer):
  File "/data1/wanjiarui/sltunet/data.py", line 184, in _handle_buffer
    x, s, t, m, mask, spar, img_idx = self.to_matrix(batch, train)
  File "/data1/wanjiarui/sltunet/data.py", line 136, in to_matrix
    new_image = self.img_reader[img_key][()]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/data1/wanjiarui/anaconda3/envs/slt/lib/python3.6/site-packages/h5py/_hl/group.py", line 264, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object '5142_8' doesn't exist)"

When I run Step 2, command 4, I get dev.h5, test.h5, train.h5 and train_(0-9).h5 under smkd/features. I then combine the different training features and move dev/test/train.h5 to smkd-sign-features/, which is the path used in train.sh.
Finally, I run train.sh from the repository root sltunet/ with the command below and get the KeyError above. I have run Step 2, command 4 and combined train.h5 twice, and both attempts give the KeyError. What should I do to track down the problem? I'm really confused. Thanks!

sh example/train.sh

bzhangGo (Owner) commented:

This can be checked by inspecting the source train file and the resulting train.h5.

Could you please show a few lines of your train file, and also read train.h5 with h5py and check its keys? There might be a mismatch.
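
For example, a quick check of the keys (assuming the combined file is the one train.sh points to; adjust the path if yours differs):

import h5py

with h5py.File('smkd-sign-features/train.h5', 'r') as f:
    keys = sorted(f.keys())
    print(len(keys), 'datasets in total')
    # The error complains about 5142_8, so list which indices exist for 5142.
    print([k for k in keys if k.startswith('5142_')])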

EyjafjalIa (Author) commented:

After running the command below, I get dev.h5, test.h5, train.h5 and train_(0-9).h5 in sltunet/smkd/features:

python main.py --load-weights avg/average.pt --phase features --device 0 --num-feature-aug 10 --work-dir exp/resnet34 --config baseline.yaml

This is my sign_feature_cmb.py file. Should I combine train.h5 and train_(0-9).h5 into a new h5 file, or only combine train_(0-9).h5? I suspect the line writer = h5py.File('train.h5', 'w') overwrites train.h5, because I run this Python script in the same directory as those h5 files, and in the end I lose some data.

import sys
import glob
import h5py

# Collect the feature files matching the pattern passed on the command line.
files = glob.glob(sys.argv[1])
print(files)

# Opens train.h5 in write mode; if a train.h5 already exists in the current
# directory, it is truncated here.
writer = h5py.File('train.h5', 'w')

for i, f in enumerate(files):
    reader = h5py.File(f, 'r')
    # Copy every dataset, suffixing the key with the file index i,
    # e.g. key 5142 from the file at index 8 becomes 5142_8.
    for key in list(reader.keys()):
        writer.create_dataset("%s_%s" % (key, i), data=reader[key][()])
    reader.close()

writer.close()

bzhangGo (Owner) commented:

Could you please list some keys from your train.h5? For example, 5142_8 is missing according to the error, so could you take a look at which keys for 5142 are actually contained in your training data?

EyjafjalIa (Author) commented:

I have solved this error. It happens when I run sign_feature_cmb.py in the same directory as train.h5 and train_(0-9).h5 (my layout is shown below).
When the script reaches writer = h5py.File('train.h5', 'w'), it opens train.h5 in write mode, which truncates any existing train.h5 in the script's directory before writing the new content.
I changed the line to writer = h5py.File('train123.h5', 'w'). After the script finished, I moved the file to the right path and renamed it to train.h5. A variant that avoids the problem altogether is sketched after the listing.
My directory layout is:

smkd/features
├── dev.h5
├── test.h5
└── train
    ├── sign_feature_cmb.py
    ├── train_0.h5
    ├── train_1.h5
    ├── train_2.h5
    ├── train_3.h5
    ├── train_4.h5
    ├── train_5.h5
    ├── train_6.h5
    ├── train_7.h5
    ├── train_8.h5
    ├── train_9.h5
    └── train.h5

1 directory, 12 files
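
For reference, here is a minimal variant of the combine script that sidesteps the issue by taking an explicit output path as a second argument (this two-argument interface is my own addition, not the repository's script; pass an output path that the input pattern cannot match):

import sys
import glob
import h5py

# Hypothetical usage: python sign_feature_cmb.py '<input pattern>' <output path>
files = sorted(glob.glob(sys.argv[1]))   # sort for a stable, reproducible file index
out_path = sys.argv[2]                   # write somewhere the input glob cannot match
print(files)

with h5py.File(out_path, 'w') as writer:
    for i, f in enumerate(files):
        with h5py.File(f, 'r') as reader:
            for key in reader.keys():
                # Same key scheme as the original script: <key>_<file index>
                writer.create_dataset("%s_%s" % (key, i), data=reader[key][()])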

EyjafjalIa (Author) commented:

When I follow the instruction below from sltunet/example, I can't get a correct combined train.h5, because after running Step 2 (extract sign features) my directory looks like the listing below.

python sign_feature_cmb.py train\*h5 

Directory after extraction:

smkd/features
├── dev.h5
├── test.h5
├── train_0.h5
├── train_1.h5
├── train_2.h5
├── train_3.h5
├── train_4.h5
├── train_5.h5
├── train_6.h5
├── train_7.h5
├── train_8.h5
├── train_9.h5
└── train.h5
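
The root cause can be seen with a one-line check: inside the script, glob expands train*h5, and that pattern also matches train.h5 itself when one is already present, so the combine step opens one of its own inputs in write mode and truncates it. A quick way to verify (run in smkd/features; the path is taken from the listing above):

import glob

# 'train*h5' matches train.h5 as well as train_0.h5 ... train_9.h5.
print(sorted(glob.glob('train*h5')))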
