
OCR: clarification about input and output #20

Open
mrgloom opened this issue Aug 31, 2017 · 3 comments

Comments

@mrgloom

mrgloom commented Aug 31, 2017

I'm trying to solve OCR tasks based on this code.

So what shape should the input to the LSTM have? Suppose we have images of shape [batch_size, height, width, channels]: how should they be reshaped to be used as input? Like [batch_size, width, height*channels], so that width acts as the time dimension?
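That reshape can be sketched with NumPy (shapes here are illustrative, not from the repo):

```python
import numpy as np

batch_size, height, width, channels = 4, 32, 100, 1
images = np.zeros((batch_size, height, width, channels), dtype=np.float32)

# Move width to axis 1 so it acts as the time dimension,
# then fold each column (height * channels) into one feature vector.
seq = images.transpose(0, 2, 1, 3).reshape(batch_size, width, height * channels)

print(seq.shape)  # (4, 100, 32)
```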

What if I want to have variable width? As I understand it, the sequences in a batch should all be the same length (is the common trick just to pad with zeros at the end of each sequence?), or batch_size should be 1.
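The zero-padding trick mentioned above might look like this (hypothetical widths and feature size, just to show the idea; the true sequence lengths are kept separately so the loss can ignore the padding):

```python
import numpy as np

num_features = 32
# Hypothetical variable-width sequences, each [width_i, num_features].
seqs = [np.ones((w, num_features), dtype=np.float32) for w in (50, 80, 100)]
seq_lens = np.array([s.shape[0] for s in seqs])

# Pad every sequence with zeros at the end of the time axis.
max_w = seq_lens.max()
batch = np.zeros((len(seqs), max_w, num_features), dtype=np.float32)
for i, s in enumerate(seqs):
    batch[i, : s.shape[0]] = s

print(batch.shape)   # (3, 100, 32)
print(seq_lens)      # [ 50  80 100]
```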

What if I want to have variable width and height? As I understand it, I need to use convolutional + global average pooling / spatial pyramid pooling layers before the input to the LSTM, so the output blob will be [batch_size, feature_map_height, feature_map_width, feature_map_channels]. How should that blob be reshaped to be used as input to the LSTM? Like [batch_size, feature_map_width, feature_map_height*feature_map_channels]? Or can we reshape it to a single row like [batch_size, feature_map_width*feature_map_height*feature_map_channels]? That would be like a sequence of pixels, and we lose some spatial information; will it work?
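The two reshapes being asked about can be compared side by side (feature-map sizes here are made up for illustration):

```python
import numpy as np

# Hypothetical conv output: [batch, fm_height, fm_width, fm_channels]
feats = np.random.rand(4, 8, 25, 64).astype(np.float32)

# Option 1: keep width as the time axis, fold height into the features.
seq = feats.transpose(0, 2, 1, 3).reshape(4, 25, 8 * 64)

# Option 2: flatten the spatial grid into one long "pixel sequence";
# each timestep is then a single channel vector and the 2-D layout is lost.
flat = feats.reshape(4, 8 * 25, 64)

print(seq.shape)   # (4, 25, 512)
print(flat.shape)  # (4, 200, 64)
```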

Here is the definition of the input, but I'm not sure what [batch_size, max_stepsize, num_features] means in your case:
https://github.com/igormq/ctc_tensorflow_example/blob/master/ctc_tensorflow_example.py#L90

And how does the output of the LSTM depend on the input size and the max sequence length?
https://github.com/igormq/ctc_tensorflow_example/blob/master/ctc_tensorflow_example.py#L110

BTW: here are some examples using 'standard' approaches in Keras + TensorFlow, which I want to complement with RNN examples.
https://github.com/mrgloom/Char-sequence-recognition

@mrgloom
Author

mrgloom commented Aug 31, 2017

@mrgloom
Author

mrgloom commented Sep 1, 2017

Some info is described here, but it's still not very clear to me:

https://stackoverflow.com/questions/38059247/using-tensorflows-connectionist-temporal-classification-ctc-implementation

So, at the input of the RNN we have something like [num_batch, max_time_step, num_features]. We use dynamic_rnn to perform the recurrent calculations given the input, outputting a tensor of shape [num_batch, max_time_step, num_hidden]. After that, we need to do an affine projection at each timestep with weight sharing, so we have to reshape to [num_batch*max_time_step, num_hidden], multiply by a weight matrix of shape [num_hidden, num_classes], add a bias, undo the reshape, and transpose (so we will have [max_time_steps, num_batch, num_classes] for the ctc loss input); this result will be the input of the ctc_loss function.
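The reshape / matmul / transpose pipeline described above can be sketched in NumPy (sizes are illustrative; in the actual code this would be done with TensorFlow ops on the dynamic_rnn output):

```python
import numpy as np

num_batch, max_time_step, num_hidden, num_classes = 2, 7, 16, 11

# Stand-in for the dynamic_rnn output: [num_batch, max_time_step, num_hidden]
rnn_out = np.random.rand(num_batch, max_time_step, num_hidden).astype(np.float32)
W = np.random.rand(num_hidden, num_classes).astype(np.float32)
b = np.random.rand(num_classes).astype(np.float32)

# Affine projection with the same weights shared across all timesteps.
flat = rnn_out.reshape(num_batch * max_time_step, num_hidden)
logits = flat @ W + b

# Undo the reshape, then transpose to the time-major layout
# [max_time_step, num_batch, num_classes] that ctc_loss expects.
logits = logits.reshape(num_batch, max_time_step, num_classes)
logits = logits.transpose(1, 0, 2)

print(logits.shape)  # (7, 2, 11)
```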

@igormq
Owner

igormq commented Mar 26, 2018

Hi @mrgloom, you can use either width or height as your "time dimension". Using the width, you will perform a row-wise scan; otherwise, you will perform a column-wise scan. Also, you can apply conv layers before the LSTM network, followed by a global average pooling, returning a tensor with shape [batch_size, feature_map_height, feature_map_width].
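The two choices of time axis can be sketched as follows (image sizes are made up; each variant just moves a different spatial axis into the time position):

```python
import numpy as np

img = np.zeros((4, 32, 100, 1), dtype=np.float32)  # [batch, H, W, C]

# Width as the time axis: the RNN steps across columns of the image.
by_width = img.transpose(0, 2, 1, 3).reshape(4, 100, 32 * 1)

# Height as the time axis: the RNN steps down the rows instead.
by_height = img.reshape(4, 32, 100 * 1)

print(by_width.shape)   # (4, 100, 32)
print(by_height.shape)  # (4, 32, 100)
```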
