
OCR: clarification about input and output #20

Open
mrgloom opened this issue Aug 31, 2017 · 3 comments

Comments

@mrgloom

mrgloom commented Aug 31, 2017

I'm trying to solve OCR tasks based on this code.

So what shape should the input to the LSTM have? Suppose we have images of shape [batch_size, height, width, channels]: how should they be reshaped to be used as input? Like [batch_size, width, height*channels], so that width acts as the time dimension?
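That reshape can be sketched with NumPy (shapes here are illustrative, not from the repo):

```python
import numpy as np

batch_size, height, width, channels = 4, 32, 100, 1
images = np.zeros((batch_size, height, width, channels), dtype=np.float32)

# Move width to axis 1 so it acts as the time dimension,
# then fold each column (height * channels) into one feature vector.
seq = images.transpose(0, 2, 1, 3).reshape(batch_size, width, height * channels)

print(seq.shape)  # (4, 100, 32)
```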

What if I want to have variable width? As I understand it, the sequences in a batch should all be the same length (is the common trick just to pad with zeros at the end of each sequence?), or batch_size should be 1.
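The zero-padding trick mentioned above might look like this (hypothetical widths and feature size, just to show the idea; the true sequence lengths are kept separately so the loss can ignore the padding):

```python
import numpy as np

num_features = 32
# Hypothetical variable-width sequences, each [width_i, num_features].
seqs = [np.ones((w, num_features), dtype=np.float32) for w in (50, 80, 100)]
seq_lens = np.array([s.shape[0] for s in seqs])

# Pad every sequence with zeros at the end of the time axis.
max_w = seq_lens.max()
batch = np.zeros((len(seqs), max_w, num_features), dtype=np.float32)
for i, s in enumerate(seqs):
    batch[i, : s.shape[0]] = s

print(batch.shape)   # (3, 100, 32)
print(seq_lens)      # [ 50  80 100]
```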

What if I want to have variable width and height? As I understand it, I need to use convolutional + global average pooling / spatial pyramid pooling layers before the input to the LSTM, so the output blob will be [batch_size, feature_map_height, feature_map_width, feature_map_channels]. How should that blob be reshaped to be used as input to the LSTM? Like [batch_size, feature_map_width, feature_map_height*feature_map_channels]? Or can we reshape it to a single row like [batch_size, feature_map_width*feature_map_height*feature_map_channels]? That would be like a sequence of pixels, and we lose some spatial information; will it work?
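The two reshapes being asked about can be compared side by side (feature-map sizes here are made up for illustration):

```python
import numpy as np

# Hypothetical conv output: [batch, fm_height, fm_width, fm_channels]
feats = np.random.rand(4, 8, 25, 64).astype(np.float32)

# Option 1: keep width as the time axis, fold height into the features.
seq = feats.transpose(0, 2, 1, 3).reshape(4, 25, 8 * 64)

# Option 2: flatten the spatial grid into one long "pixel sequence";
# each timestep is then a single channel vector and the 2-D layout is lost.
flat = feats.reshape(4, 8 * 25, 64)

print(seq.shape)   # (4, 25, 512)
print(flat.shape)  # (4, 200, 64)
```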

Here is the definition of the input, but I'm not sure what [batch_size, max_stepsize, num_features] means in your case:
https://github.com/igormq/ctc_tensorflow_example/blob/master/ctc_tensorflow_example.py#L90

And how does the output of the LSTM depend on the input size and the max sequence length?
https://github.com/igormq/ctc_tensorflow_example/blob/master/ctc_tensorflow_example.py#L110

BTW: here are some examples using 'standard' approaches in Keras + TensorFlow, which I want to complement with RNN examples.
https://github.com/mrgloom/Char-sequence-recognition

@mrgloom
Author

mrgloom commented Aug 31, 2017

@mrgloom
Author

mrgloom commented Sep 1, 2017

Some info is described here, but it's still not very clear to me:

https://stackoverflow.com/questions/38059247/using-tensorflows-connectionist-temporal-classification-ctc-implementation

So, at the input of the RNN we have something like [num_batch, max_time_step, num_features]. We use dynamic_rnn to perform the recurrent calculations given the input, outputting a tensor of shape [num_batch, max_time_step, num_hidden]. After that, we need to do an affine projection at each timestep with weight sharing, so we have to reshape to [num_batch*max_time_step, num_hidden], multiply by a weight matrix of shape [num_hidden, num_classes], add a bias, undo the reshape, and transpose (so we will have [max_time_steps, num_batch, num_classes] for the ctc loss input); this result will be the input of the ctc_loss function.
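The reshape / matmul / transpose pipeline described above can be sketched in NumPy (sizes are illustrative; in the actual code this would be done with TensorFlow ops on the dynamic_rnn output):

```python
import numpy as np

num_batch, max_time_step, num_hidden, num_classes = 2, 7, 16, 11

# Stand-in for the dynamic_rnn output: [num_batch, max_time_step, num_hidden]
rnn_out = np.random.rand(num_batch, max_time_step, num_hidden).astype(np.float32)
W = np.random.rand(num_hidden, num_classes).astype(np.float32)
b = np.random.rand(num_classes).astype(np.float32)

# Affine projection with the same weights shared across all timesteps.
flat = rnn_out.reshape(num_batch * max_time_step, num_hidden)
logits = flat @ W + b

# Undo the reshape, then transpose to the time-major layout
# [max_time_step, num_batch, num_classes] that ctc_loss expects.
logits = logits.reshape(num_batch, max_time_step, num_classes)
logits = logits.transpose(1, 0, 2)

print(logits.shape)  # (7, 2, 11)
```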

@igormq
Owner

igormq commented Mar 26, 2018

Hi @mrgloom, you can use either width or height as your "time dimension". Using the width, you will perform a row-wise scan; otherwise, you will perform a column-wise scan. Also, you can apply conv layers before the LSTM network, followed by a global average pooling, returning a tensor with shape [batch_size, feature_map_height, feature_map_width].
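The two choices of time axis can be sketched as follows (image sizes are made up; each variant just moves a different spatial axis into the time position):

```python
import numpy as np

img = np.zeros((4, 32, 100, 1), dtype=np.float32)  # [batch, H, W, C]

# Width as the time axis: the RNN steps across columns of the image.
by_width = img.transpose(0, 2, 1, 3).reshape(4, 100, 32 * 1)

# Height as the time axis: the RNN steps down the rows instead.
by_height = img.reshape(4, 32, 100 * 1)

print(by_width.shape)   # (4, 100, 32)
print(by_height.shape)  # (4, 32, 100)
```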
