Below are summaries of the papers and models we read during the research phase. We understand that improving the existing pipeline takes precedence; however, we would also like to explore some of the models summarized here.
For the primary task, we plan to use MaskOCR. The paper presents a pretraining technique for text recognition built on an encoder-decoder transformer: the encoder extracts patch-level representations, and the decoder recognizes the text from those latent representations. The pipeline pretrains the encoder and decoder sequentially: (i) the encoder is pretrained in a self-supervised manner on a large set of unlabeled real text images using a masked image modeling approach, and (ii) the decoder is pretrained on a large set of synthesized text images in a supervised manner, with text image patches randomly masked. The authors' experiments show that MaskOCR achieves superior results on benchmark datasets for both Chinese and English text images.
We suggest this as an improvement over TrOCR because of the difference in the pretraining step and the performance gain it offers. TrOCR initializes its encoder and decoder from pre-trained CV and NLP models respectively, whereas MaskOCR uses the Masked Autoencoder (MAE) approach in its pretraining step. MAE has proven effective for pretraining Vision Transformer models. MaskOCR reports an average accuracy of 93.8% on English scene-text datasets such as IC13, SVT, IIIT5K, IC15, SVTP, and CUTE.
Note: The codebase for MaskOCR is not available yet, but we will attempt to implement it.
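Since the pretraining stage is the part we would have to reimplement ourselves, the following is a minimal sketch of the MAE-style random patch masking that stage (i) relies on. The mask ratio, tensor shapes, and function name are our own illustrative assumptions, not details taken from the MaskOCR paper.

```python
import torch

def random_patch_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking of patch embeddings.

    patches: (batch, num_patches, dim) embeddings of a text-line image.
    Returns the visible patches (fed to the encoder), a binary mask
    (1 = masked) in the original order, and the indices needed to
    restore the full sequence for the reconstruction target.
    """
    batch, num_patches, dim = patches.shape
    num_keep = int(num_patches * (1 - mask_ratio))

    # A random score per patch; sorting it yields a random permutation.
    noise = torch.rand(batch, num_patches, device=patches.device)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    # Keep only the first `num_keep` patches of the shuffled sequence.
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))

    # Binary mask mapped back to the original patch order.
    mask = torch.ones(batch, num_patches, device=patches.device)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore

# Example: 8 text-line images, each split into 64 patches of dimension 384.
visible, mask, ids_restore = random_patch_masking(torch.randn(8, 64, 384))
print(visible.shape)  # torch.Size([8, 16, 384]) at a 0.75 mask ratio
```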
For the secondary task, we have three sub-tasks: text detection, masking of the text, and then training and testing a classification model on the processed images. For text detection we propose TextFuseNet, as it currently gives the best results even in complex scenarios where the text can appear in different shapes. The framework uses multi-level feature fusion to detect and box the regions that contain text, and it achieves an F-measure of 94.3% on ICDAR2013. Once we have formed the boxes around the text regions, we can apply masking and inpainting to make it appear as if the text never existed.
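As a rough illustration of the masking-and-inpainting step, the sketch below assumes axis-aligned boxes from the detector and uses OpenCV's Telea inpainting; the function name, box format, and inpainting radius are placeholder assumptions, and TextFuseNet itself would supply the detected boxes.

```python
import cv2
import numpy as np

def remove_text(image, boxes):
    """Mask detected text regions and inpaint them away.

    image: BGR image as returned by cv2.imread.
    boxes: iterable of (x1, y1, x2, y2) rectangles from a text detector.
    """
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        cv2.rectangle(mask, (x1, y1), (x2, y2), 255, -1)  # filled rectangle
    # Telea inpainting fills the masked pixels from the surrounding texture.
    return cv2.inpaint(image, mask, 3, cv2.INPAINT_TELEA)

# Usage with hypothetical detector output:
# img = cv2.imread("sample.jpg")
# clean = remove_text(img, [(40, 60, 180, 95)])
# cv2.imwrite("sample_no_text.jpg", clean)
```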
For the classification step, we plan to use an image classification model such as ResNet-34 or ViT-G/14, and we propose to train it with the Lion optimizer (the optimizer used to train the BASIC-L model). Lion is more memory efficient than popular optimizers such as Adam and Adafactor, since it only keeps track of the momentum, and it also reaches better accuracy. On vision-language contrastive learning it achieved 88.3% zero-shot and 91.1% fine-tuning accuracy on ImageNet, surpassing the previous best results by 2% and 0.1%, respectively. On diffusion models, Lion outperforms Adam by achieving a better FID score while reducing the training compute by up to 2.3x.
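To make the memory argument concrete, below is a minimal sketch of the Lion update rule for a single parameter tensor, following the published description; the default hyperparameters are only illustrative, and in practice we would use an existing implementation rather than this toy version.

```python
import torch

@torch.no_grad()
def lion_step(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update. Unlike Adam, Lion keeps a single momentum buffer
    per parameter, which is where the memory saving comes from."""
    # Update direction: the sign of an interpolation between momentum and gradient.
    update = (beta1 * momentum + (1 - beta1) * grad).sign()
    # Decoupled weight decay, as in AdamW.
    param.mul_(1 - lr * weight_decay)
    param.add_(update, alpha=-lr)
    # The momentum buffer is updated with a different interpolation factor.
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)
    return param, momentum

# Toy usage on a single weight tensor.
w, g, m = torch.randn(10), torch.randn(10), torch.zeros(10)
lion_step(w, g, m, lr=3e-4, weight_decay=0.1)
```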
Another model we read about, Contrastive Captioner (CoCa), is an image-text encoder-decoder model. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of the decoder layers to encode unimodal text representations, and cascades the remaining decoder layers, which cross-attend to the image encoder, to produce multimodal image-text representations. Notably, on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and a learned classification head, and a new state-of-the-art 91.0% top-1 accuracy with a finetuned encoder.
CoCa is pretrained from scratch in a single stage on both web-scale alt-text data and annotated images by treating all labels simply as texts. CoCa encodes images into latent representations with a neural network encoder, a vision transformer (ViT) by default (other image encoders such as ConvNets are also possible), and decodes texts with a causal masking transformer decoder.
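A minimal sketch of that decoder layout is given below; the layer counts, dimensions, and PyTorch building blocks are our own illustrative assumptions rather than CoCa's actual configuration.

```python
import torch
import torch.nn as nn

class CoCaStyleTextDecoder(nn.Module):
    """First half: causal self-attention only (unimodal text representation).
    Second half: additionally cross-attends to image features (multimodal)."""

    def __init__(self, dim=512, heads=8, unimodal_layers=6, multimodal_layers=6):
        super().__init__()
        self.unimodal = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(unimodal_layers)]
        )
        self.multimodal = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, heads, batch_first=True)
             for _ in range(multimodal_layers)]
        )

    def forward(self, text_tokens, image_tokens):
        seq_len = text_tokens.size(1)
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        ).to(text_tokens.device)

        x = text_tokens
        for layer in self.unimodal:
            x = layer(x, src_mask=causal)          # no cross-attention here
        unimodal_text = x                          # used for the contrastive loss

        for layer in self.multimodal:
            x = layer(x, image_tokens, tgt_mask=causal)  # cross-attends to the image encoder
        return unimodal_text, x                    # x feeds the captioning loss

# Shapes only: 4 captions of 32 tokens, 4 images of 196 patch features.
decoder = CoCaStyleTextDecoder()
unimodal, multimodal = decoder(torch.randn(4, 32, 512), torch.randn(4, 196, 512))
```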
The pretrained, frozen CoCa model is applied to image classification by learning an attentional pooler together with a softmax cross-entropy loss layer on top of the embedding outputs of the CoCa encoder. The attentional pooler and softmax layer are trained with a learning rate of 5 × 10⁻⁴, a batch size of 128, and a cosine learning rate schedule. CoCa is also finetuned on individual image datasets with a smaller learning rate of 1 × 10⁻⁴, resulting in a new state-of-the-art 91.0% top-1 accuracy on ImageNet. CoCa models use far fewer parameters in the visual encoder than other methods, suggesting the proposed framework efficiently combines text training signals and learns higher-quality visual representations than the classical single-encoder approach.
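A minimal sketch of that frozen-encoder probe follows; the module names, token shapes, and the choice of AdamW with a cosine schedule are our own assumptions standing in for CoCa's exact setup.

```python
import torch
import torch.nn as nn

class AttentionalPoolingClassifier(nn.Module):
    """A learned query pools frozen encoder tokens; a linear head gives logits."""

    def __init__(self, dim=1024, heads=8, num_classes=1000):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.pool = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frozen_tokens):              # (batch, num_tokens, dim)
        q = self.query.expand(frozen_tokens.size(0), -1, -1)
        pooled, _ = self.pool(q, frozen_tokens, frozen_tokens)
        return self.head(pooled.squeeze(1))        # (batch, num_classes)

# Training setup roughly following the numbers quoted above.
probe = AttentionalPoolingClassifier()
optimizer = torch.optim.AdamW(probe.parameters(), lr=5e-4)   # assumed optimizer
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randn(128, 256, 1024)   # stand-in for frozen CoCa encoder outputs
loss = loss_fn(probe(tokens), torch.randint(0, 1000, (128,)))
```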
For the tertiary task of instance segmentation, building on the prior work that used Mask R-CNN, we compare candidate methods on their COCO test-dev results. We propose the EVA vision transformer, as it currently performs best on the Papers with Code leaderboard: EVA reaches a mask AP of 55.5 on the COCO test-dev set, while Mask R-CNN reaches 37.1. EVA is a vanilla vision transformer; its architecture follows the shape of ViT and the vision encoder of BEiT-3, and its pretraining uses the CLIP-L/14 vision tower as the target for masked feature reconstruction. Another popular model that could be adapted to this problem is the Swin Transformer (V2); it uses less GPU memory but still performs well, with a mask AP of 54.4 on the COCO test-dev set. Swin Transformer V2 builds on the vision transformer architecture with a few novel approaches, such as residual post-normalization and scaled cosine attention, that improve the stability of large vision models.
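For reference, a minimal single-head sketch of the scaled cosine attention idea mentioned above, with illustrative shapes and a learned temperature clamped from below as described for Swin Transformer V2.

```python
import torch
import torch.nn.functional as F

def scaled_cosine_attention(q, k, v, tau, bias=None):
    """Attention scores are cosine similarities divided by a learned
    temperature tau, instead of dot products scaled by sqrt(d)."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    scores = (q @ k.transpose(-2, -1)) / tau.clamp(min=0.01)
    if bias is not None:                  # e.g. a relative position bias term
        scores = scores + bias
    return F.softmax(scores, dim=-1) @ v

# One head over 49 tokens of dimension 32, with a learnable temperature.
q = k = v = torch.randn(2, 49, 32)
tau = torch.nn.Parameter(torch.tensor(0.1))
out = scaled_cosine_attention(q, k, v, tau)       # (2, 49, 32)
```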