Abstract:
Automatic scene text detection and recognition can benefit many everyday applications, such as reading signs and labels and assisting visually impaired people. Reading text in scene images is more challenging than reading scanned documents because of factors such as variations in font style and unpredictable lighting conditions. The problem can be decomposed into two sub-problems: text localization and text recognition. The proposed scene text localization method works at the pixel level, combining a new text representation with a fully convolutional neural network. This method can detect text of arbitrary shape without language limitations. Experimental results on standard benchmarks report its accuracy and speed in comparison with existing work. The cropped text instances are then passed to the proposed text recognition algorithm, which consists of four stages: transformation, feature extraction, sequence modeling, and prediction. The recognition method is built on a fully learnable deep learning model combined with multi-level attention, which is inspired by the Thai writing system. The training data are entirely synthetic, generated from various fonts with novel techniques that make the images look plausible. Experimental results on the test dataset show high accuracy and short inference time.