Dimension Decoupling Vision-Language Transformer for industrial container marking and natural scene text spotting / Z. Liang, Y. Xu, Y. Zhai, H. Zhu, J. Xi, P. Coscia, A. Genovese. - In: JOURNAL OF INDUSTRIAL INFORMATION INTEGRATION. - ISSN 2452-414X. - 48:(2025 Nov), pp. 100926.1-100926.14. [10.1016/j.jii.2025.100926]
Dimension Decoupling Vision-Language Transformer for industrial container marking and natural scene text spotting
P. Coscia; A. Genovese
2025
Abstract
The exceptional performance of Deep Learning has significantly advanced the application of text spotting across various downstream tasks, such as electronic document recognition, traffic sign recognition, and bill number recognition. A challenging and crucial application is Container Marking Text Spotting (CMTS), which aims to swiftly capture logistics information from container surfaces and enhance the overall operational efficiency of logistics systems. Unlike extensively studied natural scene text, text on container markings typically consists of contextless strings (such as “45G1” or “CWFU 1810810”), presenting a unique spotting challenge. In addition, the presence of vertical text and the prevalence of non-end-to-end models in this field further limit container text spotting performance. Overall, owing to the limited research on container text spotting, current models perform unsatisfactorily, which hinders the intelligence and informatization of the container industry. There is therefore an urgent need for a high-performance, easy-to-deploy method to improve the spotting accuracy of container-surface text; such a method would not only reduce the cost of obtaining container information but also raise the overall level of intelligence in the industry. In this paper, we propose a Dimension Decoupling Vision-Language Transformer (DVLT) to achieve high performance on CMTS tasks. To address the challenge of contextless text, our approach incorporates a Semantic Augmentation Module that leverages prior knowledge without adding computational overhead during inference. Additionally, we introduce center-line proposals to enhance the model's adaptability to vertical text. Finally, DVLT improves the model's comprehensive text spotting capabilities through a novel Dimension Decoupling Decoder.
DVLT is a fully end-to-end text spotting transformer that achieves state-of-the-art results on the CMTS task (dataset publicly available) and competitive results on well-known benchmarks such as CTW1500, ICDAR 2015, and Total-Text. The code and dataset are available at: https://github.com/yikuizhai/DVLT

| File | Type | License | Access | Size | Format |
|---|---|---|---|---|---|
| jii25.pdf | Post-print (accepted manuscript) | Creative Commons | Embargo until 20/08/2026 | 6.91 MB | Adobe PDF |
| Dimension Decoupling Vision-Language Transformer for industrial container marking and natural scene text spotting (final).pdf | Publisher's version/PDF | No license | Restricted access | 4.4 MB | Adobe PDF |