Dimension Decoupling Vision-Language Transformer for industrial container marking and natural scene text spotting / Z. Liang, Y. Xu, Y. Zhai, H. Zhu, J. Xi, P. Coscia, A. Genovese. - In: JOURNAL OF INDUSTRIAL INFORMATION INTEGRATION. - ISSN 2452-414X. - 48:(2025 Nov), pp. 100926.1-100926.14. [10.1016/j.jii.2025.100926]
Dimension Decoupling Vision-Language Transformer for industrial container marking and natural scene text spotting
P. Coscia; A. Genovese
2025
Abstract
The exceptional performance of Deep Learning has significantly advanced the application of text spotting across various downstream tasks, such as electronic document recognition, traffic sign recognition, and bill number recognition. A challenging and crucial application is Container Marking Text Spotting (CMTS), which aims to swiftly capture logistics information from container surfaces and enhance the overall operational efficiency of logistics systems. Unlike extensively studied natural scene text, text on container markings typically consists of contextless strings (such as “45G1” or “CWFU 1810810”), presenting a unique spotting challenge. In addition, the presence of vertical text and the prevalence of non-end-to-end models in this field further limit container text spotting performance. Overall, owing to the limited research on container text spotting, current models perform unsatisfactorily, which hinders the intelligence and informatization of the container industry. There is therefore an urgent need for a high-performance, easy-to-deploy method to improve the spotting accuracy of container-surface text; such a method would not only reduce the cost of obtaining container information but also raise the overall level of intelligence in the industry. In this paper, we propose a Dimension Decoupling Vision-Language Transformer (DVLT) to achieve high performance on CMTS tasks. To address the challenge of contextless text, our approach incorporates a Semantic Augmentation Module that leverages prior knowledge without adding computational overhead during inference. Additionally, we introduce center-line proposals to enhance the model's adaptability to vertical text. Finally, DVLT improves the model's comprehensive text spotting capabilities through a novel Dimension Decoupling Decoder.
DVLT is a fully end-to-end text spotting transformer that achieves state-of-the-art results on the CMTS task (dataset publicly available) and competitive results on well-known benchmarks such as CTW1500, ICDAR 2015, and Total-Text. The code and dataset are available at: https://github.com/yikuizhai/DVLT

| File | Type | License | Access | Size | Format |
|---|---|---|---|---|---|
| jii25.pdf | Post-print (accepted manuscript) | Creative Commons | Embargo until 20/08/2026 | 6.91 MB | Adobe PDF |
| Dimension Decoupling Vision-Language Transformer for industrial container marking and natural scene text spotting (final).pdf | Publisher's version/PDF | No license | Restricted access | 4.4 MB | Adobe PDF |