IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca

We introduce a novel approach to source code representation for use with neural networks. The representation is designed to produce a continuous vector for each code statement. In particular, we show how the representation is produced for Java source code. We test our representation on three tasks: code summarization, statement separation, and code search. We compare against state-of-the-art non-autoregressive and end-to-end models for these tasks. We conclude that all three tasks benefit from the proposed representation, which improves their performance in terms of F1-score, accuracy, and MRR, respectively. Moreover, we show how models trained on code summarization and models trained on statement separation can be combined to address methods with tangled responsibilities, meaning that these models can be used to detect code misconduct.

Fold2Vec: Towards a Statement Based Representation of Code for Code Comprehension / F. Bertolotti, W. Cazzola. - In: ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY. - ISSN 1049-331X. - 2022:(2022), pp. 1-26. [Epub ahead of print] [10.1145/3514232]

Fold2Vec: Towards a Statement Based Representation of Code for Code Comprehension

F. Bertolotti (first author); W. Cazzola (last author)

2022

	Keywords
	
			Machine Learning; Neural Networks; Big Code; Learning Representations; Method Name Suggestion; Intent Identification
		
	Scientific-disciplinary sectors of the article
	
			Sector INF/01 - Informatics
		
	Publication date
	
			2022
		
	Ahead-of-print date or print date
	
			6 Apr 2022
		
	Journal in ANCE
	
			ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY
		
	DOI
	
			https://dx.doi.org/10.1145/3514232
		
	Type
	
			Article (author)

Files in this item:

File	Size	Format
3514232.pdf (open access; type: post-print, accepted manuscript, etc., i.e. the version accepted by the publisher)	521.29 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/922076
