Background/Objectives: Deep learning (DL) has shown strong potential in analyzing food images, but few studies have directly predicted mass, energy, and macronutrient content from images. In addition to the importance of high-quality data, differences in country-specific food composition databases (FCDBs) can hinder model generalization. Methods: We assessed the performance of several standard DL models using four ground truth datasets derived from Nutrition5k—the largest image–nutrition dataset with ~5000 complex US cafeteria dishes. In light of developing an Italian dietary assessment tool, these datasets varied by FCDB alignment (Italian vs. US) and data curation (ingredient–mass correction and frame filtering on the test set). We evaluated combinations of four feature extractors [ResNet-50 (R50), ResNet-101 (R101), InceptionV3 (IncV3), and Vision Transformer-B-16 (ViT-B-16)] with two regression networks (2+1 and 2+2), using IncV3_2+2 as the benchmark. Descriptive statistics (percentages of agreement, unweighted Cohen’s kappa, and Bland–Altman plots) and standard regression metrics were used to compare predicted and ground truth nutritional composition. Dishes mispredicted by ≥7 algorithms were analyzed separately. Results: R50, R101, and ViT-B-16 consistently outperformed the benchmark across all datasets. Specifically, when replacing it with these top algorithms, reductions in median Mean Absolute Percentage Errors were 6.2% for mass, 6.4% for energy, 12.3% for fat, and 33.1% and 40.2% for protein and carbohydrates. Ingredient–mass correction substantially improved prediction metrics (6–42% when considering the top algorithms), while frame filtering had a more limited effect (<3%). Performance was consistently poor across most models for complex salads, chicken-based or eggs-based dishes, and Western-inspired breakfasts. Conclusions: The R101 and ViT-B-16 architectures will be prioritized in future analyses, where ingredient–mass correction and automated frame filtering methods will be considered.

2D Prediction of the Nutritional Composition of Dishes from Food Images: Deep Learning Algorithm Selection and Data Curation Beyond the Nutrition5k Project / R. Bianco, S. Coluccia, M. Marinoni, A. Falcon, F. Fiori, G. Serra, M. Ferraroni, V. Edefonti, M. Parpinel. - In: NUTRIENTS. - ISSN 2072-6643. - 17:13(2025 Jul), pp. 2196.1-2196.23. [10.3390/nu17132196]

2D Prediction of the Nutritional Composition of Dishes from Food Images: Deep Learning Algorithm Selection and Data Curation Beyond the Nutrition5k Project

S. Coluccia;M. Marinoni;F. Fiori;G. Serra;M. Ferraroni;V. Edefonti
Co-ultimo
;
2025

Abstract

Background/Objectives: Deep learning (DL) has shown strong potential in analyzing food images, but few studies have directly predicted mass, energy, and macronutrient content from images. In addition to the importance of high-quality data, differences in country-specific food composition databases (FCDBs) can hinder model generalization. Methods: We assessed the performance of several standard DL models using four ground truth datasets derived from Nutrition5k—the largest image–nutrition dataset with ~5000 complex US cafeteria dishes. In light of developing an Italian dietary assessment tool, these datasets varied by FCDB alignment (Italian vs. US) and data curation (ingredient–mass correction and frame filtering on the test set). We evaluated combinations of four feature extractors [ResNet-50 (R50), ResNet-101 (R101), InceptionV3 (IncV3), and Vision Transformer-B-16 (ViT-B-16)] with two regression networks (2+1 and 2+2), using IncV3_2+2 as the benchmark. Descriptive statistics (percentages of agreement, unweighted Cohen’s kappa, and Bland–Altman plots) and standard regression metrics were used to compare predicted and ground truth nutritional composition. Dishes mispredicted by ≥7 algorithms were analyzed separately. Results: R50, R101, and ViT-B-16 consistently outperformed the benchmark across all datasets. Specifically, when replacing it with these top algorithms, reductions in median Mean Absolute Percentage Errors were 6.2% for mass, 6.4% for energy, 12.3% for fat, and 33.1% and 40.2% for protein and carbohydrates. Ingredient–mass correction substantially improved prediction metrics (6–42% when considering the top algorithms), while frame filtering had a more limited effect (<3%). Performance was consistently poor across most models for complex salads, chicken-based or eggs-based dishes, and Western-inspired breakfasts. Conclusions: The R101 and ViT-B-16 architectures will be prioritized in future analyses, where ingredient–mass correction and automated frame filtering methods will be considered.
2D prediction of nutritional composition; Nutrition5k; deep learning; dietary assessment tools; energy prediction; food image recognition; frame filtering; macronutrients prediction; mass prediction; portion size correction
Settore MEDS-24/A - Statistica medica
   INDACO: Incorporating Nonadditivity and nonlinearity within the Dietary patterns And Cancer risk association: statistics and machine learning to create novel research Opportunities from dietary assessment to cancer prediction
   INDACO
   MINISTERO DELL'UNIVERSITA' E DELLA RICERCA
   20227YCB5P_001
lug-2025
30-giu-2027
Article (author)
File in questo prodotto:
File Dimensione Formato  
nutrients-17-02196-v2.pdf

accesso aperto

Tipologia: Publisher's version/PDF
Licenza: Creative commons
Dimensione 8.9 MB
Formato Adobe PDF
8.9 MB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/1217256
Citazioni
  • ???jsp.display-item.citation.pmc??? 3
  • Scopus 3
  • ???jsp.display-item.citation.isi??? 2
  • OpenAlex 2
social impact