Real-time acoustic detection of critical incidents in smart cities using artificial intelligence and edge networks / I. Saradopoulos, I. Potamitis, S. Ntalampiras, I. Rigakis, C. Manifavas, A. Konstantaras. - In: SENSORS. - ISSN 1424-8220. - 25:8(2025 Apr 20), pp. 2597.1-2597.24. [10.3390/s25082597]
Real-time acoustic detection of critical incidents in smart cities using artificial intelligence and edge networks
S. Ntalampiras
2025
Abstract
We present a system that integrates diverse technologies to achieve real-time, distributed audio surveillance. The system employs a network of microphones mounted on ESP32 platforms, which transmit compressed audio chunks over the MQTT protocol to Raspberry Pi 5 devices for acoustic classification. These devices host an audio transformer model trained on the AudioSet dataset, enabling real-time classification and timestamping of audio events with high accuracy. The transformer's output is stored in an event database and subsequently exported to JSON format, which is then parsed into a graph structure that encapsulates the annotated soundscape, providing a rich and dynamic representation of audio environments. These graphs are traversed and analyzed using dedicated Python code and large language models (LLMs), enabling the system to answer complex queries about the nature, relationships, and context of detected audio events. We introduce a novel graph parsing method that achieves low false-alarm rates. In analyzing the audio of a 1 h 40 min movie featuring hazardous driving practices, our approach achieved an accuracy of 0.882, a precision of 0.8, a recall of 1.0, and an F1 score of 0.89. By combining the robustness of distributed sensing with the precision of transformer-based audio classification, our approach, which treats audio as text, paves the way for advanced applications in acoustic surveillance, environmental monitoring, and beyond.
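To make the JSON-to-graph stage concrete, below is a minimal, illustrative sketch of how timestamped classifier output could be parsed into an event graph and queried for a hazardous-driving sequence. The event schema (id, label, start, end, score), the networkx representation, the 2 s gap threshold, and the helper functions are assumptions introduced for illustration; they do not reproduce the paper's actual graph parsing method.

```python
# Illustrative sketch only: the event schema, field names, and thresholds are
# assumptions, not the published implementation.
import json
import networkx as nx

# Example of the kind of JSON the classification stage might emit (hypothetical schema).
events_json = """
[
  {"id": 0, "label": "Vehicle horn", "start": 12.4, "end": 13.1, "score": 0.91},
  {"id": 1, "label": "Tire squeal",  "start": 13.0, "end": 14.2, "score": 0.88},
  {"id": 2, "label": "Crash",        "start": 14.1, "end": 15.0, "score": 0.95}
]
"""

def build_event_graph(events, max_gap=2.0):
    """Parse timestamped events into a directed graph: nodes are events,
    edges connect events that follow each other within max_gap seconds."""
    g = nx.DiGraph()
    for e in events:
        g.add_node(e["id"], **e)
    ordered = sorted(events, key=lambda e: e["start"])
    for a, b in zip(ordered, ordered[1:]):
        if b["start"] - a["end"] <= max_gap:
            g.add_edge(a["id"], b["id"], gap=round(b["start"] - a["end"], 2))
    return g

def find_sequence(g, labels):
    """Check whether the given label sequence occurs along a path in the graph."""
    def dfs(node, remaining):
        if not remaining:
            return True
        return any(
            g.nodes[nxt]["label"] == remaining[0] and dfs(nxt, remaining[1:])
            for nxt in g.successors(node)
        )
    return any(
        g.nodes[n]["label"] == labels[0] and dfs(n, labels[1:]) for n in g.nodes
    )

events = json.loads(events_json)
graph = build_event_graph(events)
print(find_sequence(graph, ["Vehicle horn", "Tire squeal", "Crash"]))  # True
```

In the same spirit, a graph like this could be serialized back to text and handed to an LLM alongside a natural-language query, which is how the abstract's "audio as text" framing can support open-ended questions about detected events.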
| File | Access | Type | Size | Format |
|---|---|---|---|---|
| sensors-25-02597.pdf | Open access | Publisher's version/PDF | 2.94 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.