Developing an intelligent system capable of learning discriminative high-level features from high-dimensional data lies at the core of many computer vision (CV) and machine learning (ML) tasks. Scene and human action recognition from videos is an important topic in CV and ML, with applications that include video surveillance, robotics, human-computer interaction, and video retrieval. Several bio-inspired, hand-crafted feature extraction systems have been proposed for processing temporal data. However, recent deep learning techniques have come to dominate CV and ML because of their good performance on large-scale datasets. One of the most widely used deep learning techniques is the convolutional neural network (CNN) and its variations, e.g. ConvNet, 3DCNN, and C3D. The CNN kernel scheme reduces the number of parameters with respect to fully connected neural networks. Recent deep CNNs have more layers and more kernels per layer than early CNNs and, as a consequence, a large number of parameters. In addition, they violate the pyramidal architecture considered plausible for biological neural networks, because the number of filters increases at each higher layer, which makes convergence during training more difficult.

In this dissertation, we address three main questions central to pyramidal structure and deep neural networks: 1) Is it worthwhile to use a pyramidal architecture to build a generalized recognition system? 2) How can the pyramidal neural network (PyraNet) be enhanced to recognize actions and dynamic scenes in videos? 3) What is the impact of imposing a pyramidal structure on a deep CNN?

In the first part of the thesis, we briefly review the work done on action and dynamic scene recognition using traditional computer vision and machine learning approaches. We also give a historical and current overview of pyramidal neural networks and of how deep learning emerged.

In the second part, we introduce a strictly pyramidal deep architecture, 3DPyraNet, for dynamic scene and human action recognition. It is based on the 3DCNN model and the image pyramid concept. We introduce a new 3D weighting scheme that offers a simple connection scheme with lower computational and memory costs and fewer learnable parameters than other neural networks. 3DPyraNet extracts features from both the spatial and temporal dimensions while keeping the biological structure, and can therefore capture the motion information encoded in multiple adjacent frames. The 3DPyraNet model is extended with three modifications: 1) changing the input image size; 2) changing the receptive field and overlap size in the correlation layers; and 3) adding a linear classifier at the end to classify the learned features. The result is a discriminative approach to spatiotemporal feature learning for action and dynamic scene recognition. In combination with a linear SVM classifier, our model outperforms state-of-the-art methods in one-vs-all accuracy on three video benchmark datasets (KTH, Weizmann, and Maryland) and gives competitive accuracy on a fourth dataset (YUPENN).

In the last part of the thesis, we investigate to what extent a CNN can take advantage of the pyramidal structure typical of biological neurons. A generalized statement covering the convolutional layers, from the input up to the fully connected layer, is introduced, which helps in understanding and designing a successful deep network. It reduces ambiguity, the number of parameters, and the model's size on disk without degrading overall accuracy. It also provides a general guideline for modeling a deep architecture by keeping a certain ratio between the number of filters in the starting layers and in the deeper layers. Competitive results are achieved compared with similar well-engineered deeper architectures on four benchmark datasets. The same approach is further applied to person re-identification, where reducing ambiguity in the features increases Rank-1 performance and yields results better than or comparable to state-of-the-art deep models.
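As a rough illustration of why the filter schedule affects the parameter count, the following minimal sketch compares a conventional widening schedule with a pyramidal (narrowing) one. Everything in it is an assumption made for illustration and is not taken from the thesis: the filter counts, the 3x3 kernels, the 4x4 final feature map, the 512-unit fully connected layer, and the helper names `conv_params`, `fc_params` and `total_params`.

```python
def conv_params(in_ch, out_ch, k=3):
    """Weights plus biases of one k x k convolutional layer."""
    return (k * k * in_ch + 1) * out_ch


def fc_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return (in_features + 1) * out_features


def total_params(filters, in_ch=3, k=3, final_map=4, fc_units=512):
    """Parameters of a conv stack followed by one fully connected layer.

    All sizes are hypothetical; `final_map` is the assumed spatial side of
    the last feature map after pooling.
    """
    total = 0
    for out_ch in filters:
        total += conv_params(in_ch, out_ch, k)
        in_ch = out_ch
    # Flattened last feature map feeds the fully connected layer.
    total += fc_params(final_map * final_map * in_ch, fc_units)
    return total


widening = [32, 64, 128, 256]   # filters grow with depth (hypothetical)
pyramidal = [256, 128, 64, 32]  # filters shrink with depth (hypothetical)

print("widening :", total_params(widening))
print("pyramidal:", total_params(pyramidal))
```

In this toy setting the convolutional layers hold a similar number of weights under both schedules; the difference comes almost entirely from the fully connected layer, which sees far fewer input channels when the last convolutional layer is narrow. The architectures and savings actually reported in the thesis may differ.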
A PYRAMIDAL APPROACH FOR DESIGNING DEEP NEURAL NETWORK ARCHITECTURES / I. Ullah; supervisor: A. Petrosino; coordinator: P. Boldi. Università degli Studi di Milano, 27 Feb 2017. 28th cycle, academic year 2015. [10.13130/ullah-ihsan_phd2017-02-27].
A PYRAMIDAL APPROACH FOR DESIGNING DEEP NEURAL NETWORK ARCHITECTURES
I. Ullah
2017
File | Size | Format |
---|---|---|
phd_unimi_R10292.pdf (open access; description: Complete PhD_Thesis_UNIMI_R10292; type: complete doctoral thesis) | 4.61 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.