This article is about our recent WACV 2021 paper on neural network compression.
Networks such as Recurrent Neural Networks (RNNs) and their advanced variant, the Long Short-Term Memory (LSTM) network, are specialized for processing sequential data such as text, speech, and video. However, these networks have a large number of parameters and incur significant inference time. The number of hidden states in these networks is a hyper-parameter, and the value typically chosen (256, 512, or 1024) is often much larger than required for accurate prediction, leading to over-parameterization. In action recognition from videos, each input frame consists of stacked RGB color channels and is therefore high-dimensional, which makes the input-to-hidden matrix of an RNN extremely large. For example, a video from the UCF11 dataset has RGB frames of 160×120 pixels each, so the flattened input size is 160×120×3 = 57,600. Even with a relatively small hidden state size of 256, a single-layer LSTM model requires 58.9 million parameters. This naïve, over-parameterized one-layer end-to-end LSTM model is seen to overfit, reaching only 67.7% accuracy on the UCF11 dataset.
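To make the arithmetic concrete, here is a minimal sanity check of that parameter count. This sketch assumes the standard LSTM parameterization used by PyTorch (four gate matrices for both the input-to-hidden and hidden-to-hidden projections, plus biases); the paper does not specify a framework. Note that the 58.9 million figure above closely matches the input-to-hidden matrix alone.

```python
# Sanity check of the LSTM parameter count quoted above (PyTorch assumed).
import torch.nn as nn

input_size = 160 * 120 * 3   # 57,600 flattened RGB pixels per frame
hidden_size = 256

lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size)
total = sum(p.numel() for p in lstm.parameters())

# Input-to-hidden matrix alone: 4 gates x hidden x input
w_ih = 4 * hidden_size * input_size
print(f"input-to-hidden weights: {w_ih:,}")   # 58,982,400 (~58.9M)
print(f"total LSTM parameters:   {total:,}")  # 59,246,592 (~59.2M)
```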
Several tensor decomposition methods [3,4,5] have been applied to RNNs to replace the standard input-to-hidden matrix with a low-rank structure. These methods reshape the input and model the input-to-hidden matrix with dense weight matrices of lower rank. However, most methods for compressing RNNs do not compress the hidden-to-hidden matrix. UCF11, a relatively simple action recognition dataset, has fewer classes and less variation than larger datasets such as UCF101, so only a small number of hidden states relevant to correct action prediction should be needed to model the data representations. The Variational Information Bottleneck (VIB), which builds on the Information Bottleneck theory introduced by Tishby et al., formalizes the idea of retaining only the relevant intermediate data representations in a neural network while preserving prediction accuracy.
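As a rough illustration of why a low-rank structure saves parameters (this is a generic sketch, not the specific tensor decompositions of [3,4,5]; the class name LowRankLinear and the rank r = 64 are my own illustrative choices), the dense input-to-hidden matrix of shape 4H×I can be replaced by a product of two thin matrices:

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Replaces a dense (out x in) matrix W with the product U @ V,
    where U is (out x r) and V is (r x in), for rank r << min(out, in)."""
    def __init__(self, in_features, out_features, rank):
        super().__init__()
        self.V = nn.Linear(in_features, rank, bias=False)  # (r x in)
        self.U = nn.Linear(rank, out_features, bias=True)  # (out x r)

    def forward(self, x):
        return self.U(self.V(x))

# Input-to-hidden projection for all four LSTM gates: 4H x I
I, H, r = 57600, 256, 64
dense_params = 4 * H * I                       # 58,982,400
low_rank = LowRankLinear(I, 4 * H, r)
lr_params = sum(p.numel() for p in low_rank.parameters())
print(f"{dense_params:,} -> {lr_params:,}")    # ~3.75M with r = 64
```

The saving comes entirely from the input side; as noted above, the hidden-to-hidden matrix (here 4H×H) is typically left uncompressed by these methods.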
We adapt this idea to the LSTM network, the more complex variant of the RNN, to remove redundant input features and hidden states and thus reduce the overall parameter count.
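The sketch below illustrates the general VIB-gating recipe under my own simplifying assumptions: Gaussian multiplicative gates on a representation, with a KL-style penalty that drives uninformative dimensions toward zero. The class name VIBGate, the penalty form, and the weight 1e-3 are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class VIBGate(nn.Module):
    """Stochastic multiplicative gate z = mu + sigma * eps applied
    elementwise to a representation (e.g., an LSTM hidden state).
    A KL-style penalty pushes uninformative dimensions toward zero
    so they can be pruned. Hypothetical sketch, not the paper's code."""
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Parameter(torch.ones(dim))
        self.log_sigma = nn.Parameter(torch.full((dim,), -3.0))

    def forward(self, h):
        if self.training:
            eps = torch.randn_like(self.mu)
            z = self.mu + self.log_sigma.exp() * eps
        else:
            z = self.mu                        # deterministic at test time
        return h * z

    def kl_penalty(self):
        # Penalizes the signal-to-noise ratio mu^2 / sigma^2; dimensions
        # with a low ratio carry little information and can be removed
        # after training.
        ratio = self.mu.pow(2) / self.log_sigma.exp().pow(2)
        return 0.5 * torch.log1p(ratio).sum()

# Usage: gate the hidden state, add beta * kl_penalty() to the task loss.
gate = VIBGate(256)
h = torch.randn(8, 256)                        # batch of hidden states
out = gate(h)
loss_extra = 1e-3 * gate.kl_penalty()
```

After training, hidden dimensions whose gates collapse to near zero can be dropped, which shrinks the corresponding rows and columns of the weight matrices and yields a genuinely smaller network.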
Continue reading: https://towardsdatascience.com/a-variational-information-bottleneck-vib-based-method-to-compress-sequential-networks-for-human-b559d3a50e30?source=rss----7f60cf5620c9---4