What enables Stochastic Gradient Descent (SGD) to achieve better generalization than Gradient Descent (GD) in Neural Network training? This question has attracted much attention. In this paper, we study the distribution of the Stochastic Gradient Noise (SGN) vectors during the training. We observe that for batch sizes 256 and above, the distribution is best described as Gaussian at-least in the early phases of training. This holds across data-sets, architectures, and other choices.
It is well-known that overparametrized neural networks trained using gradient-based methods quickly achieve small training error with appropriate hyperparameter settings. Recent papers have proved this statement theoretically for highly overparametrized networks under reasonable assumptions. These results either assume that the activation function is ReLU or they crucially depend on the minimum eigenvalue of a certain Gram matrix depending on the data, random initialization and the activation function. In the later case, existing works only prove that this minimum eigenvalue is non-zero and do not provide quantitative bounds. On the empirical side, a contemporary line of investigations has proposed a number of alternative activation functions which tend to perform better than ReLU at least in some settings but no clear understanding has emerged. This state of affairs underscores the importance of theoretically understanding the impact of activation functions on training. In the present paper, we provide theoretical results about the effect of activation function on the training of highly overparametrized 2-layer neural networks. A crucial property that governs the performance of an activation is whether or not it is smooth. For non-smooth activations such as ReLU, SELU and ELU, all eigenvalues of the associated Gram matrix are large under minimal assumptions on the data. For smooth activations such as tanh, swish and polynomial, the situation is more complex. If the subspace spanned by the data has small dimension then the minimum eigenvalue of the Gram matrix can be small leading to slow training. But if the dimension is large and the data satisfies another mild condition, then the eigenvalues are large. If we allow deep networks, then the small data dimension is not a limitation provided that the depth is sufficient. We discuss a number of extensions and applications of these results.
We present an unsupervised method to generate Word2Sense word embeddings that are interpretable — each dimension of the embedding space corresponds to a fine-grained sense, and the non-negative value of the embedding along the j-th dimension represents the relevance of the j-th sense to the word. The underlying LDA-based generative model can be extended to refine the representation of a polysemous word in a short context, allowing us to use the embeddings in contextual tasks. On computational NLP tasks, Word2Sense embeddings compare well with other word embeddings generated by unsupervised methods. Across tasks such as word similarity, entailment, sense induction, and contextual interpretation, Word2Sense is competitive with the state-of-the-art method for that task. Word2Sense embeddings are at least as sparse and fast to compute as prior art.
In this project, we aimed to solve the multi manifold problem of GANs that use IPM metrics as loss function. One of my approaches was to learn tangent space at each point by local PCA and match them using the fact that points in generated manifold and original manifold that are closer should have similar tangent planes. Another approach was based on boosting to increase the weights of generated points not present in original manifold and construct a weighted MMD formulation using those weights. In low dimensional data with multiple independent clusters, IPM GANs give interconnected clusters as output, while weighted MMD has been successful to separate them.
In this paper, we develop a content-cum-user based deep learning framework DeepTagRec to recommend appropriate question tags on Stack Overflow. The proposed system learns the content representation from question title and body. Subsequently, the learnt representation from heterogeneous relationship between user and tags is fused with the content representation for the final tag prediction. On a very large-scale dataset comprising half a million question posts, DeepTagRec beats all the baselines; in particular, it significantly outperforms the best performing baseline TagCombine achieving an overall gain of 60.8% and 36.8% in precision@3 and recall@10 respectively. DeepTagRec also achieves 63% and 33.14% maximum improvement in exact-k accuracy and top-k accuracy respectively over TagCombine.
We conduct mathematical analysis on the effect of batch normalization (BN) on gradient backpropogation in residual network training, which is believed to play a critical role in addressing the gradient vanishing/explosion problem, in this work. By analyzing the mean and variance behavior of the input and the gradient in the forward and backward passes through the BN and residual branches, respectively, we show that they work together to confine the gradient variance to a certain range across residual blocks in backpropagation. As a result, the gradient vanishing/explosion problem is avoided. We also show the relative importance of batch normalization wrt the residual branches in residual networks.
We studied the effect of illumination on a given 3D voxel set and used ray casting to construct an image of the voxel set on the image plane, given a viewpoint and one or many point sources’ details.
A book’s success/popularity depends on various parameters: extrinsic and intrinsic. In this paper, we study how the book reading characteristics might influence the popularity of a book. Towards this objective, we perform a cross-platform study of Goodreads entities and attempt to establish the connection between various Goodreads entities and the popular books (“Amazon best sellers”). We analyze the collective reading behavior on Goodreads platform and quantify various characteristic features of the Goodreads entities to identify differences between these Amazon best sellers (ABS) and the other non-best-selling books. We then develop a prediction model using the characteristic features to predict if a book shall become a best seller after 1 month (15 days) since its publication. On a balanced set, we are able to achieve a very high average accuracy of 88.72% (85.66%) for the prediction where the other competitive class contains books which are randomly selected from the Goodreads dataset. Our method primarily based on features derived from user posts and genre-related characteristic properties achieves an improvement of 16.4% over the traditional popularity factor (ratings, reviews)-based baseline methods. We also evaluate our model with two more competitive sets of books (a) that are both highly rated and have received a large number of reviews (but are not best sellers) (HRHR) and (b) Goodreads Choice Awards Nominated books which are non-best sellers (GCAN). We are able to achieve quite good results with very high average accuracy of 87.1% as well as high ROC for ABS vs GCAN. For ABS vs HRHR, our model yields a high average accuracy of 86.22%.
I was a part of this project, that has been assigned to IIT Kharagpur and is funded by Ministry of Human Resource Development, India. I developed a Web-Service for extracting file links for Institutional Digital Repositories (IDRs).