# Sitemap

A list of all the posts and pages found on the site. For the robots out there, an XML version is available for digesting as well.

## Misc

This is a page not in the main menu.

## Analysis of Newton’s Method

Posted on:

In optimization, Newton's method is used to find the roots of the derivative of a twice-differentiable function, given oracle access to its gradient and Hessian. By using super-linear memory in the dimension of the ambient space, Newton's method can take advantage of second-order curvature and optimize the objective function at a quadratically convergent rate. Here I consider the case when the objective function is smooth and strongly convex.
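As a minimal illustration of the setup above, here is a sketch of the pure Newton iteration in Python; the quadratic test function and all names are illustrative choices, not part of the post.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Find a root of grad (a stationary point of the objective),
    given oracle access to the gradient and Hessian."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H @ step = g rather than forming the inverse explicitly.
        x = x - np.linalg.solve(hess(x), g)
    return x

# Example: f(x) = 0.5 x^T A x - b^T x with A positive definite,
# so grad f = A x - b and hess f = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = newton(lambda x: A @ x - b, lambda x: A, np.zeros(2))
```

For a quadratic, a single Newton step lands exactly on the minimizer, which is the quadratic convergence in its most extreme form.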

## Deriving the Fokker-Planck equation

Posted on:

In the theory of dynamical systems, the Fokker-Planck equation describes the time evolution of a probability density function. It is a partial differential equation that describes how the density of a stochastic process changes over time under the influence of a potential field. Common applications include the study of Brownian motion, the Ornstein–Uhlenbeck process, and statistical physics. My motivation for understanding the derivation is to study Lévy flight processes, which have caught my recent attention.
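For reference, the standard one-dimensional form of the equation, for an Itô diffusion $dX_t = \mu(X_t, t)\,dt + \sigma(X_t, t)\,dW_t$ with density $p(x,t)$, is:

$$\frac{\partial p(x,t)}{\partial t} = -\frac{\partial}{\partial x}\Big[\mu(x,t)\,p(x,t)\Big] + \frac{1}{2}\frac{\partial^2}{\partial x^2}\Big[\sigma^2(x,t)\,p(x,t)\Big]$$

The drift term transports the density along the potential field, while the diffusion term spreads it out.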

## SGD without replacement

Posted on:

This article is a continuation of my previous blog and discusses the work of Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli (2019). The authors provide tight rates for SGD without replacement for general smooth, and smooth and strongly convex, functions using the method of exchangeable pairs to bound Wasserstein distances, along with techniques from optimal transport.

## Non-asymptotic rate for Random Shuffling for Quadratic functions

Posted on:

This article is a continuation of my previous blog and discusses a section of the work by Jeffery Z. HaoChen and Suvrit Sra (2018), in which the authors derive a non-asymptotic rate of $\mathcal{O}\left(\frac{1}{T^2} + \frac{n^3}{T^3} \right)$ for the Random Shuffling stochastic gradient algorithm, which is strictly better than that of SGD.
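To fix ideas, here is a sketch of the random-shuffling scheme (each epoch visits every component exactly once, in a fresh random order) on a toy least-squares objective; the data, step size, and names are illustrative, not the setting of the paper.

```python
import numpy as np

# Toy finite-sum objective: f(x) = (1/n) sum_i 0.5 (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d = 100, 5
A_data = rng.standard_normal((n, d))
b_data = A_data @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

def component_grad(x, i):
    """Gradient of the i-th component function."""
    return (A_data[i] @ x - b_data[i]) * A_data[i]

def random_shuffling(x0, epochs, lr):
    """One pass per epoch over a fresh without-replacement permutation,
    as opposed to sampling indices i.i.d. with replacement (vanilla SGD)."""
    x = x0.copy()
    for _ in range(epochs):
        for i in rng.permutation(n):
            x -= lr * component_grad(x, i)
    return x

x_rr = random_shuffling(np.zeros(d), epochs=50, lr=0.01)
```

The only difference from with-replacement SGD is the `rng.permutation(n)` inner loop; the analyses discussed in this and the previous post quantify how much that correlation between consecutive steps helps.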

## Bias-Variance Trade-offs for Averaged SGD in Least Mean Squares

Posted on:

This article is on the work of Défossez and Bach (2014), in which the authors develop an operator viewpoint for analyzing averaged SGD updates, exposing the bias-variance trade-off and providing tight convergence rates for the least mean squares problem.
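For concreteness, here is a sketch of averaged SGD (Polyak-Ruppert averaging) on a least-mean-squares stream; the data model, constants, and names below are illustrative and not the exact setup of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
x_true = rng.standard_normal(d)

def averaged_sgd(steps, lr):
    """Constant-step SGD on streaming least mean squares,
    returning the running average of the iterates."""
    x = np.zeros(d)
    x_bar = np.zeros(d)
    for t in range(1, steps + 1):
        a = rng.standard_normal(d)                       # random feature vector
        b_t = a @ x_true + 0.1 * rng.standard_normal()   # noisy target
        x -= lr * (a @ x - b_t) * a                      # SGD step on one sample
        x_bar += (x - x_bar) / t                         # running iterate average
    return x_bar

x_avg = averaged_sgd(steps=20000, lr=0.01)
```

The bias term comes from forgetting the initialization, the variance term from the noise in the targets; the last iterate oscillates in a noise ball while the average concentrates, which is the trade-off the operator viewpoint makes precise.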

## Nesterov’s Acceleration

Posted on:

This post contains an error-vector analysis of Nesterov's accelerated gradient descent method and some insightful implications that can be derived from it.
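For reference, here is a sketch of the constant-momentum form of the method for an $L$-smooth, $\mu$-strongly convex function; the quadratic example and all names are illustrative additions.

```python
import numpy as np

def nesterov_agd(grad, x0, L, mu, iters=200):
    """Nesterov's accelerated gradient descent with the constant
    momentum weight (sqrt(kappa)-1)/(sqrt(kappa)+1), kappa = L/mu."""
    x = y = np.asarray(x0, dtype=float)
    kappa = L / mu
    beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    for _ in range(iters):
        x_next = y - grad(y) / L           # gradient step from the lookahead point
        y = x_next + beta * (x_next - x)   # momentum extrapolation
        x = x_next
    return x

# Example: f(x) = 0.5 x^T A x - b^T x, where L and mu are the
# largest and smallest eigenvalues of A.
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
x_min = nesterov_agd(lambda x: A @ x - b, np.zeros(2), L=10.0, mu=1.0)
```

Tracking the error vector $(x_t - x^*, x_{t-1} - x^*)$ through the two update lines above is exactly the kind of analysis the post carries out.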

Posted on:

With the number of courses, books, and reading materials out there, here is a list of those I personally find useful for building a fundamental understanding of Machine Learning.

## A survey on Large Scale Optimization

Posted on:

This post contains a summary and survey of the theoretical understanding of large-scale optimization, drawing on talks, papers, and lectures that I have come across recently.

## National Digital Library project

Posted on:

I was a part of this project, which was assigned to IIT Kharagpur and is funded by the Ministry of Human Resource Development, India. I developed a web service for extracting file links from Institutional Digital Repositories (IDRs).

## Analyzing Social Book Reading Behavior on Goodreads and How It Predicts Amazon Best Sellers

Posted on:

A book’s success/popularity depends on various parameters, both extrinsic and intrinsic. In this paper, we study how book reading characteristics might influence the popularity of a book. Towards this objective, we perform a cross-platform study of Goodreads entities and attempt to establish the connection between various Goodreads entities and the popular books (“Amazon best sellers”). We analyze the collective reading behavior on the Goodreads platform and quantify various characteristic features of the Goodreads entities to identify differences between these Amazon best sellers (ABS) and the other non-best-selling books. We then develop a prediction model using the characteristic features to predict whether a book will become a best seller within 1 month (15 days) of its publication. On a balanced set, we achieve a very high average accuracy of 88.72% (85.66%) for the prediction, where the other competitive class contains books randomly selected from the Goodreads dataset. Our method, primarily based on features derived from user posts and genre-related characteristic properties, achieves an improvement of 16.4% over the traditional popularity factor (ratings, reviews)-based baseline methods. We also evaluate our model on two more competitive sets of books: (a) books that are both highly rated and have received a large number of reviews but are not best sellers (HRHR), and (b) Goodreads Choice Awards Nominated books that are non-best sellers (GCAN). We achieve quite good results, with a very high average accuracy of 87.1% as well as a high ROC for ABS vs. GCAN. For ABS vs. HRHR, our model yields a high average accuracy of 86.22%.

## Voxel Illumination and Ray Casting

Posted on:

We studied the effect of illumination on a given 3D voxel set and used ray casting to construct an image of the voxel set on the image plane, given a viewpoint and the details of one or more point sources.

## Analysis on Gradient Propagation in Batch Normalized Residual Networks

Posted on:

In this work, we conduct a mathematical analysis of the effect of batch normalization (BN) on gradient backpropagation in residual network training; BN is believed to play a critical role in addressing the gradient vanishing/explosion problem. By analyzing the mean and variance behavior of the input and the gradient in the forward and backward passes through the BN and residual branches, respectively, we show that they work together to confine the gradient variance to a certain range across residual blocks in backpropagation. As a result, the gradient vanishing/explosion problem is avoided. We also show the relative importance of batch normalization with respect to the residual branches in residual networks.

## DeepTagRec: A Content-cum-User Based Tag Recommendation Framework for Stack Overflow

Posted on:

In this paper, we develop a content-cum-user based deep learning framework, DeepTagRec, to recommend appropriate question tags on Stack Overflow. The proposed system learns a content representation from the question title and body. Subsequently, the representation learnt from the heterogeneous relationships between users and tags is fused with the content representation for the final tag prediction. On a very large-scale dataset comprising half a million question posts, DeepTagRec beats all the baselines; in particular, it significantly outperforms the best-performing baseline, TagCombine, achieving overall gains of 60.8% and 36.8% in precision@3 and recall@10 respectively. DeepTagRec also achieves 63% and 33.14% maximum improvement in exact-k accuracy and top-k accuracy respectively over TagCombine.

## Regularization of GANs (BTech Thesis)

Posted on:

In this project, we aimed to solve the multi-manifold problem of GANs that use IPM metrics as the loss function. One of my approaches was to learn the tangent space at each point by local PCA and match tangent spaces using the fact that nearby points in the generated and original manifolds should have similar tangent planes. Another approach was based on boosting: increase the weights of generated points not present in the original manifold and construct a weighted MMD formulation using those weights. On low-dimensional data with multiple independent clusters, IPM GANs output interconnected clusters, while the weighted MMD formulation successfully separates them.

## Word2Sense: Sparse Interpretable Word Embeddings

Posted on:

We present an unsupervised method to generate Word2Sense word embeddings that are interpretable — each dimension of the embedding space corresponds to a fine-grained sense, and the non-negative value of the embedding along the j-th dimension represents the relevance of the j-th sense to the word. The underlying LDA-based generative model can be extended to refine the representation of a polysemous word in a short context, allowing us to use the embeddings in contextual tasks. On computational NLP tasks, Word2Sense embeddings compare well with other word embeddings generated by unsupervised methods. Across tasks such as word similarity, entailment, sense induction, and contextual interpretation, Word2Sense is competitive with the state-of-the-art method for that task. Word2Sense embeddings are at least as sparse and fast to compute as prior art.

## Effect of Activation Functions on the Training of Overparametrized Neural Nets

Posted on:

It is well-known that overparametrized neural networks trained using gradient-based methods quickly achieve small training error with appropriate hyperparameter settings. Recent papers have proved this statement theoretically for highly overparametrized networks under reasonable assumptions. These results either assume that the activation function is ReLU or they crucially depend on the minimum eigenvalue of a certain Gram matrix depending on the data, random initialization and the activation function. In the latter case, existing works only prove that this minimum eigenvalue is non-zero and do not provide quantitative bounds. On the empirical side, a contemporary line of investigations has proposed a number of alternative activation functions which tend to perform better than ReLU at least in some settings but no clear understanding has emerged. This state of affairs underscores the importance of theoretically understanding the impact of activation functions on training. In the present paper, we provide theoretical results about the effect of activation function on the training of highly overparametrized 2-layer neural networks. A crucial property that governs the performance of an activation is whether or not it is smooth. For non-smooth activations such as ReLU, SELU and ELU, all eigenvalues of the associated Gram matrix are large under minimal assumptions on the data. For smooth activations such as tanh, swish and polynomial, the situation is more complex. If the subspace spanned by the data has small dimension then the minimum eigenvalue of the Gram matrix can be small leading to slow training. But if the dimension is large and the data satisfies another mild condition, then the eigenvalues are large. If we allow deep networks, then the small data dimension is not a limitation provided that the depth is sufficient. We discuss a number of extensions and applications of these results.

## Non-Gaussianity of Stochastic Gradient Noise (Ongoing)

Posted on:

What enables Stochastic Gradient Descent (SGD) to achieve better generalization than Gradient Descent (GD) in Neural Network training? This question has attracted much attention. In this paper, we study the distribution of the Stochastic Gradient Noise (SGN) vectors during training. We observe that for batch sizes 256 and above, the distribution is best described as Gaussian, at least in the early phases of training. This holds across data-sets, architectures, and other choices.

## Learning and Generalization in RNNs

Posted on:

Simple recurrent neural networks (RNNs) and their more advanced cousins, such as LSTMs, have been very successful in sequence modeling. Their theoretical understanding, however, is lacking and has not kept pace with the progress for feedforward networks, where a reasonably complete understanding in the special case of highly overparametrized one-hidden-layer networks has emerged. In this paper, we make progress towards remedying this situation by proving that RNNs can learn functions of sequences. In contrast to previous work that could only deal with functions of sequences that are sums of functions of individual tokens in the sequence, we allow general functions. Conceptually and technically, we introduce new ideas which enable us to extract information from the hidden state of the RNN in our proofs, addressing a crucial weakness in previous work. We illustrate our results on some regular language recognition problems.

## Word2Sense: Sparse Interpretable Word Embeddings

Abhishek Panigrahi, Harsha Vardhan Simhadri, Chiranjib Bhattacharyya
Published at: Association for Computational Linguistics (ACL), 2019

Oral presentation (270/3000 submissions ≈ 9% Acceptance Rate).

[paper] [bib]

## Effect of Activation Functions on the Training of Overparametrized Neural Nets

Abhishek Panigrahi, Abhishek Shetty, Navin Goyal
Published at: International Conference on Learning Representations (ICLR), 2020

[paper]

## Learning and Generalization in RNNs

Abhishek Panigrahi, Navin Goyal
Published at: Neural Information Processing Systems (NeurIPS), 2021

[paper] [bib]

## Understanding Gradient Descent on Edge of Stability in Deep Learning

Sanjeev Arora, Zhiyuan Li, Abhishek Panigrahi $\text{ }^{(\alpha-\beta)}$
Published at: International Conference on Machine Learning (ICML), 2022

[paper]

Sadhika Malladi$\text{ }^{*}$, Kaifeng Lyu$\text{ }^{*}$, Abhishek Panigrahi, Sanjeev Arora
Published at: Neural Information Processing Systems (NeurIPS), 2022

[paper]

## Task-Specific Skill Localization in Fine-tuned Language Models

Abhishek Panigrahi$\text{ }^{*}$, Nikunj Saunshi$\text{ }^{*}$, Haoyu Zhao, Sanjeev Arora
Published at: International Conference on Machine Learning (ICML), 2023

[paper]

## Do Transformers Parse while Predicting the Masked Word?

Haoyu Zhao$\text{ }^{*}$, Abhishek Panigrahi$\text{ }^{*}$, Rong Ge, Sanjeev Arora
Posted on:

[paper]

## Understanding Gradient Descent on Edge of Stability in Deep Learning

Posted on:

Hosted by Praneeth Netrapalli

## Task-Specific Skill Localization in Fine-tuned Language Models

Posted on:

Hosted by Marc’Aurelio Ranzato

## Task-Specific Skill Localization in Fine-tuned Language Models

Posted on:

Hosted by Prof. Yee Whye Teh