<p>Abhishek Panigrahi, Research Fellow at Microsoft Research India (t-abpani@microsoft.com). Feed: https://abhishekpanigrahi1996.github.io/feed.xml (Jekyll, updated 2020-04-12T13:16:55-07:00)</p>
<p>Analysis of Newton’s Method (2019-10-12): https://abhishekpanigrahi1996.github.io/posts/2019/10/blog-post-9</p>
<p>In optimization, Newton’s method is used to find the roots of the derivative of a twice-differentiable function, given oracle access to its gradient and Hessian. At the cost of memory that is super-linear in the dimension of the ambient space, Newton’s method takes advantage of second-order curvature and optimizes the objective function at a quadratically convergent rate. Here I consider the case when the objective function is smooth and strongly convex.</p>
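<p>As a minimal sketch of the quadratic convergence described above, the snippet below runs Newton’s method on a one-dimensional toy objective. The objective <script type="math/tex">f(x) = e^x - x + x^2/2</script>, the starting point, and the helper names are assumptions chosen for illustration and are not taken from the PDF. The function is strongly convex (<script type="math/tex">f''(x) \geq 1</script>) and its minimizer is <script type="math/tex">x^\star = 0</script>, so the error of an iterate is simply its absolute value.</p>

```python
import math


def grad(x):
    # Toy objective (an assumption): f(x) = exp(x) - x + x**2 / 2,
    # so f'(x) = exp(x) - 1 + x. expm1 keeps precision near x = 0.
    return math.expm1(x) + x


def hess(x):
    # f''(x) = exp(x) + 1 >= 1, so f is strongly convex.
    return math.exp(x) + 1.0


def newton(x0, steps):
    """Return the iterates x_0, ..., x_steps of Newton's method on f."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - grad(x) / hess(x))
    return xs


iterates = newton(1.0, 6)
for k, x in enumerate(iterates):
    # The minimizer is 0, so |x_k| is the error at step k.
    print(k, abs(x))
```

<p>Each error is roughly proportional to the square of the previous one, so the number of correct digits about doubles per step. In <script type="math/tex">d</script> dimensions the Hessian is a <script type="math/tex">d \times d</script> matrix, which is where the super-linear memory cost comes from, and the division becomes a linear solve.</p>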
<p>The complete PDF post can be viewed <a href="\files\newton.pdf" target="_blank">here</a>.</p>
<p>Deriving the Fokker-Planck equation (2019-06-11): https://abhishekpanigrahi1996.github.io/posts/2019/06/blog-post-8</p>
<p>In the theory of dynamical systems, the Fokker-Planck equation describes the time evolution of a probability density function. It is a partial differential equation that describes how the density of a stochastic process changes as a function of time under the influence of a potential field. It is commonly applied in the study of Brownian motion, the Ornstein–Uhlenbeck process, and statistical physics. Here I note the formal derivation of the partial differential equation by deriving a master equation and using a Taylor series to obtain the Kramers-Moyal expansion. The special case in which the expansion is truncated to a finite sum (after the second-order term) is the Fokker-Planck equation.</p>
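<p>For orientation, the two equations named above can be written as follows for a one-dimensional Markov process <script type="math/tex">X_t</script> with density <script type="math/tex">p(x,t)</script>. The notation (the coefficients <script type="math/tex">D^{(n)}</script> in particular) is an assumption made here and may differ from the one used in the PDF.</p>

```latex
% Kramers-Moyal expansion
\frac{\partial p(x,t)}{\partial t}
  = \sum_{n=1}^{\infty} \left(-\frac{\partial}{\partial x}\right)^{n}
    \left[ D^{(n)}(x,t)\, p(x,t) \right],
\qquad
D^{(n)}(x,t) = \frac{1}{n!} \lim_{\tau \to 0} \frac{1}{\tau}\,
    \mathbb{E}\left[ (X_{t+\tau} - x)^{n} \,\middle|\, X_t = x \right].

% Keeping only the terms n = 1 and n = 2 gives the Fokker-Planck equation
\frac{\partial p(x,t)}{\partial t}
  = -\frac{\partial}{\partial x}\left[ D^{(1)}(x,t)\, p(x,t) \right]
    + \frac{\partial^{2}}{\partial x^{2}}\left[ D^{(2)}(x,t)\, p(x,t) \right].
```

<p>For example, the Ornstein–Uhlenbeck process <script type="math/tex">dX_t = -\theta X_t\,dt + \sigma\,dW_t</script> has drift coefficient <script type="math/tex">D^{(1)} = -\theta x</script> and diffusion coefficient <script type="math/tex">D^{(2)} = \sigma^2/2</script>, and its stationary density is Gaussian with variance <script type="math/tex">\sigma^2/(2\theta)</script>.</p>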
<p>The complete PDF post can be viewed <a href="\files\fokker_planck.pdf" target="_blank">here</a>.</p>
<p>The motivation behind understanding the derivation is to study Lévy flight processes, which have recently caught my attention.</p>
<p>SGD without replacement (2019-03-24): https://abhishekpanigrahi1996.github.io/posts/2019/03/blog-post-7</p>
<p>This article continues my <a href="https://raghavsomani.github.io/posts/2018/04/blog-post-6/">previous blog</a> and discusses the work by <a href="https://arxiv.org/pdf/1903.01463.pdf" target="_blank">Prateek Jain, Dheeraj Nagaraj and Praneeth Netrapalli 2019</a>, which attempts to answer Léon Bottou’s (2009) open question of understanding SGD without replacement. The authors provide tight rates for SGD without replacement on general smooth functions and on general smooth and strongly convex functions, using the method of exchangeable pairs to bound Wasserstein distances together with techniques from optimal transport. They show that SGD without replacement on general smooth and strongly convex functions can achieve a rate of <script type="math/tex">\mathcal{O}\left( \frac{1}{K^2} \right)</script>, where <script type="math/tex">K</script> is the number of passes over the data. This result requires <script type="math/tex">K\in\Omega(\kappa^2)</script>, where <script type="math/tex">\kappa</script> is the condition number of the problem. It is strictly better than SGD with replacement, which has a rate of <script type="math/tex">\mathcal{O}\left( \frac{1}{K} \right)</script> in a similar setting. When <script type="math/tex">K</script> is smaller than <script type="math/tex">\mathcal{O}(\kappa^2)</script>, they show that SGD without replacement matches the rate of SGD with replacement. Their analysis does not require the Hessian Lipschitz condition required by previous works and holds for any general smooth and strongly convex function. They also show that SGD without replacement is at least as good as SGD with replacement for general smooth convex functions in the absence of strong convexity.</p>
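<p>To make the two sampling schemes concrete, here is a small sketch on a toy problem; it is not the setting or the analysis of the paper. The objective <script type="math/tex">f(x) = \frac{1}{n}\sum_i \frac{1}{2}(x - a_i)^2</script>, the step size <script type="math/tex">1/(t+1)</script>, and all function names are assumptions chosen for illustration. With this step size the iterate is the running mean of the sampled targets, so the difference between the two schemes is easy to see.</p>

```python
import random


def with_replacement_indices(n, passes, rng):
    """n * passes i.i.d. uniform draws from {0, ..., n-1}."""
    return [rng.randrange(n) for _ in range(n * passes)]


def without_replacement_indices(n, passes, rng):
    """A fresh random permutation of {0, ..., n-1} for every pass."""
    order = []
    for _ in range(passes):
        perm = list(range(n))
        rng.shuffle(perm)
        order.extend(perm)
    return order


def run_sgd(targets, indices, x0=0.0):
    """SGD on f(x) = (1/n) * sum_i 0.5 * (x - targets[i])**2.

    The step size at step t is 1 / (t + 1), an assumption for this toy
    example; it turns the iterate into the running mean of the samples.
    """
    x = x0
    for t, i in enumerate(indices):
        x -= (x - targets[i]) / (t + 1)
    return x


rng = random.Random(0)
targets = [float(i) for i in range(10)]
x_star = sum(targets) / len(targets)  # minimizer of f

x_with = run_sgd(targets, with_replacement_indices(10, 5, rng))
x_without = run_sgd(targets, without_replacement_indices(10, 5, rng))
print("error with replacement:   ", abs(x_with - x_star))
print("error without replacement:", abs(x_without - x_star))
```

<p>In this toy case every full pass without replacement visits each target exactly once, so the running mean equals the minimizer up to rounding, while sampling with replacement leaves the usual sampling error. The general problem is of course harder, which is what the paper addresses.</p>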
<p>The complete PDF post can be viewed <a href="\files\SGD_without_replacement.pdf" target="_blank">here</a>.</p>
<p>Non-asymptotic rate for Random Shuffling for Quadratic functions (2018-07-12): https://abhishekpanigrahi1996.github.io/posts/2018/07/blog-post-6</p>
<p>This article continues my <a href="https://raghavsomani.github.io/posts/2018/04/blog-post-4/" target="_blank">previous blog</a> and discusses a section of the work by <a href="https://arxiv.org/pdf/1806.10077.pdf" target="_blank">Jeffery Z. HaoChen and Suvrit Sra 2018</a>, in which the authors derive a non-asymptotic rate of <script type="math/tex">\mathcal{O}\left(\frac{1}{T^2} + \frac{n^3}{T^3} \right)</script> for the Random Shuffling stochastic algorithm, which is strictly better than that of SGD. The article covers the simple case in which the objective function is a sum of quadratic functions, where, with a fixed step size and after a reasonable number of epochs, we can guarantee a faster rate for Random Shuffling.</p>
<p>The complete PDF post can be viewed <a href="\files\RRQuadratic.pdf" target="_blank">here</a>.</p>
<p>Bias-Variance Trade-offs for Averaged SGD in Least Mean Squares (2018-07-04): https://abhishekpanigrahi1996.github.io/posts/2018/07/blog-post-5</p>
<p>This article is on the work by <a href="https://arxiv.org/pdf/1412.0156.pdf" target="_blank">Défossez and Bach 2014</a>, in which the authors develop an operator viewpoint for analyzing Averaged SGD updates to show the Bias-Variance Trade-off and provide tight convergence rates for the Least Mean Squares problem.</p>
<p>The complete PDF post can be viewed <a href="\files\BiasVariance.pdf" target="_blank">here</a>.</p>
<p>Random Reshuffling converges to a smaller neighborhood than SGD (2018-04-01): https://abhishekpanigrahi1996.github.io/posts/2018/04/blog-post-4</p>
<p>This article is on the recent work by <a href="https://arxiv.org/pdf/1803.07964.pdf" target="_blank">Ying et al. 2018</a>, in which the authors show that SGD with Random Reshuffling outperforms independent sampling with replacement: the MSE of the iterates at the end of each epoch is of the order <script type="math/tex">O(\eta^2)</script> for a constant step size <script type="math/tex">\eta</script>. This is a significant improvement over traditional SGD with i.i.d. sampling, where the same quantity is of the order <script type="math/tex">O(\eta)</script>.</p>
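<p>The effect described above can be observed on a very small example. The sketch below is an illustration under stated assumptions, not the experiment or the proof from the paper: the objective <script type="math/tex">\frac{1}{n}\sum_i \frac{1}{2}(x - a_i)^2</script>, the targets, the step size, and the function name <code>epoch_end_mse</code> are all hypothetical choices. It starts at the minimizer and measures how far a constant step size lets the iterate drift by the end of the last epoch.</p>

```python
import random


def epoch_end_mse(targets, eta, epochs, trials, reshuffle, seed=0):
    """Squared distance to the minimizer after the last epoch, averaged over trials.

    reshuffle=True  -> a fresh random permutation per epoch (Random Reshuffling)
    reshuffle=False -> n i.i.d. uniform draws per epoch (sampling with replacement)
    """
    rng = random.Random(seed)
    n = len(targets)
    x_star = sum(targets) / n  # minimizer of (1/n) * sum_i 0.5 * (x - a_i)**2
    total = 0.0
    for _trial in range(trials):
        x = x_star  # start at the minimizer to isolate the steady-state noise
        for _epoch in range(epochs):
            if reshuffle:
                order = list(range(n))
                rng.shuffle(order)
            else:
                order = [rng.randrange(n) for _ in range(n)]
            for i in order:
                x -= eta * (x - targets[i])  # constant step size eta
        total += (x - x_star) ** 2
    return total / trials


targets = [float(i) for i in range(10)]
eta = 0.01
mse_iid = epoch_end_mse(targets, eta, epochs=100, trials=100, reshuffle=False)
mse_rr = epoch_end_mse(targets, eta, epochs=100, trials=100, reshuffle=True)
print(f"i.i.d. sampling MSE:     {mse_iid:.3e}")
print(f"random reshuffling MSE:  {mse_rr:.3e}")
```

<p>On this toy problem the reshuffled iterates stay in a much smaller neighborhood of the minimizer than the i.i.d. ones. The toy is not meant to reproduce the constants or the exact exponents of the paper’s bounds.</p>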
<p>The complete PDF post can be viewed <a href="\files\SGDvsRR.pdf" target="_blank">here</a>.</p>
<p>Nesterov’s Acceleration (2018-03-30): https://abhishekpanigrahi1996.github.io/posts/2018/03/blog-post-3</p>
<p>This post contains an error vector analysis of Nesterov’s accelerated gradient descent method and some insightful implications that can be derived from it.</p>
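<p>As a small companion to the analysis, the sketch below tracks the error vector <script type="math/tex">x_k - x^\star</script> for plain gradient descent and for Nesterov’s method on a diagonal quadratic. The problem instance (eigenvalues 1 and 100), the step size <script type="math/tex">1/L</script>, the constant momentum <script type="math/tex">(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)</script>, and the function names are assumptions made here for illustration; the PDF may use a different variant.</p>

```python
import math


def gradient(x, eigs):
    # f(x) = 0.5 * sum_i eigs[i] * x[i]**2, minimized at x = 0.
    return [lam * xi for lam, xi in zip(eigs, x)]


def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


def gradient_descent(x0, eigs, steps):
    """Gradient descent with step size 1 / L."""
    L = max(eigs)
    x = list(x0)
    for _ in range(steps):
        g = gradient(x, eigs)
        x = [xi - gi / L for xi, gi in zip(x, g)]
    return x


def nesterov(x0, eigs, steps):
    """Nesterov's method with step size 1 / L and constant momentum."""
    L, mu = max(eigs), min(eigs)
    kappa = L / mu  # condition number
    beta = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)
    x_prev, x = list(x0), list(x0)
    for _ in range(steps):
        # Extrapolate, then take a gradient step from the extrapolated point.
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        g = gradient(y, eigs)
        x_prev, x = x, [yi - gi / L for yi, gi in zip(y, g)]
    return x


eigs = [1.0, 100.0]  # mu = 1, L = 100, kappa = 100
x0 = [1.0, 1.0]      # the minimizer is 0, so x itself is the error vector
print("GD error after 100 steps:      ", norm(gradient_descent(x0, eigs, 100)))
print("Nesterov error after 100 steps:", norm(nesterov(x0, eigs, 100)))
```

<p>On this instance the slow coordinate contracts by a factor of 0.99 per step under gradient descent, while under Nesterov’s method it behaves like <script type="math/tex">(1 + 0.1k)\,0.9^k</script>, which reflects the improvement from <script type="math/tex">1 - 1/\kappa</script> to <script type="math/tex">1 - 1/\sqrt{\kappa}</script>.</p>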
<p>Refer to the document <a href="\files\nesterov.pdf" target="_blank">here</a>.</p>
<p>Some resources to start with Fundamentals of Machine Learning (2018-01-06): https://abhishekpanigrahi1996.github.io/posts/2018/01/blog-post-2</p>
<p>With so many courses, books, and reading materials out there, here is a list of some that I personally find useful for building a fundamental understanding of Machine Learning.</p>
<p>At a higher level, Machine Learning requires a few mathematical prerequisites that lie at its heart:</p>
<ol>
<li>Learning Theory</li>
<li>Optimization</li>
<li>Statistical learning and high dimensional probability theory</li>
</ol>
<p>Some really nice resources are listed below.</p>
<ol>
<li>Learning Theory
<ol>
<li><a href="https://www.youtube.com/watch?v=mbyG85GZ0PI&list=PLD63A284B7615313A">Learning from Data - Caltech</a>.</li>
<li>The initial chapters from <a href="https://libgen.pw/download/book/5a1f05453a044650f50e3ec5">Foundations of Machine Learning - Mohri</a>, or Part I from <a href="http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf">Understanding Machine Learning From Theory to Algorithms - Shai Shalev-Shwartz and Shai Ben-David</a>.</li>
</ol>
</li>
<li>Optimization for Machine Learning
<ol>
<li>Large scale optimization for Machine Learning - Talks by Suvrit Sra - <a href="https://www.youtube.com/watch?v=AYcfpq5hH5g">Part 1</a>, <a href="https://www.youtube.com/watch?v=rNwkvvm2Hes">Part 2</a>, and <a href="https://www.youtube.com/watch?v=YwaWFto9KoQ">Part 3</a> - <a href="http://suvrit.de/teach/msr2015/index.html">Slides</a>.</li>
<li>Convex Optimization literature - <a href="https://www.youtube.com/watch?v=McLq1hEq3UY&list=PL3940DD956CDF0622">Convex Optimization course by Stephen Boyd</a> <a href="http://mlss11.bordeaux.inria.fr/docs/mlss11Bordeaux_Vandenberghe.pdf">Slides</a>, and the classical book on <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.693.855&rep=rep1&type=pdf">Introductory Lectures on Convex Programming - Yuri Nesterov</a>.</li>
<li><a href="https://arxiv.org/pdf/1712.07897.pdf">Non-convex Optimization for Machine Learning - Jain and Kar</a>.</li>
<li><a href="http://suvrit.de/mit/optml++/index.html">OPTML++ page by Suvrit Sra</a>.</li>
</ol>
</li>
<li>Statistical Learning and Probabilistic Machine Learning
<ol>
<li><a href="https://www.youtube.com/user/dataschool/playlists?sort=dd&view=50&shelf_id=4">Introduction to Statistical Learning - Trevor Hastie and Robert Tibshirani</a> - <a href="http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf">Introductory Book</a>, <a href="https://web.stanford.edu/~hastie/Papers/ESLII.pdf">Advanced Book</a>.</li>
<li><a href="http://dsd.future-lab.cn/members/2015nlp/Machine_Learning.pdf">Machine Learning: A Probabilistic Perspective - Kevin P Murphy</a>.</li>
</ol>
</li>
</ol>
<p>A survey on Large Scale Optimization (2017-12-11): https://abhishekpanigrahi1996.github.io/posts/2017/12/blog-post-1</p>
<p>Optimization is a very important aspect of Machine Learning: to get the best results, one needs fast and scalable methods before one can appreciate a learning model. Such algorithms minimize a class of functions <script type="math/tex">f(\mathbf{x})</script> that usually have no closed-form solution or, even when they do, are expensive to compute in both memory and time. This is where iterative methods turn out to be easy and handy. Analyzing such algorithms involves mathematical analysis of both the function to optimize and the algorithm. This post contains a summary and survey of the theoretical understanding of Large Scale Optimization, drawing on talks, papers, and lectures that I have come across recently. I hope that insight into the workings of these optimization algorithms will allow the reader to appreciate the rich literature on large scale optimization methods.</p>
<p>The complete PDF post can be viewed <a href="\files\largescaleopt.pdf" target="_blank">here</a>.</p>
<p>Readers, please note that the article is a compilation of popular and interesting results and is not meant for publication in any case.</p>