Optimization is a very important aspect of machine learning: obtaining the best results requires fast and scalable methods before one can fully appreciate a learning model. Such algorithms minimize a class of functions $f(\mathbf{x})$ that usually do not admit a closed-form solution, or whose closed-form solutions are expensive in both memory and computation time. This is where iterative methods turn out to be easy and handy. Analyzing such algorithms involves mathematical analysis of both the function being optimized and the algorithm itself. This post is a summary and survey of the theoretical understanding of large-scale optimization, drawing on talks, papers, and lectures that I have come across recently. I hope these insights into how optimization algorithms work will allow the reader to appreciate the rich literature on large-scale optimization methods.
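As a minimal sketch of the iterative viewpoint, the loop below runs plain gradient descent on a simple quadratic whose minimizer is known; the specific function, step size, and iteration count are illustrative choices, not taken from this post:

```python
# A minimal sketch of iterative minimization via gradient descent.
# The quadratic objective and hyperparameters here are illustrative
# assumptions, not part of the original post.
import numpy as np

def gradient_descent(grad, x0, step=0.1, iters=100):
    """Repeatedly step against the gradient of f."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Example: f(x) = (1/2) * ||x - 1||^2, whose gradient is x - 1
# and whose minimizer is the all-ones vector.
grad = lambda x: x - np.ones_like(x)
x_star = gradient_descent(grad, x0=np.zeros(3))
```

Each iteration is cheap, and the iterates approach the minimizer geometrically, which is exactly the kind of behavior the convergence analyses surveyed here make precise.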

The complete PDF post can be viewed here.

Readers, please note that this article is a compilation of popular and interesting results and is not intended for publication.