Sketchy Polytopes

Accelerating first-order methods

The lower bound on the oracle complexity of minimizing a continuously differentiable, β-smooth convex function is $\Omega(1/\sqrt{\epsilon})$ [Theorem 2.1.6, Nesterov04; Theorem 3.8, Bubeck14; Nesterov08]. General first-order gradient descent does not achieve this - e.g. L1-Prox, or ISTA, achieves only $O(1/\epsilon)$. Nesterov, in a series of papers [Nesterov83, Nesterov88, Nesterov07], proposed techniques that improve the convergence rate for smooth functions to $O(1/t^2)$, which matches the lower bound. In this post, we discuss Nesterov acceleration.

Accelerated gradient descent: α-strongly convex, β-smooth

Algorithm: Given a starting point $x_1 = y_1$, for $k \ge 1$

$$y_{k+1} = x_k - \frac{1}{\beta}\nabla f(x_k)$$
$$x_{k+1} = \left(1 + \frac{\sqrt{Q}-1}{\sqrt{Q}+1}\right)y_{k+1} - \frac{\sqrt{Q}-1}{\sqrt{Q}+1}\,y_k$$
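
As a concrete illustration, here is a minimal NumPy sketch of this update on a simple strongly convex quadratic; the matrix, dimension, and iteration count are assumptions made for the example, not part of the algorithm statement:

```python
import numpy as np

# Minimal sketch of the strongly convex variant above, on an illustrative
# quadratic f(x) = 0.5 * x^T A x (so x* = 0); A, d, and the iteration count
# are assumptions made for this example.
np.random.seed(0)
d = 50
eigs = np.linspace(1.0, 100.0, d)
A = np.diag(eigs)
alpha, beta = eigs[0], eigs[-1]              # strong convexity / smoothness constants
Q = beta / alpha                             # condition number
theta = (np.sqrt(Q) - 1) / (np.sqrt(Q) + 1)  # momentum coefficient

x = y = np.random.randn(d)                   # x_1 = y_1
for k in range(1, 101):
    y_next = x - (A @ x) / beta              # gradient step: y_{k+1} = x_k - grad f(x_k) / beta
    x = (1 + theta) * y_next - theta * y     # momentum step
    y = y_next

print("f(y_k) =", 0.5 * y @ A @ y)           # should be tiny; the bound below predicts ~exp(-k/sqrt(Q)) decay
```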

Theorem: For $\alpha$-strongly convex, $\beta$-smooth functions with condition number $Q = \beta/\alpha$, Nesterov's accelerated gradient descent satisfies the following bound [Theorem 3.11, Bubeck14], which comes close to the lower bound [Theorem 2.1.13, Nesterov04]:

$$f(y_k) - f(x^\ast) \le \frac{\alpha + \beta}{2}\|x_1 - x^\ast\|^2 \exp\left(-\frac{k-1}{\sqrt{Q}}\right)$$
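
For concreteness, with $Q = 100$ the bound contracts by a factor of $\exp(-1/\sqrt{Q}) = e^{-0.1}$ per iteration, so roughly $1 + \sqrt{Q}\,\ln(1/\epsilon)$ iterations (about 230 for $\epsilon = 10^{-10}$) reduce the initial gap term by a factor of $\epsilon$, illustrating the $\sqrt{Q}$ dependence of the iteration count.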

Proof sketch:

  1. Following [Bubeck14], one can define an α-strongly convex quadratic approximation to $f(x)$ that improves with each iteration:
$$\Phi_1(x) = f(x_1) + \frac{\alpha}{2}\|x - x_1\|^2$$
$$\Phi_{k+1}(x) = \left(1 - \frac{1}{\sqrt{Q}}\right)\Phi_k(x) + \frac{1}{\sqrt{Q}}\left(f(x_k) + \nabla f(x_k)^\top(x - x_k) + \frac{\alpha}{2}\|x - x_k\|^2\right)$$

Then, using the definition of α-strong convexity, one can show that

$$\Phi_{k+1}(x) \le f(x) + \left(1 - \frac{1}{\sqrt{Q}}\right)^k\bigl(\Phi_1(x) - f(x)\bigr)$$
  2. Now, assuming that the following inequality holds (it is established by induction in [Bubeck14]),
$$f(y_k) \le \min_x \Phi_k(x)$$
  3. Then one can argue that
$$f(y_k) - f(x^\ast) \le \Phi_k(x^\ast) - f(x^\ast) \le \left(1 - \frac{1}{\sqrt{Q}}\right)^{k-1}\bigl(\Phi_1(x^\ast) - f(x^\ast)\bigr) \le \frac{\alpha+\beta}{2}\|x_1 - x^\ast\|^2\left(1 - \frac{1}{\sqrt{Q}}\right)^{k-1}$$
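
The last inequality in step 3 uses β-smoothness, $f(x_1) - f(x^\ast) \le \frac{\beta}{2}\|x_1 - x^\ast\|^2$, and the stated bound then follows from $\left(1 - \frac{1}{\sqrt{Q}}\right)^{k-1} \le \exp\left(-\frac{k-1}{\sqrt{Q}}\right)$, using $1 - u \le e^{-u}$.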

Accelerated gradient descent: β-smooth

Algorithm: Given a starting point $x_1 = y_1$, for $k \ge 1$

$$\lambda_0 = 0, \qquad \lambda_k = \frac{1 + \sqrt{1 + 4\lambda_{k-1}^2}}{2}, \qquad \gamma_k = \frac{1 - \lambda_k}{\lambda_{k+1}}$$
$$y_{k+1} = x_k - \frac{1}{\beta}\nabla f(x_k)$$
$$x_{k+1} = (1 - \gamma_k)\,y_{k+1} + \gamma_k\,y_k$$
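
Again, a minimal NumPy sketch of this scheme on an illustrative quadratic (the matrix, dimension, and iteration count are assumptions made for the example); note that $\gamma_k \le 0$, so the combination step extrapolates beyond $y_{k+1}$:

```python
import numpy as np

# Minimal sketch of the smooth (alpha = 0) variant above, on an illustrative
# quadratic f(x) = 0.5 * x^T A x; A, d, and the iteration count are assumptions.
np.random.seed(0)
d = 50
A = np.diag(np.linspace(0.1, 10.0, d))
beta = 10.0                                       # smoothness constant (largest eigenvalue of A)

x = y = np.random.randn(d)                        # x_1 = y_1
lam = 0.0                                         # lambda_0 = 0
for k in range(1, 101):
    lam_k = (1 + np.sqrt(1 + 4 * lam ** 2)) / 2       # lambda_k from lambda_{k-1}
    lam_k1 = (1 + np.sqrt(1 + 4 * lam_k ** 2)) / 2    # lambda_{k+1}, needed for gamma_k
    gamma = (1 - lam_k) / lam_k1                      # gamma_k (non-positive, so this extrapolates)
    y_next = x - (A @ x) / beta                       # y_{k+1} = x_k - grad f(x_k) / beta
    x = (1 - gamma) * y_next + gamma * y              # x_{k+1}
    y = y_next
    lam = lam_k

print("f(y_k) =", 0.5 * y @ A @ y)                # decays like O(1/k^2), per the theorem below
```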

Theorem: For convex functions that are not strongly convex (i.e. α = 0), Nesterov's accelerated gradient descent satisfies the following bound [Theorem 3.12, Bubeck14], which is within a constant factor of the lower bound [Theorem 2.1.7, Nesterov04]:

$$f(y_k) - f(x^\ast) \le \frac{2\beta\|x_1 - x^\ast\|^2}{k^2}$$
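
For concreteness, setting the right-hand side equal to $\epsilon$ shows that $k \ge \sqrt{2\beta}\,\|x_1 - x^\ast\|/\sqrt{\epsilon}$ iterations suffice, which is the $O(1/\sqrt{\epsilon})$ oracle complexity quoted in the introduction.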

Proof sketch:

  1. In a gradient descent scheme for β-smooth functions [Lemma 3.4, Bubeck14], using $\nabla f(x_k) = \beta(x_k - y_{k+1})$ from the update,
$$f(y_{k+1}) - f(y_k) \le \nabla f(x_k)^\top(x_k - y_k) - \frac{1}{2\beta}\|\nabla f(x_k)\|^2 = \beta(x_k - y_{k+1})^\top(x_k - y_k) - \frac{\beta}{2}\|x_k - y_{k+1}\|^2$$
$$f(y_{k+1}) - f(x^\ast) \le \beta(x_k - y_{k+1})^\top(x_k - x^\ast) - \frac{\beta}{2}\|x_k - y_{k+1}\|^2$$
  2. Let $\delta_k = f(y_k) - f(x^\ast)$. Scaling the first inequality by $(\lambda_k - 1)$ and adding the result to the second gives
$$\lambda_k\delta_{k+1} - (\lambda_k - 1)\delta_k \le \beta(x_k - y_{k+1})^\top\bigl(\lambda_k x_k - (\lambda_k - 1)y_k - x^\ast\bigr) - \frac{\beta}{2}\lambda_k\|x_k - y_{k+1}\|^2$$
  3. Scaling by $\lambda_k$ and using $\lambda_{k-1}^2 = \lambda_k^2 - \lambda_k$ (which follows from the definition of $\lambda_k$), after some manipulation
$$\lambda_k^2\delta_{k+1} - \lambda_{k-1}^2\delta_k \le \frac{\beta}{2}\Bigl(\|\lambda_k x_k - (\lambda_k - 1)y_k - x^\ast\|^2 - \|\lambda_k y_{k+1} - (\lambda_k - 1)y_k - x^\ast\|^2\Bigr) = \frac{\beta}{2}\bigl(\|u_k\|^2 - \|u_{k+1}\|^2\bigr), \quad \text{where } u_k = \lambda_k x_k - (\lambda_k - 1)y_k - x^\ast$$
  4. Summing these inequalities from $k = 1$ to $k = t - 1$ yields
$$\delta_t \le \frac{\beta}{2\lambda_{t-1}^2}\|u_1\|^2, \qquad \text{with } \lambda_{t-1} \ge \frac{t}{2}$$
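
Since $\lambda_1 = 1$, we have $u_1 = x_1 - x^\ast$, and combining the two facts above gives $\delta_t \le 2\beta\|x_1 - x^\ast\|^2/t^2$, which is the claimed bound.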

Interpretation: Chebyshev polynomials

Moritz Hardt elegantly describes an interpretation of accelerated gradient descent based on polynomial approximation [Hardt13]. Gradient descent can be seen as approximating the solution using a degree-k polynomial. For strongly convex quadratic functions, the gradient is an affine function of the current iterate, so each step is an affine map, i.e.

$$x_{k+1} = x_k - \eta_k \nabla f(x_k), \qquad \nabla f(x) = Ax - b \ \text{ for } \ f(x) = \tfrac{1}{2}x^\top A x - b^\top x$$

Assuming that $\sum_s \lambda_s = 1$,

$$x_{k+1} - x^\ast = \sum_{s=1}^{k}\lambda_s\,(x_s - x^\ast) = \sum_{s=1}^{k}\lambda_s A^s\,(x_1 - x^\ast) = P_k(A)\,(x_1 - x^\ast), \qquad \text{where } P_k(A) = \sum_{s=1}^{k}\lambda_s A^s$$
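
As a quick sanity check of this polynomial view, consider constant-step gradient descent on a quadratic, where the error polynomial takes the explicit form $(1 - \eta\mu)^k$; the matrix, step size, and iteration count in the sketch below are assumptions made for the example:

```python
import numpy as np

# Check of the polynomial view for constant-step gradient descent on a quadratic
# f(x) = 0.5 * x^T A x - b^T x: the error after k steps is P_k(A) (x_1 - x*)
# with P_k(mu) = (1 - eta * mu)^k, a degree-k polynomial equal to 1 at mu = 0.
# A, b, eta, and k below are assumptions made for this example.
np.random.seed(1)
d = 10
A = np.diag(np.linspace(1.0, 10.0, d))
b = np.random.randn(d)
x_star = np.linalg.solve(A, b)

eta, k = 0.1, 25
x = np.random.randn(d)                            # x_1
err_1 = x - x_star
for _ in range(k):
    x = x - eta * (A @ x - b)                     # plain gradient step

P_k = np.linalg.matrix_power(np.eye(d) - eta * A, k)
print(np.allclose(x - x_star, P_k @ err_1))       # True
```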

Thus, given that

$$\|x_{k+1} - x^\ast\| \le \max_{\mu}\,|P_k(\mu)|\;\|x_1 - x^\ast\|, \qquad \text{where } \mu \text{ ranges over the eigenvalues of } A,$$

$P_k(\mu)$ needs to be chosen so that (1) $P_k(\mu) = 1$ at $\mu = 0$, and (2) its maximum absolute value over the eigenvalues of $A$ is minimized. This is hard in general, since it requires knowing all the eigenvalues, but it can be achieved, under certain conditions, if the eigenvalues are assumed to lie in a continuous interval $[\alpha, \beta]$ [Kelner09]. In short, a scaled and shifted Chebyshev polynomial is the unique polynomial that minimizes this approximation error. For the Chebyshev polynomial, the maximum error satisfies

$$\max_{\mu}\,|P_k(\mu)| \le 2\left(\frac{\sqrt{Q}-1}{\sqrt{Q}+1}\right)^k$$

Moreover, since Chebyshev polynomials satisfy a three-term recurrence, requiring only the two previously computed polynomials, the corresponding gradient descent update depends only on the last two iterates.
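
As a small numerical sketch of these claims (the eigenvalue interval $[\alpha, \beta]$ and the degrees below are assumptions made for the example), the scaled and shifted Chebyshev polynomial $T_k\bigl((\beta + \alpha - 2\mu)/(\beta - \alpha)\bigr)/T_k\bigl((\beta + \alpha)/(\beta - \alpha)\bigr)$ equals 1 at $\mu = 0$, and its maximum over $[\alpha, \beta]$ stays within a factor of two of $\bigl((\sqrt{Q}-1)/(\sqrt{Q}+1)\bigr)^k$:

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

# Numerical sketch: the scaled and shifted Chebyshev polynomial on an assumed
# eigenvalue interval [alpha, beta] equals 1 at mu = 0, and its maximum over
# [alpha, beta] is within a factor of 2 of ((sqrt(Q) - 1) / (sqrt(Q) + 1))^k.
alpha, beta = 1.0, 100.0
Q = beta / alpha

def cheb_max_error(k):
    T_k = Chebyshev.basis(k)                              # Chebyshev polynomial T_k
    shift = lambda mu: (beta + alpha - 2 * mu) / (beta - alpha)
    P_k = lambda mu: T_k(shift(mu)) / T_k(shift(0.0))     # scaled/shifted so that P_k(0) = 1
    mus = np.linspace(alpha, beta, 2000)
    return np.max(np.abs(P_k(mus)))

for k in (5, 10, 20):
    rate = ((np.sqrt(Q) - 1) / (np.sqrt(Q) + 1)) ** k
    print(k, cheb_max_error(k), rate)
```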

This method is well known in numerical analysis, where it is used to speed up iterative linear solvers [Chapter 4, Kincaid02; Section 7.4, Saad11], and it predates Nesterov's work [Manteuffel77].

Extension: Bregman Divergence

Paul Tseng uses Bregman Divergences to give a unified framework for Nesterov’s methods across different classes of problems [Tseng08].

See also [Taboulle10, Candes11, Chen11, Gordon12] for good overviews of first-order acceleration methods. The next post discusses techniques that use accelerated gradient descent to solve regularization problems [Nesterov07, Beck09, Becker09].

References

  • Beck09 A. Beck, M. Teboulle. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal of Imaging Sciences, 2009
  • Becker09 S. Becker, J. Bobin, E. Candes. NESTA: A Fast and Accurate First-Order Method for Sparse Recovery. Technical Report, Caltech, 2009
  • Bubeck14 S. Bubeck. Theory of Convex Optimization for Machine Learning. 2014
  • Candes11 E. Candes. Math 301, Lecture Notes.
  • Chen11 H. Chen, X. Ming. Accelerating Nesterov’s Method for Strongly Convex Functions. Course presentation, 2011.
  • Hardt13 M. Hardt. The Zen of Gradient Descent. Blog, 2013.
  • Karniadakis00 G. E. Karniadakis, R. M. Kirby. Parallel Scientific Computing in C++ and MPI. 2000
  • Kelner09 J. Kelner. Lecture Notes. MIT 18.409, Lecture 22, 2009
  • Kincaid02 D. R. Kincaid, E. W. Cheney. Numerical Analysis: Mathematics of Scientific Computing. AMS, 2002
  • Nesterov83 Y. Nesterov. A method for solving a convex programming problem with convergence rate O(1/k²). Doklady AN SSSR, 1983
  • Nesterov88 Y. Nesterov. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonom. i. Mat. Metody, 1988
  • Nesterov04 Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004
  • Nesterov07 Y. Nesterov. Gradient methods for minimizing composite objective function. Report, CORE, 2007
  • Nesterov08 Y. Nesterov. How to advance in Structural Convex Optimization. Optima, 2008
  • Manteuffel77 T. A. Manteuffel. The Tchebyshev iteration for nonsymmetric linear systems. Numer. Math, 1977.
  • Saad11 Y. Saad. Numerical Methods for Large Eigenvalue Problems. SIAM, 2011
  • Taboulle10 M. Teboulle. First-Order Methods for Optimization. IPAM Optimization Tutorials, 2010
  • Gordon12 G. Gordon, R. Tibshirani. Course Notes, 10-725 Optimization, Fall 2012
  • Tseng08 P. Tseng. On Accelerated Proximal Gradient Algorithms for Convex-Concave Optimization. SIAM Journal of Optimization, 2008

Credit

Main image from Wolfram’s Chebyshev Polynomial of the First Kind: “A beautiful plot can be obtained by plotting Tn(x) radially, increasing the radius for each value of n, and filling in the areas between the curves (Trott 1999, pp. 10 and 84).”