Descent Methods

Consider an unconstrained minimization problem $\min_x f(x)$. The general iteration formula in Descent Methods is $x_{k+1} = x_k + t_k \Delta x_k$, where $t_k$ is the step size and $\Delta x_k$ is the step direction at iteration $k$. The outline of a general Descent Method is as follows.

given a starting point $x$.
repeat
1. Determine a descent direction $\Delta x$.
2. Line search. Choose a step size $t > 0$.
3. Update. $x := x + t \Delta x$.
until a stopping criterion is satisfied.
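A minimal sketch of this loop in Python, assuming the gradient method as the direction rule and a backtracking line search for the step size (the function names and the parameters `alpha`, `beta`, `tol` are illustrative, not from the slides):

```python
import numpy as np

def descent(f, grad, x0, alpha=0.3, beta=0.8, tol=1e-6, max_iter=1000):
    """Generic descent loop: gradient direction + backtracking line search."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:                  # stopping criterion
            break
        dx = -g                                      # 1. descent direction (gradient method)
        t = 1.0                                      # 2. backtracking line search
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
            t *= beta
        x = x + t * dx                               # 3. update
    return x

# Example: minimize f(x) = ||x - 1||^2
f = lambda x: np.sum((x - 1.0) ** 2)
grad = lambda x: 2.0 * (x - 1.0)
print(descent(f, grad, np.zeros(2)))                 # approximately [1, 1]
```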

Descent algorithms differ in the choice of

  • Descent Direction: $\Delta x_k$

    • Gradient Method

    • Steepest Descent Method

    • Newton Method

  • Step Size: $t_k$

    • Fixed Step Size

    • Exact Line Search

    • Backtracking Line Search

To choose a Descent Direction, a necessary condition is $\nabla f(x)^T \Delta x < 0$.

  • Steepest Descent Method: $\Delta x = -P^{-1} \nabla f(x)$

    • by choosing a different norm (defined by a symmetric positive definite matrix $P$), we obtain a different descent direction.

  • Gradient Method: $\Delta x = -\nabla f(x)$

    • is a special case of the steepest descent method when $P$ is the identity matrix.

    • is the steepest direction under the first-order approximation of $f$.

  • Newton Method: $\Delta x = -\nabla^2 f(x)^{-1} \nabla f(x)$

    • is a special case of the steepest descent method when $P = \nabla^2 f(x)$.

    • is the steepest direction under the second-order approximation of $f$ (a sketch comparing the three directions follows this list).
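As a small illustration of the three direction choices, the sketch below evaluates them on a quadratic $f(x) = \tfrac{1}{2} x^T A x - b^T x$ (so $\nabla f(x) = Ax - b$ and $\nabla^2 f(x) = A$); the matrices $A$, $P$ and the point $x$ are arbitrary illustrative choices:

```python
import numpy as np

# Illustrative quadratic: f(x) = 0.5 x^T A x - b^T x
A = np.array([[10.0, 0.0],
              [0.0,  1.0]])               # Hessian (ill-conditioned on purpose)
b = np.array([1.0, 1.0])
x = np.zeros(2)

grad = A @ x - b                          # gradient at x
P = np.diag([2.0, 5.0])                   # some positive definite norm matrix

dx_gradient = -grad                       # Gradient Method
dx_steepest = -np.linalg.solve(P, grad)   # Steepest Descent in the P-norm
dx_newton   = -np.linalg.solve(A, grad)   # Newton Method (Hessian = A)

print(dx_gradient, dx_steepest, dx_newton)
# For a quadratic, x + dx_newton is exactly the minimizer A^{-1} b.
```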

References: EE230 Convex Optimization Lecture Slide Topic 6: Unconstrained Optimization; Convex Optimization notes provided by Dr. S. Boyd.

Newton's Method

In numerical analysis, Newton's method is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-valued function. In each step, the update equation is $x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$.

Limitations: it may not work if there are points of inflection, local maxima, or local minima around the initial guess $x_0$ or around the root. For example, suppose you need to find the root of $27x^3 - 3x + 1 = 0$ starting from an initial guess near $x = 0$. Then Newton's method will take you back and forth, or farther away from the root.
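A short sketch reproducing this behavior (the iteration is the standard scalar Newton update; the first few iterates can be checked by hand):

```python
def newton_root(f, df, x0, n_iter=6):
    """Newton's method: x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(n_iter):
        x = x - f(x) / df(x)
        print(f"x = {x:.4f}")
    return x

f = lambda x: 27 * x**3 - 3 * x + 1
df = lambda x: 81 * x**2 - 3              # f'(x)

# Starting near x = 0, the iterates jump to 0.3333, 0.1667, 1.0, ...
# moving back and forth and away from the real root (around x = -0.44).
newton_root(f, df, x0=0.0)
```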

Application in Inverse Kinematics:

  • Find $\theta_d$ such that $x_d - f(\theta_d) = 0$, where $x_d$ is the desired point in task space that we would like to reach, and $f(\theta)$ is the forward kinematics model of the manipulator.

  • In each iteration, $\theta^{i+1} = \theta^i + J^{-1}(\theta^i)\,(x_d - f(\theta^i))$, where $J^{-1}(\theta^i)$ is the inverse of the Jacobian matrix. For a non-square Jacobian, use the pseudoinverse instead (a sketch follows this list).
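A minimal sketch of this iteration for a planar 2R arm with unit link lengths (the arm model, target, and initial guess below are illustrative assumptions, not from the references):

```python
import numpy as np

def fk(theta):
    """Forward kinematics of a planar 2R arm with unit link lengths."""
    t1, t2 = theta
    return np.array([np.cos(t1) + np.cos(t1 + t2),
                     np.sin(t1) + np.sin(t1 + t2)])

def jacobian(theta):
    """Jacobian of fk with respect to the joint angles."""
    t1, t2 = theta
    return np.array([[-np.sin(t1) - np.sin(t1 + t2), -np.sin(t1 + t2)],
                     [ np.cos(t1) + np.cos(t1 + t2),  np.cos(t1 + t2)]])

def ik_newton(x_d, theta0, tol=1e-6, max_iter=50):
    """Newton-Raphson IK: theta <- theta + J^+(theta) (x_d - f(theta))."""
    theta = np.array(theta0, dtype=float)
    for _ in range(max_iter):
        err = x_d - fk(theta)
        if np.linalg.norm(err) < tol:
            break
        theta = theta + np.linalg.pinv(jacobian(theta)) @ err
    return theta

x_d = np.array([1.0, 1.0])                 # desired task-space position
theta = ik_newton(x_d, theta0=[0.1, 0.5])
print(theta, fk(theta))                    # fk(theta) should be close to x_d
```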

References: https://en.wikipedia.org/wiki/Newton%27s_method; https://brilliant.org/wiki/newton-raphson-method/; Chapter 6 in Modern Robotics textbook

Newton's Method in Optimization

In optimization, Newton's method is applied to the derivative $f'$ of a twice-differentiable function $f$ to find the roots of the derivative (solutions to $f'(x) = 0$), also known as the stationary points of $f$. These solutions may be minima, maxima, or saddle points.

Putting everything together, Newton's method performs the iteration

$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)} = x_k - [f''(x_k)]^{-1} f'(x_k) = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k),$$

where $\nabla f(x)$ is the gradient and $\nabla^2 f(x) = H_f(x)$ is the Hessian matrix.

Often Newton's method is modified to include a small step size $0 < \gamma \leq 1$ instead of $\gamma = 1$: $x_{k+1} = x_k - \gamma\,[f''(x_k)]^{-1} f'(x_k)$. For step sizes other than 1, the method is often referred to as the relaxed or damped Newton's method.
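A minimal sketch of the damped iteration in several variables (the test function and the fixed $\gamma$ are illustrative choices):

```python
import numpy as np

def damped_newton(grad, hess, x0, gamma=0.5, tol=1e-8, max_iter=100):
    """Damped Newton: x <- x - gamma * [hess(x)]^{-1} grad(x)."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - gamma * np.linalg.solve(hess(x), g)
    return x

# Illustrative smooth convex function f(x, y) = x^4 + y^2
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2, 0.0],
                           [0.0,          2.0]])
print(damped_newton(grad, hess, x0=[1.0, 1.0]))   # approaches the minimizer (0, 0)
```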

Reference: https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization

Gauss-Newton Algorithm

The Gauss–Newton algorithm is used to solve non-linear least squares problems. It is a modification of Newton's method for finding a minimum of a function.

Unlike Newton's method, the Gauss–Newton algorithm can only be used to minimize a sum of squared function values, but it has the advantage that second derivatives, which can be challenging to compute, are not required.

Starting with an initial guess $\boldsymbol{\beta}^{(0)}$ for the minimum, the method proceeds by the iterations

$$\boldsymbol{\beta}^{(k+1)} = \boldsymbol{\beta}^{(k)} - \left(\mathbf{J_r}^{\mathsf{T}} \mathbf{J_r}\right)^{-1} \mathbf{J_r}^{\mathsf{T}} \mathbf{r}\left(\boldsymbol{\beta}^{(k)}\right)$$

Here $\mathbf{r}(\boldsymbol{\beta})$ is the vector of $m$ residuals and $\mathbf{J_r}$ is its $m \times n$ Jacobian with respect to the $n$ parameters $\boldsymbol{\beta}$. Note that $\left(\mathbf{J_r}^{\mathsf{T}} \mathbf{J_r}\right)^{-1} \mathbf{J_r}^{\mathsf{T}}$ is the left pseudoinverse of $\mathbf{J_r}$, and the assumption $m \ge n$ in the algorithm statement is necessary, as otherwise the matrix $\mathbf{J_r}^{\mathsf{T}} \mathbf{J_r}$ is not invertible and the normal equations cannot be solved uniquely.
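One way to implement this iteration is sketched below, fitting an exponential model $y \approx \beta_1 e^{\beta_2 t}$ to synthetic data (the model, data, and iteration count are illustrative assumptions):

```python
import numpy as np

def gauss_newton(residual, jac, beta0, n_iter=20):
    """Gauss-Newton: beta <- beta - (J^T J)^{-1} J^T r(beta)."""
    beta = np.array(beta0, dtype=float)
    for _ in range(n_iter):
        r = residual(beta)
        J = jac(beta)
        beta = beta - np.linalg.solve(J.T @ J, J.T @ r)
    return beta

# Synthetic data from y = 2 * exp(1.5 * t) plus a little noise
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(1.5 * t) + 0.01 * np.random.randn(t.size)

residual = lambda beta: beta[0] * np.exp(beta[1] * t) - y                 # r(beta)
jac = lambda beta: np.column_stack([np.exp(beta[1] * t),                  # dr/dbeta1
                                    beta[0] * t * np.exp(beta[1] * t)])   # dr/dbeta2

print(gauss_newton(residual, jac, beta0=[1.0, 1.0]))   # approximately [2.0, 1.5]
```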

Reference: https://en.wikipedia.org/wiki/Gauss%E2%80%93Newton_algorithm

Levenberg–Marquardt Algorithm

The Levenberg–Marquardt algorithm (LMA), also known as the damped least-squares (DLS) method, is used to solve non-linear least squares problems. The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent.

With the Gauss–Newton method, the sum of squares of the residuals $S$ may not decrease at every iteration. However, since $\Delta$ is a descent direction, unless $S(\boldsymbol{\beta}^k)$ is a stationary point, it holds that $S(\boldsymbol{\beta}^k + \alpha \Delta) < S(\boldsymbol{\beta}^k)$ for all sufficiently small $\alpha > 0$. Thus, if divergence occurs, one solution is to employ a fraction $\alpha$ of the increment vector $\Delta$ in the updating formula: $\boldsymbol{\beta}^{k+1} = \boldsymbol{\beta}^k + \alpha \Delta$. In other words, the increment vector is too long, but it still points "downhill", so going just a part of the way will decrease the objective function $S$. An optimal value for $\alpha$ can be found by using a line search algorithm.

In cases where the direction of the shift vector is such that the optimal fraction $\alpha$ is close to zero, an alternative method for handling divergence is the use of the Levenberg–Marquardt algorithm, a trust region method. The normal equations are modified in such a way that the increment vector is rotated towards the direction of steepest descent,

$$\left(\mathbf{J}^{\mathsf{T}} \mathbf{J} + \lambda \mathbf{D}\right) \Delta = -\mathbf{J}^{\mathsf{T}} \mathbf{r}$$

where $\mathbf{D}$ is a positive diagonal matrix and $\lambda$ is a parameter that controls the trust-region size. Geometrically, this adds a paraboloid centered at $\Delta = 0$ to the quadratic form, resulting in a smaller step. Note that when $\mathbf{D}$ is the identity matrix and $\lambda \to +\infty$, then $\lambda \Delta \to -\mathbf{J}^{\mathsf{T}} \mathbf{r}$; therefore, the direction of $\Delta$ approaches the direction of the negative gradient $-\mathbf{J}^{\mathsf{T}} \mathbf{r}$.
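A minimal sketch of the resulting update, reusing the exponential-fitting setup from the Gauss-Newton sketch above ($\mathbf{D}$ is taken as the diagonal of $\mathbf{J}^{\mathsf{T}}\mathbf{J}$, and the simple rule for adjusting $\lambda$ is one common heuristic, not the only choice):

```python
import numpy as np

def levenberg_marquardt(residual, jac, beta0, lam=1e-2, n_iter=50):
    """LM step: solve (J^T J + lambda * D) delta = -J^T r, with D = diag(J^T J)."""
    beta = np.array(beta0, dtype=float)
    for _ in range(n_iter):
        r = residual(beta)
        J = jac(beta)
        A = J.T @ J
        D = np.diag(np.diag(A))                  # positive diagonal scaling matrix
        delta = np.linalg.solve(A + lam * D, -J.T @ r)
        if np.sum(residual(beta + delta) ** 2) < np.sum(r ** 2):
            beta, lam = beta + delta, lam * 0.5  # good step: behave more like GN
        else:
            lam *= 2.0                           # bad step: move toward gradient descent
    return beta

# Same illustrative data and model as in the Gauss-Newton sketch
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(1.5 * t) + 0.01 * np.random.randn(t.size)
residual = lambda beta: beta[0] * np.exp(beta[1] * t) - y
jac = lambda beta: np.column_stack([np.exp(beta[1] * t),
                                    beta[0] * t * np.exp(beta[1] * t)])

print(levenberg_marquardt(residual, jac, beta0=[1.0, 1.0]))   # approximately [2.0, 1.5]
```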

Conclusions:

  • The main difference between the GN and LM methods is the choice of the matrix $H$ in the normal equations: GN uses $H = \mathbf{J}^{\mathsf{T}} \mathbf{J}$, while LM uses $H = \mathbf{J}^{\mathsf{T}} \mathbf{J} + \lambda \mathbf{D}$.

  • When the function can be well approximated by its second-order expansion, we use the Gauss–Newton method.

  • When the approximation is poor, so that the matrix $\mathbf{J}^{\mathsf{T}} \mathbf{J}$ may be singular or ill-conditioned, we use the Levenberg–Marquardt method.

References: https://en.wikipedia.org/wiki/Gauss%E2%80%93Newton_algorithm; https://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm
