# Descent Methods

## Descent Methods

Consider the unconstrained minimization problem $\min_x f(x)$. The general iteration of a descent method is $x_{k+1} = x_k + t_k \Delta x_k$, where $t_k$ is the step size and $\Delta x_k$ is the step direction at iteration $k$. The outline of a general descent method is as follows.

**given** a starting point $x$.
**repeat**
1. Determine a descent direction $\Delta x$.
2. Line search. Choose a step size $t>0$.
3. Update. $x:=x+t \Delta x$.
**until** stopping criterion is satisfied.
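As an illustrative sketch (not from the source slides), the loop above can be written in Python using the gradient direction and a backtracking line search, applied to a simple quadratic objective chosen here for demonstration:

```python
import numpy as np

def descent(f, grad, x, alpha=0.25, beta=0.5, tol=1e-8, max_iter=100):
    """Generic descent method: direction, line search, update, stopping test."""
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:       # stopping criterion: small gradient
            break
        dx = -g                           # descent direction (gradient method)
        t = 1.0
        # Backtracking line search: shrink t until the Armijo condition holds
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta
        x = x + t * dx                    # update
    return x

# Minimize f(x, y) = (x - 1)^2 + 2 y^2, whose minimizer is (1, 0)
f = lambda x: (x[0] - 1)**2 + 2 * x[1]**2
grad = lambda x: np.array([2 * (x[0] - 1), 4 * x[1]])
x_star = descent(f, grad, np.array([5.0, -3.0]))   # → converges to [1.0, 0.0]
```

Swapping the line `dx = -g` for a Newton or scaled direction gives the other methods listed below without changing the rest of the loop.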

Different algorithms differ with respect to the choice of:

Descent direction $\Delta x_k$:

- Gradient method
- Steepest descent method
- Newton's method

Step size $t_k$:

- Fixed step size
- Exact line search
- Backtracking line search
For $\Delta x$ to be a descent direction, it must satisfy $\nabla f(x)^T \Delta x < 0$.

**Steepest descent method:** $\Delta x = - P^{-1} \nabla f(x)$

By choosing different norms (i.e., different symmetric positive definite matrices $P$) we obtain different descent directions.

**Gradient method:** $\Delta x = - \nabla f(x)$

This is the special case of the steepest descent method in which $P$ is the identity matrix; it is the steepest direction under the first-order approximation.

**Newton's method:** $\Delta x = - \nabla^2 f(x)^{-1} \nabla f(x)$

This is the special case of the steepest descent method in which $P = \nabla^2 f(x)$; it is the steepest direction under the second-order approximation.

References: EE230 Convex Optimization Lecture Slide Topic 6: Unconstrained Optimization; Convex Optimization notes provided by Dr. S. Boyd.

## Newton's Method

In numerical analysis, Newton's method is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-valued function. In each step, the update equation is $x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$.
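The update above can be sketched in a few lines of Python; the square-root example at the end is illustrative, not from the source:

```python
def newton_root(f, fprime, x0, tol=1e-10, max_iter=50):
    """Newton's method for solving f(x) = 0 starting from x0."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x = x - fx / fprime(x)   # x_{n+1} = x_n - f(x_n) / f'(x_n)
    return x

# Find the root of f(x) = x^2 - 2 (i.e., sqrt(2)), starting from x0 = 1
r = newton_root(lambda x: x**2 - 2, lambda x: 2 * x, 1.0)
```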

Limitations: It may not work if there are points of inflection, local maxima, or local minima near the initial guess $x_0$ or the root. For example, suppose you need to find the root of $27x^3 - 3x + 1 = 0$ starting from an initial guess near $x = 0$. Then Newton's method will bounce back and forth, or move farther away from the root.

Application in Inverse Kinematics:

Find $\theta_d$ such that $x_d - f(\theta_d) = 0$, where $x_d$ is the desired point in task space that we would like to reach, and $f(\theta)$ is the forward kinematics model of the manipulator.

In each iteration, $\theta^{i+1} = \theta^i + J^{-1}(\theta^i)(x_d - f(\theta^i))$, where $J^{-1}(\theta^i)$ is the inverse of the Jacobian matrix. For non-square Jacobian matrices, use the pseudoinverse instead.
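As a sketch of this iteration, consider a hypothetical planar two-link arm (the link lengths, `fk`, and `jacobian` below are illustrative assumptions, not taken from the Modern Robotics text):

```python
import numpy as np

L1, L2 = 1.0, 1.0  # link lengths of a hypothetical two-link planar arm

def fk(theta):
    """Forward kinematics f(theta): joint angles -> end-effector (x, y)."""
    t1, t12 = theta[0], theta[0] + theta[1]
    return np.array([L1 * np.cos(t1) + L2 * np.cos(t12),
                     L1 * np.sin(t1) + L2 * np.sin(t12)])

def jacobian(theta):
    """Analytic Jacobian of fk with respect to theta."""
    t1, t12 = theta[0], theta[0] + theta[1]
    return np.array([[-L1 * np.sin(t1) - L2 * np.sin(t12), -L2 * np.sin(t12)],
                     [ L1 * np.cos(t1) + L2 * np.cos(t12),  L2 * np.cos(t12)]])

def ik_newton(x_d, theta0, tol=1e-8, max_iter=100):
    """Newton iteration: theta <- theta + pinv(J(theta)) (x_d - fk(theta))."""
    theta = np.array(theta0, dtype=float)
    for _ in range(max_iter):
        err = x_d - fk(theta)
        if np.linalg.norm(err) < tol:
            break
        theta = theta + np.linalg.pinv(jacobian(theta)) @ err
    return theta

theta_d = ik_newton(np.array([1.2, 0.8]), [0.3, 0.5])
```

Here the Jacobian is square, so `pinv` coincides with the ordinary inverse away from singularities; for redundant manipulators the same code applies unchanged.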

References: https://en.wikipedia.org/wiki/Newton%27s_method; https://brilliant.org/wiki/newton-raphson-method/; Chapter 6 in Modern Robotics textbook

## Newton's Method in Optimization

In optimization, Newton's method is applied to the derivative $f'$ of a twice-differentiable function $f$ to find the roots of the derivative (solutions to $f'(x) = 0$), also known as the stationary points of $f$. These solutions may be minima, maxima, or saddle points.

Putting everything together, Newton's method performs the iteration

$x_{k+1} = x_{k} - \frac{f'(x_k)}{f''(x_k)} = x_{k}-[f''(x_{k})]^{-1}f'(x_{k}) = x_k - [\nabla ^{2}f(x_k)]^{-1}\nabla f(x_k),$

where $\nabla f(x)$ is the gradient and $\nabla ^{2}f(x) = H_f(x)$ is the Hessian matrix.

Often Newton's method is modified to include a small step size $0 < \gamma \leq 1$ instead of $\gamma = 1$: $x_{k+1} = x_{k} - \gamma [f''(x_{k})]^{-1} f'(x_{k})$. For step sizes other than 1, the method is often referred to as the relaxed or **damped Newton's method**.
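A minimal sketch of the damped iteration, using a fixed $\gamma$ and an illustrative two-variable objective of my own choosing:

```python
import numpy as np

def damped_newton(grad, hess, x, gamma=0.5, tol=1e-10, max_iter=100):
    """Damped Newton: x <- x - gamma * H^{-1} grad(x), step size 0 < gamma <= 1."""
    x = np.array(x, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:               # stop at a stationary point
            break
        x = x - gamma * np.linalg.solve(hess(x), g)
    return x

# Minimize f(x) = x1^4 + x2^2, whose minimum is at the origin
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])
x_min = damped_newton(grad, hess, [1.0, 1.0])     # approaches [0, 0]
```

In practice $\gamma$ is usually chosen by a line search rather than fixed, which recovers the damped Newton method described above.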

Reference: https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization

## Gauss-Newton Algorithm

The Gauss–Newton algorithm is used to solve non-linear least squares problems. It is a modification of Newton's method for finding a minimum of a function.

Unlike Newton's method, the Gauss–Newton algorithm can only be used to minimize a sum of squared function values, but it has the advantage that second derivatives, which can be challenging to compute, are not required.

Starting with an initial guess $\boldsymbol{\beta}^{(0)}$ for the minimum, the method proceeds by the iterations

$\boldsymbol{\beta}^{(k+1)}= \boldsymbol{\beta}^{(k)}-\left(\mathbf {J_{r}} ^{\mathsf {T}}\mathbf {J_{r}} \right)^{-1}\mathbf {J_{r}} ^{\mathsf {T}}\mathbf {r} \left(\boldsymbol{\beta}^{(k)}\right)$

Note that $\left(\mathbf {J_{r}} ^{\mathsf {T}}\mathbf {J_{r}} \right)^{-1}\mathbf {J_{r}} ^{\mathsf {T}}$ is the left pseudoinverse of $\mathbf {J_{r}}$, and the assumption $m \ge n$ (at least as many residuals as parameters) in the algorithm statement is necessary; otherwise the matrix $\mathbf {J_{r}} ^{\mathsf {T}}\mathbf {J_{r}}$ is not invertible and the normal equations cannot be solved uniquely.
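The iteration can be sketched as follows; the residual function used for testing is the Rosenbrock function written in least-squares form, an illustrative choice rather than an example from the source:

```python
import numpy as np

def gauss_newton(residual, jac, beta, max_iter=50, tol=1e-12):
    """Gauss-Newton: beta <- beta - (J_r^T J_r)^{-1} J_r^T r(beta)."""
    beta = np.array(beta, dtype=float)
    for _ in range(max_iter):
        r = residual(beta)
        J = jac(beta)
        # lstsq applies the pseudoinverse of J without forming J^T J explicitly,
        # which is numerically more stable than solving the normal equations
        step, *_ = np.linalg.lstsq(J, r, rcond=None)
        if np.linalg.norm(step) < tol:
            break
        beta = beta - step
    return beta

# Rosenbrock in least-squares form: r = [10 (b2 - b1^2), 1 - b1], minimum at (1, 1)
residual = lambda b: np.array([10 * (b[1] - b[0]**2), 1 - b[0]])
jac = lambda b: np.array([[-20 * b[0], 10.0], [-1.0, 0.0]])
beta_hat = gauss_newton(residual, jac, [0.0, 0.0])   # → converges to [1, 1]
```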

Reference: https://en.wikipedia.org/wiki/Gauss%E2%80%93Newton_algorithm

## Levenberg–Marquardt Algorithm

Levenberg–Marquardt algorithm (LMA), also known as the **damped** least-squares (DLS) method, is used to solve non-linear least squares problems. The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent.

With the Gauss–Newton method, the sum of squares of the residuals $S$ may not decrease at every iteration. However, since $\Delta$ is a descent direction, unless $\boldsymbol {\beta }^{k}$ is a stationary point, it holds that $S (\boldsymbol{\beta}^{k}+\alpha \Delta ) < S (\boldsymbol{\beta}^{k})$ for all sufficiently small $\alpha > 0$. Thus, **if divergence occurs**, one solution is to employ a fraction $\alpha$ of the increment vector $\Delta$ in the updating formula: ${\boldsymbol {\beta }}^{k+1}={\boldsymbol {\beta }}^{k}+\alpha \Delta$. In other words, the increment vector is too long, but it still points "downhill", so going just a part of the way will decrease the objective function $S$. An optimal value for $\alpha$ can be found by using a line search algorithm.

In cases where the direction of the shift vector is such that the optimal fraction$\alpha$is close to zero, an alternative method for handling divergence is the use of the Levenberg–Marquardt algorithm, a trust region method. The normal equations are modified in such a way that the increment vector is **rotated towards the direction of steepest descent**,

$\left(\mathbf {J^{\mathsf {T}}J+\lambda D} \right)\Delta =-\mathbf {J} ^{\mathsf {T}}\mathbf {r}$

where $D$ is a positive diagonal matrix and $\lambda$ is a parameter that controls the trust-region size. Geometrically, this adds a paraboloid centered at $\Delta = 0$ to the quadratic form, resulting in a smaller step. Note that when $D$ is the identity matrix and $\lambda \to +\infty$, then $\lambda \Delta \to -\mathbf {J} ^{\mathsf {T}}\mathbf {r}$; therefore the direction of $\Delta$ approaches the direction of the negative gradient $-\mathbf {J} ^{\mathsf {T}}\mathbf {r}$.
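A sketch of a single damped step, showing how $\lambda$ interpolates between the Gauss–Newton step and the steepest-descent direction (the matrix $J$ and residual $r$ below are arbitrary illustrative data):

```python
import numpy as np

def lm_step(J, r, lam, D=None):
    """One Levenberg-Marquardt step: solve (J^T J + lam * D) delta = -J^T r."""
    JTJ = J.T @ J
    if D is None:
        D = np.eye(JTJ.shape[0])          # Levenberg's original choice, D = I
    return np.linalg.solve(JTJ + lam * D, -J.T @ r)

# As lam grows, the step shrinks and rotates toward -J^T r (steepest descent)
J = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
r = np.array([1.0, -1.0, 2.0])
small = lm_step(J, r, 1e-8)   # ≈ the Gauss-Newton step
large = lm_step(J, r, 1e8)    # ≈ -(1/lam) J^T r, the negative gradient direction
```

A practical LM implementation also adapts $\lambda$ between iterations: decrease it when a step reduces $S$, increase it when a step is rejected.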

Conclusions:

The main difference between the GN and LM methods is the choice of the matrix $H$: Gauss–Newton uses $H=\mathbf{J}^{\mathsf{T}} \mathbf{J}$, while Levenberg–Marquardt uses $H=\mathbf{J}^{\mathsf{T}} \mathbf{J} +\lambda D$.

When the function is well approximated by its second-order expansion, we use the Gauss-Newton method.

When the approximation is poor, e.g. when the matrix $\mathbf{J}^{\mathsf{T}} \mathbf{J}$ is singular or ill-conditioned, we use the Levenberg–Marquardt method.

References: https://en.wikipedia.org/wiki/Gauss%E2%80%93Newton_algorithm; https://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm
