Optimization Part 1 Basics and Least Squares

11 minute read ·

Published:

In common engineering problems, the optimization problem is to minimize the residual of a parametric model with respect to some observed data points.

Preliminaries

Given NN measurements {(xi,yi)}i=1N\left\{\left(\mathbf{x}_i, \mathbf{y}_i\right)\right\}_{i=1}^N where xiRp\mathbf{x}_i\in\mathbb{R}^p and yiRmi\mathbf{y}_i\in\mathbb{R}^{m_i}. We define a prediction model parametrized by θRn\boldsymbol{\theta}\in\mathbb{R}^n as h(;θ):RpRmi\mathbf{h}\left(\cdot;\boldsymbol{\theta}\right):\mathbb{R}^p\to\mathbb{R}^{m_i} and the residual of a single datapoint as ri:RnRmi\mathbf{r}_i:\mathbb{R}^n\to\mathbb{R}^{m_i}, where ri(θ)h(xi;θ)yi\mathbf{r}_i\left(\boldsymbol{\theta}\right)\coloneqq \mathbf{h}\left(\mathbf{x}_i;\boldsymbol{\theta}\right)-\mathbf{y}_i. The full residual r:RnRm\mathbf{r}:\mathbb{R}^{n}\to\mathbb{R}^m is simply a stack of every single residuals, i.e.,

r(θ)=[r1(θ)r2(θ)rN(θ)]Rm,m=i=1Nmi\begin{align*} \mathbf{r}\left(\boldsymbol{\theta}\right)= \begin{bmatrix} \mathbf{r}_1\left(\boldsymbol{\theta}\right)\\ \mathbf{r}_2\left(\boldsymbol{\theta}\right)\\ \vdots \\ \mathbf{r}_N\left(\boldsymbol{\theta}\right)\\ \end{bmatrix} \in\mathbb{R}^m, && m=\sum_{i=1}^{N}m_i \end{align*}

Note that we require 𝑚≥𝑛 because otherwise the prediction model cannot be fully defined, i.e., there are infinitely many 𝜃 that perfectly fit the data.

The goal of an optimization problem is to find a model that best fits the data. To formalize this, we need to define some function f:RnR0f:\mathbb{R}^n\to\mathbb{R}_{\geq 0} that properly maps the residual vector to a minimizable scalar, which is known as the objective function. In general, the function ff is assumed to be continuously differentiable. Some of the common objective functions are

  • Least Squares
f(θ)=12r(θ)2=12i=1Nri(θ)2 f(\boldsymbol{\theta}) = \frac{1}{2}\|\mathbf{r}(\boldsymbol{\theta})\|^2 = \frac{1}{2}\sum_{i=1}^{N}\|\mathbf{r}_i(\boldsymbol{\theta})\|^2
  • Least Squares with 1\ell_1 regularization
  f(θ)=12r(θ)2+λθ1=12i=1Nri(θ)2+λj=1nθj    f(\boldsymbol{\theta}) = \frac{1}{2}\|\mathbf{r}(\boldsymbol{\theta})\|^2 + \lambda\|\boldsymbol{\theta}\|_1 = \frac{1}{2}\sum_{i=1}^{N}\|\mathbf{r}_i(\boldsymbol{\theta})\|^2 + \lambda\sum_{j=1}^{n}|\theta_j|
  • Least Squares with 2\ell_2 regularization
f(θ)=12r(θ)2+λ2θ2=12i=1Nri(θ)2+λ2j=1nθj2 f(\boldsymbol{\theta}) = \frac{1}{2}\|\mathbf{r}(\boldsymbol{\theta})\|^2 + \frac{\lambda}{2}\|\boldsymbol{\theta}\|^2= \frac{1}{2}\sum_{i=1}^{N}\|\mathbf{r}_i(\boldsymbol{\theta})\|^2 + \frac{\lambda}{2}\sum_{j=1}^{n}\theta_j^2
  • Weighted Least Squares
f(θ)=12r(θ)Wr(θ)=12i=1Nri(θ)wiri(θ)f(\boldsymbol{\theta}) = \frac{1}{2}\mathbf{r}(\boldsymbol{\theta})^\top \mathbf{W}\mathbf{r}(\boldsymbol{\theta}) = \frac{1}{2}\sum_{i=1}^{N} \mathbf{r}_i(\boldsymbol{\theta})^\top w_i\mathbf{r}_i(\boldsymbol{\theta})
  • Cauchy Loss
f(θ)=i=1Nρ ⁣(ri(θ)),ρ(r)=α22log ⁣(1+r2α2)\begin{align*} f(\boldsymbol{\theta}) = \sum_{i=1}^{N} \rho\!\left(\|\mathbf{r}_i (\boldsymbol{\theta})\|\right), && \rho(r) = \frac{\alpha^2}{2}\log\!\left(1 + \frac{r^2}{\alpha^2}\right) \end{align*}

The following is a list of some of the theorems related to optimality conditions. Suppose ff is differentiable, and define the gradient (first-order derivative) and Hessian (second-order derivative) as

f(θ)[fθ1fθn]RnandH(θ)[2fθ122fθ1θ22fθ1θn2fθ2θ12fθ222fθ2θn2fθnθ12fθnθ22fθn2]Sn\begin{align*} \nabla f(\boldsymbol{\theta}) \coloneqq \begin{bmatrix} \dfrac{\partial f}{\partial \theta_1} \\[10pt] \vdots \\[6pt] \dfrac{\partial f}{\partial \theta_n} \end{bmatrix}\in\mathbb{R}^n &&\text{and}&& \mathbf{H}(\boldsymbol{\theta}) \coloneqq \begin{bmatrix} \dfrac{\partial^2 f}{\partial \theta_1^2} & \dfrac{\partial^2 f}{\partial \theta_1 \partial \theta_2} & \cdots & \dfrac{\partial^2 f}{\partial \theta_1 \partial \theta_n} \\[12pt] \dfrac{\partial^2 f}{\partial \theta_2 \partial \theta_1} & \dfrac{\partial^2 f}{\partial \theta_2^2} & \cdots & \dfrac{\partial^2 f}{\partial \theta_2 \partial \theta_n} \\[12pt] \vdots & \vdots & \ddots & \vdots \\[6pt] \dfrac{\partial^2 f}{\partial \theta_n \partial \theta_1} & \dfrac{\partial^2 f}{\partial \theta_n \partial \theta_2} & \cdots & \dfrac{\partial^2 f}{\partial \theta_n^2} \end{bmatrix}\in\mathbb{S}^n \end{align*}
  1. Descent direction: If there is a vector dRn\mathbf{d}\in\mathbb{R}^n such that f(θ0)d<0\nabla f(\boldsymbol{\theta}_0)^\top\mathbf{d}<0, then for all sufficiently small λ>0\lambda>0, f(θ0+λd)<f(θ0)f(\boldsymbol{\theta}_0+\lambda\mathbf{d})<f(\boldsymbol{\theta}_0). Here, d\mathbf{d} is a descent direction of ff at θ0\boldsymbol{\theta}_0.

    Intuitively, the dot product less than 0 means that vector d\mathbf{d} is in the opposite direction as the gradient.

  1. First-order necessary condition: If θ\boldsymbol{\theta}^* is a local minimum, then f(θ)=0\nabla f(\boldsymbol{\theta}^*)=0.
  1. Second-order necessary condition: If θ\boldsymbol{\theta}^* is a local minimum, then f(θ)=0\nabla f(\boldsymbol{\theta}^*)=0 and H(θ)0\mathbf{H}(\boldsymbol{\theta}^*)\succeq 0 (positive semi-definite).
  1. Second-order sufficient condition: If f(θ)=0\nabla f(\boldsymbol{\theta}^*)=0 and H(θ)0\mathbf{H}(\boldsymbol{\theta}^*)\succ 0 (positive definite), then θ\boldsymbol{\theta}^* is a local minimum.
  1. Convex function: ff is convex if and only if H(θ)0\mathbf{H}(\boldsymbol{\theta})\succeq 0 for all θ\boldsymbol{\theta}.
  1. Global minimum: Suppose ff is convex, then θ\boldsymbol{\theta}^\star is a global minimum if an only if f(θ)=0\nabla f(\boldsymbol{\theta}^\star)=0.

Linear Least Squares

Let ARm×n\mathbf{A}\in\mathbb{R}^{m\times n} be a matrix and bRm\mathbf{b}\in\mathbb{R}^m be a vector, where m>nm>n and rank(A)=n\mathrm{rank}(\mathbf{A})=n. The Linear Least Squares (LLS) optimization problem is to find the optimal solution xRn\mathbf{x}\in\mathbb{R}^n to the linear system of equations Ax=b\mathbf{A}\mathbf{x}=\mathbf{b}. Define the objective function as

f(x)=12Axb2f(\mathbf{x})=\frac{1}{2}\left\|\mathbf{Ax-b}\right\|^2

and the goal is to find x=arg minxRnf(x)\mathbf{x}^{\star}=\argmin_{\mathbf{x}\in\mathbb{R}^n}f(\mathbf{x}).

First, we can write out the gradient and Hessian matrix of ff, which are

f(x)=AAxAbH(x)=AA\begin{align*} \nabla f(\mathbf{x})=\mathbf{A^\top A}\mathbf{x}-\mathbf{A^\top b} && \mathbf{H}(\mathbf{x})=\mathbf{A^\top A} \end{align*}
  • Derivation

    Simply expand the objective function,

    f(x)=12(Axb)(Axb)=12(xAb)(Axb)=12(xAAxxAbbAx+bb)=12xAAxxAb+12bb\begin{align*} f(\mathbf{x})&=\frac{1}{2}\left(\mathbf{Ax-b}\right)^\top\left(\mathbf{Ax-b}\right)\\ &=\frac{1}{2}\left(\mathbf{x}^\top\mathbf{A}^\top-\mathbf{b}^\top\right)\left(\mathbf{Ax-b}\right)\\ &=\frac{1}{2}\left(\mathbf{x^\top A^\top Ax}-\mathbf{x^\top A^\top b}-\mathbf{b^\top Ax}+\mathbf{b^\top b}\right)\\ &=\frac{1}{2}\mathbf{x^\top A^\top Ax}-\mathbf{x^\top A^\top b}+\frac{1}{2}\mathbf{b^\top b} \end{align*}

    Note that the last step is true because bAxRb^\top Ax\in\mathbb{R} and bAx=(bAx)=xAb\mathbf{b^\top Ax=(b^\top Ax)^\top=x^\top A^\top b}.

Further notice that because of rank(A)=n\mathrm{rank}(\mathbf{A})=n, the Hessian matrix is positive definite (and thus invertible), i.e., AA0\mathbf{A^\top A}\succ 0. Then x\mathbf{x}^\star is the global minimum if and only if f(x)=0\nabla f(\mathbf{x}^\star)=0,

AAxAb=0    x=(AA)1Ab\begin{align*} \mathbf{A^\top A x}^\star-\mathbf{A^\top b}=0 && \implies &&\boxed{\mathbf{x^\star}=\left(\mathbf{A^\top A}\right)^{-1}\mathbf{A^\top b}} \end{align*}

Although the above equation gives a closed-form formula for finding the global minimum x\mathbf{x}^\star, computing the inverse of AA\mathbf{A^\top A} is not practical. Instead, we solve a linear system of equations

(AA)x=Ab\begin{equation} \left(\mathbf{A^\top A}\right)\mathbf{x}^\star = \mathbf{A^\top b} \end{equation}

which is known as the normal equations.

Therefore, solving an LLS problem is basically solving a linear system of equations. Typical methods include Cholesky decomposition and QR factorization, rather than directly inverting AA\mathbf{A^\top A}.

Cholesky Solver

The Cholesky decomposition of a positive-definite matrix ARn×n\mathbf{A}\in\mathbb{R}^{n\times n} is a decomposition into the product of an upper triangular matrix and a lower triangular matrix, i.e., A=LL\mathbf{A}=\mathbf{LL^\top}, where

A=[a11a12a1na21a22a2nan1an2ann]L=[l1100l21l220ln1ln2lnn]L=[l11l21ln10l22ln200lnn]\begin{align*} \mathbf{A}=\begin{bmatrix}a_{11} & a_{12} & \cdots & a_{1n}\\a_{21} & a_{22} & \cdots & a_{2n}\\\vdots & \vdots & \ddots & \vdots\\a_{n1} & a_{n2} & \cdots & a_{nn}\end{bmatrix} &&\mathbf{L}=\begin{bmatrix}l_{11} & 0 & \cdots & 0\\l_{21} & l_{22} & \cdots & 0\\\vdots & \vdots & \ddots & \vdots\\l_{n1} & l_{n2} & \cdots & l_{nn}\end{bmatrix} &&\mathbf{L^\top}=\begin{bmatrix}l_{11} & l_{21} & \cdots & l_{n1}\\0 & l_{22} & \cdots & l_{n2}\\\vdots & \vdots & \ddots & \vdots\\0 & 0 & \cdots & l_{nn}\end{bmatrix} \end{align*}

The elements of L\mathbf{L} can be obtained as follows

lii=aiik=1i1lik2,i=1,,nlji=1lii(ajik=1i1ljklik),j=i+1,,n\begin{align*}l_{ii} &= \sqrt{a_{ii}-\sum_{k=1}^{i-1} l_{ik}^2}, \qquad i=1,\dots,n\\l_{ji} &= \frac{1}{l_{ii}}\left(a_{ji}-\sum_{k=1}^{i-1} l_{jk}l_{ik}\right), \qquad j=i+1,\dots,n\end{align*}

For a linear system of equations, Ax=b\mathbf{Ax}=\mathbf{b}, apply Cholesky decomposition and get LLx=b\mathbf{LL^\top x=b}. Then, we first let y=Lx\mathbf{y}=\mathbf{L^\top x}, and solve Ly=b\mathbf{Ly=b} for y\mathbf{y} using forward substitution (y1y2yny_1\to y_2\to\cdots \to y_n). Specifically,

Ly=b    [l1100l21l220ln1ln2lnn][y1y2yn]=[b1b2bn]\begin{align*} \mathbf{Ly=b} && \implies && \begin{bmatrix}l_{11} & 0 & \cdots & 0\\l_{21} & l_{22} & \cdots & 0\\\vdots & \vdots & \ddots & \vdots\\l_{n1} & l_{n2} & \cdots & l_{nn}\end{bmatrix} \begin{bmatrix} y_1\\y_2\\ \vdots \\y_n \end{bmatrix}= \begin{bmatrix} b_1\\ b_2\\ \vdots \\b_n \end{bmatrix} \end{align*}

Note that y1y_1 can be easily found as y1=b1/l11y_1 = b_1/l_{11}, and substituting y1y_1 into the second equation solves y2y_2. In general,

yi=1lii(bij=1i1lijyj),i=1,,n\begin{align*} y_i &= \frac{1}{l_{ii}}\left(b_i - \sum_{j=1}^{i-1} l_{ij}y_j\right),\qquad i=1,\dots,n \end{align*}

Finally, solve Lx=y\mathbf{L^\top x}=\mathbf{y} for x\mathbf{x} using backward substitution (xnxn1x1x_n\to x_{n-1}\to\cdots\to x_1). Specifically,

Lx=y    [l11l21ln10l22ln200lnn][x1x2xn]=[y1y2yn]\begin{align*} \mathbf{L^\top x=y} && \implies && \begin{bmatrix} l_{11} & l_{21} & \cdots & l_{n1}\\ 0 & l_{22} & \cdots & l_{n2}\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & l_{nn} \end{bmatrix} \begin{bmatrix} x_1\\x_2\\ \vdots \\x_n \end{bmatrix} = \begin{bmatrix} y_1\\ y_2\\ \vdots \\y_n \end{bmatrix} \end{align*}

Now, xnx_n can be easily found as xn=yn/lnnx_n=y_n/l_{nn}, and substituting xnx_n into the second to last equation solves xn1x_{n-1}. In general,

xi=1lii(yij=i+1nljixj),i=n,,1\begin{align*}x_i &= \frac{1}{l_{ii}}\left(y_i - \sum_{j=i+1}^{n} l_{ji}x_j\right),\qquad i=n,\dots,1\end{align*}

In practice, we can use numpy.linalg.cholesky or similar packages to perform Cholesky decomposition. The time complexity is O(n3)O(n^3).

QR Solver

The QR factorization of a matrix ARn×n\mathbf{A}\in\mathbb{R}^{n\times n} is a decomposition into an orthogonal matrix and an upper triangular matrix, i.e., A=QR\mathbf{A}=\mathbf{QR}, where

Q=[qij],QQ=InR=[r11r21rn10r22rn200rnn]\begin{align*} \mathbf{Q}=[q_{ij}], \,\,\mathbf{QQ^\top}=\mathbf{I}_n && \mathbf{R}=\begin{bmatrix}r_{11} & r_{21} & \cdots & r_{n1}\\0 & r_{22} & \cdots & r_{n2}\\\vdots & \vdots & \ddots & \vdots\\0 & 0 & \cdots & r_{nn}\end{bmatrix} \end{align*}

where the element of Q\mathbf{Q} can be found via Gram-Schmidt diagonalization.

For a linear system of equations, Ax=b\mathbf{Ax}=\mathbf{b}, apply QR factorization and get QRx=b\mathbf{QRx=b}. Multiply both sides by Q\mathbf{Q}^\topand get Rx=Qb\mathbf{Rx}=\mathbf{Q^\top b}. Let y=Qb\mathbf{y}=\mathbf{Q^\top b}, we can solve Rx=y\mathbf{Rx=y} for x\mathbf{x} by backward substitution, i.e.,

xi=1rii(yij=i+1nrijxj),i=n1,,1\begin{align*}x_i &= \frac{1}{r_{ii}}\left(y_i - \sum_{j=i+1}^{n} r_{ij}x_j\right), \qquad i=n-1,\dots,1\end{align*}

Non-linear Least Squares

In general, let r:RnRm,mn\mathbf{r}:\mathbb{R}^n\to \mathbb{R}^m,m\geq n be a smooth residual function. We can define an objective function f:RnRf:\mathbb{R}^n\to\mathbb{R} as

f(x)=12r(x)2=i=1mri2(x)f(\mathbf{x})=\frac{1}{2}\left\|\mathbf{r}(\mathbf{x})\right\|^2=\sum_{i=1}^{m}r_i^2(\mathbf{x})

The goal is still to find the minimizer of the function ff, i.e., x=arg minxRnf(x)\mathbf{x}^{\star}=\argmin_{\mathbf{x}\in\mathbb{R}^n}f(\mathbf{x}). Unfortunately, there’s no closed-form formula for finding x\mathbf{x}^\star as we did for the LLS problem. Common methods usually start from an initial guess, and iteratively approximate the optimum based on “descent direction” mentioned before.

Specifically, if we start with an initial guess x0\mathbf{x}^0, linearize around x0\mathbf{x}^0, and find a descending direction d\mathbf{d}. Then, we update the guess by moving along direction d\mathbf{d} , and we can expect the objective will become smaller, i.e., xk+1=xk+d\mathbf{x}^{k+1 }=\mathbf{x}^k+\mathbf{d} and f(xk+1)<f(xk)f(\mathbf{x}^{k+1})<f(\mathbf{x}^k). Hopefully, this process will converge to a local minimum. Common methods include Gauss-Newton, Levenberg-Marquardt, or gradient descent, distinguished by how they compute d\mathbf{d}.

Gauss-Newton Method

The steps of the Gauss-Newton method are as follows:

  1. Start from an initial guess x0\mathbf{x}^0, iterate until convergence k=0,1,2,k=0, 1, 2, \dots
  1. Linearize the residual at the current guess xk\mathbf{x}^k using first-order Taylor expansion,
    r(xk+d)=r(xk)+J(xk)d\mathbf{r}(\mathbf{x}^k+\mathbf{d})=\mathbf{r}(\mathbf{x}^k)+\mathbf{J}(\mathbf{x}^k)\mathbf{d}

    where

    J(x):=r(x)x=[r1x1r1x2r1xnr2x1r2x2r2xnrmx1rmx2rmxn]Rm×n\begin{equation*} \mathbf{J}(\mathbf{x}) := \frac{\partial \mathbf{r(x)}}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial r_1}{\partial x_1} & \dfrac{\partial r_1}{\partial x_2} & \cdots & \dfrac{\partial r_1}{\partial x_n}\\[12pt] \dfrac{\partial r_2}{\partial x_1} & \dfrac{\partial r_2}{\partial x_2} & \cdots & \dfrac{\partial r_2}{\partial x_n} \\[6pt] \vdots & \vdots & \ddots & \vdots \\[6pt] \dfrac{\partial r_m}{\partial x_1} & \dfrac{\partial r_m}{\partial x_2} & \cdots & \dfrac{\partial r_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n} \end{equation*}

    is the Jacobian matrix. Let JkJ(xk)\mathbf{J}_k\coloneqq \mathbf{J}(\mathbf{x}^k).

  1. Solve the resulting linear least squares to find the step d=arg mindr(xk)+Jkd2\mathbf{d}^{\star}=\argmin_\mathbf{d}\left\|\mathbf{r}(\mathbf{x}^k)+\mathbf{J}_k\mathbf{d}\right\|^2 by solving
    (JkJk)d=Jkr(xk)\left(\mathbf{J}_k^\top \mathbf{J}_k\right) \mathbf{d}=-\mathbf{J}_k^\top \mathbf{r}\left(\mathbf{x}^k\right)

    As discussed before, this is a linear system of equations, and should be solved by using Cholesky or QR decomposition rather than inverting JkJk\mathbf{J}_k^\top\mathbf{J}_k.

  1. Update the current guess, xk+1=xk+d\mathbf{x}^{k+1}=\mathbf{x}^k+\mathbf{d}.

Implementation example:

def gauss_newton(x0, max_iter, tol):
    x = x0.copy()
    for _ in range(max_iter):
        r = residual(x)
        J = jacobian(x)
        d = np.linalg.solve(J.T @ J, -J.T @ r) # Solve (J^T J) d = -J^T r
        x = x + d
        if np.linalg.norm(d) < tol:
            break
    return x

Gradient Descent

The steps of the gradient descent method are as follows:

  1. Start from an initial guess x0\mathbf{x}^0, iterate until convergence k=0,1,2,k=0, 1, 2, \dots
  1. Compute the gradient of the objective function
    f(xk)=Jkr(xk)\nabla f(\mathbf{x}^k)=\mathbf{J}_k^\top \mathbf{r}(\mathbf{x}^k)
    • Derivation

      Note that

      f(x)=12r(x)2=12i=1mri2(x)2f(\mathbf{x})=\frac{1}{2}\left\|\mathbf{r}(\mathbf{x})\right\|^2=\frac{1}{2}\sum_{i=1}^{m}r_i^2\left(\mathbf{x}\right)^2

      and the jj-th entry of f\nabla f is

      fxj=i=1mri(x)rixj=Jk,jr(x)\frac{\partial f}{\partial x_j}=\sum_{i=1}^{m}r_i\left(\mathbf{x}\right)\frac{\partial r_i}{\partial x_j}=\mathbf{J}_{k,j}^\top\mathbf{r}(\mathbf{x})

      Then, stack all the entries of f\nabla f and we can get the result f(xk)=Jkr(xk)\nabla f(\mathbf{x}^k)=\mathbf{J}_k^\top \mathbf{r}(\mathbf{x}^k).

  1. Let the step d\mathbf{d} be the negative of the gradient, scaled by a factor called the learning rate η\eta, i.e., d=ηJkr(xk)\mathbf{d}=-\eta \mathbf{J}_k^\top \mathbf{r}(\mathbf{x}^k).
  1. Update the current guess, xk+1=xk+d\mathbf{x}^{k+1}=\mathbf{x}^k+\mathbf{d}.

Implementation example:

def gradient_descent(x0, max_iter, tol, eta):
    x = x0.copy()
    for _ in range(max_iter):
        r = residual(x)
        J = jacobian(x)
        d = -eta * J.T @ r # Directly set d to be along the opposite gradient
        x = x + d
        if np.linalg.norm(d) < tol:
            break
    return x

Levenberg-Marquardt Method

The step of the Levenberg-Marquardt method is as follows:

  1. Start from an initial guess x0\mathbf{x}^0, iterate until convergence k=0,1,2,k=0, 1, 2, \dots
  1. Linearize the residual at the current guess r(xk+d)=r(xk)+Jkd\mathbf{r}\left(\mathbf{x}^k+\mathbf{d}\right)=\mathbf{r}\left(\mathbf{x}^k\right)+\mathbf{J}_k\mathbf{d}.
  1. Solve a 2\ell_2 regularized linear least squares problem by d=arg mindr(xk)+Jkd2+λd2\mathbf{d}^\star=\argmin_{\mathbf{d}}\left\|\mathbf{r}\left(\mathbf{x}^k\right)+\mathbf{J}_k\mathbf{d}\right\|^2+\lambda \left\|\mathbf{d}\right\|^2.
    (JkJk+λI)d=Jkr(xk)\left(\mathbf{J}_k^\top \mathbf{J}_k+\lambda \mathbf{I}\right) \mathbf{d}=-\mathbf{J}_k^\top \mathbf{r}\left(\mathbf{x}^k\right)
    • Derivation

      Denote rkr(xk)\mathbf{r}_k\coloneqq \mathbf{r}\left(\mathbf{x}^k\right), the new objective function is

      g(d)=(rk+Jkd)(rk+Jkd)+λdd=rkrk+rkJkd+dJkrk+dJkJkd+λdd=rkrk+2rkJkd+dJkJkd+λdd\begin{align*} g(\mathbf{d}) &= (\mathbf{r}_k + \mathbf{J}_k \mathbf{d})^\top(\mathbf{r}_k + \mathbf{J}_k \mathbf{d}) + \lambda\,\mathbf{d}^\top\mathbf{d}\\ &= \mathbf{r}_k^\top\mathbf{r}_k + \mathbf{r}_k^\top\mathbf{J}_k\mathbf{d} + \mathbf{d}^\top\mathbf{J}_k^\top\mathbf{r}_k + \mathbf{d}^\top\mathbf{J}_k^\top\mathbf{J}_k\mathbf{d} + \lambda\mathbf{d}^\top\mathbf{d}\\ &= \mathbf{r}_k^\top\mathbf{r}_k + 2\,\mathbf{r}_k^\top\mathbf{J}_k\mathbf{d} + \mathbf{d}^\top\mathbf{J}_k^\top\mathbf{J}_k\mathbf{d} + \lambda\,\mathbf{d}^\top\mathbf{d} \end{align*}

      Take the gradient of gg, and let g(d)=0\nabla g(\mathbf{d})=\mathbf{0}

      g(d)=2Jkrk+2JkJkd+2λd=0    (JkJk+λI)d=Jkr(xk)\begin{align*} \nabla g(\mathbf{d}) = 2\,\mathbf{J}_k^\top\mathbf{r}_k + 2\,\mathbf{J}_k^\top\mathbf{J}_k\,\mathbf{d} + 2\lambda\,\mathbf{d} = \mathbf{0} && \implies && \boxed{(\mathbf{J}_k^\top\mathbf{J}_k + \lambda\mathbf{I})\mathbf{d} = -\mathbf{J}_k^\top\mathbf{r}(\mathbf{x}^k)} \end{align*}
  1. Update the current guess, xk+1=xk+d\mathbf{x}^{k+1}=\mathbf{x}^k+\mathbf{d}.

Implementation example:

def levenberg_marquardt(x0, max_iter, tol, lam):
    x = x0.copy()
    for _ in range(max_iter):
        r = residual(x)
        J = jacobian(x)
        A = J.T @ J + lam * np.eye(len(x))
        d = np.linalg.solve(A, -J.T @ r) # Solve (J^T J + λI) d = -J^T r
        x = x + d
        if np.linalg.norm(d) < tol:
            break
    return x

Note that the Levenberg-Marquardt method is an interpolation between Gauss-Newton and gradient descent.

  • As λ0\lambda\to 0, it reduces to Gauss-Newton: (JkJk+λI)JkJk\left(\mathbf{J}_k^\top \mathbf{J}_k+\lambda \mathbf{I}\right) \to \mathbf{J}_k^\top \mathbf{J}_k.
  • As λ\lambda \to \infty, it reduces to gradient descent: (JkJk+λI)λI\left(\mathbf{J}_k^\top \mathbf{J}_k+\lambda \mathbf{I}\right) \to \lambda \mathbf{I}.

Example

Now we compare these three methods using an example. Consider the residual and objective as

r(x)=[1x110(x2x12)],f(x)=12[(1x1)2+100(x2x12)2]\begin{align*} \mathbf{r}(\mathbf{x})=\begin{bmatrix} 1-x_1\\ 10(x_2-x_1^2) \end{bmatrix}, && f(\mathbf{x})=\frac{1}{2}\left[(1-x_1)^2+100(x_2-x_1^2)^2\right] \end{align*}

The Jacobian is

J(x)=[1020x110]\mathbf{J}(\mathbf{x})=\begin{bmatrix} -1 & 0\\ -20x_1 & 10 \end{bmatrix}

We can visualize the steps taken by each of the methods

The convergence rates of these three methods are illustrated in the following figure.

with qualitative summary (max iteration set to 1000 and tolerance to 1e-12).

  Gauss-Newton:               3 iterations → f = 0.00e+00
  Levenberg-Marquardt:      144 iterations → f = 2.50e-24
  Gradient Descent:        1000 iterations → f = 2.27e-01

Categories: Mathematics