Optimization Part 1 Basics and Least Squares


Published: March 31, 2026

In common engineering problems, the optimization problem is to minimize the residual of a parametric model with respect to some observed data points.

Preliminaries

Suppose we are given $N$ measurements $\left\{\left(\mathbf{x}_i, \mathbf{y}_i\right)\right\}_{i=1}^N$ , where $\mathbf{x}_i\in\mathbb{R}^p$ and $\mathbf{y}_i\in\mathbb{R}^{m_i}$ . We define a prediction model parametrized by $\boldsymbol{\theta}\in\mathbb{R}^n$ as $\mathbf{h}\left(\cdot;\boldsymbol{\theta}\right):\mathbb{R}^p\to\mathbb{R}^{m_i}$ and the residual of a single data point as $\mathbf{r}_i:\mathbb{R}^n\to\mathbb{R}^{m_i}$ , where $\mathbf{r}_i\left(\boldsymbol{\theta}\right)\coloneqq \mathbf{h}\left(\mathbf{x}_i;\boldsymbol{\theta}\right)-\mathbf{y}_i$ . The full residual $\mathbf{r}:\mathbb{R}^{n}\to\mathbb{R}^m$ is simply the stack of all individual residuals, i.e.,

\begin{align*} \mathbf{r}\left(\boldsymbol{\theta}\right)= \begin{bmatrix} \mathbf{r}_1\left(\boldsymbol{\theta}\right)\\ \mathbf{r}_2\left(\boldsymbol{\theta}\right)\\ \vdots \\ \mathbf{r}_N\left(\boldsymbol{\theta}\right)\\ \end{bmatrix} \in\mathbb{R}^m, && m=\sum_{i=1}^{N}m_i \end{align*}

Note that we require $m\geq n$ because otherwise the model is underdetermined, i.e., there are generally infinitely many $\boldsymbol{\theta}$ that fit the data perfectly.

The goal of an optimization problem is to find the model that best fits the data. To formalize this, we define a function $f:\mathbb{R}^n\to\mathbb{R}_{\geq 0}$ , known as the objective function, that maps the parameters (through the residual vector) to a scalar to be minimized. In general, $f$ is assumed to be continuously differentiable, although the $\ell_1$ -regularized objective listed here is not differentiable wherever some $\theta_j=0$ . Some common objective functions are

Least Squares

\begin{equation*} f(\boldsymbol{\theta}) = \frac{1}{2}\|\mathbf{r}(\boldsymbol{\theta})\|^2 = \frac{1}{2}\sum_{i=1}^{N}\|\mathbf{r}_i(\boldsymbol{\theta})\|^2 \end{equation*}

Least Squares with $\ell_1$ regularization

\begin{equation*} f(\boldsymbol{\theta}) = \frac{1}{2}\|\mathbf{r}(\boldsymbol{\theta})\|^2 + \lambda\|\boldsymbol{\theta}\|_1 = \frac{1}{2}\sum_{i=1}^{N}\|\mathbf{r}_i(\boldsymbol{\theta})\|^2 + \lambda\sum_{j=1}^{n}|\theta_j| \end{equation*}

Least Squares with $\ell_2$ regularization

\begin{equation*} f(\boldsymbol{\theta}) = \frac{1}{2}\|\mathbf{r}(\boldsymbol{\theta})\|^2 + \frac{\lambda}{2}\|\boldsymbol{\theta}\|^2= \frac{1}{2}\sum_{i=1}^{N}\|\mathbf{r}_i(\boldsymbol{\theta})\|^2 + \frac{\lambda}{2}\sum_{j=1}^{n}\theta_j^2 \end{equation*}

Weighted Least Squares

\begin{equation*} f(\boldsymbol{\theta}) = \frac{1}{2}\mathbf{r}(\boldsymbol{\theta})^\top \mathbf{W}\mathbf{r}(\boldsymbol{\theta}) = \frac{1}{2}\sum_{i=1}^{N} \mathbf{r}_i(\boldsymbol{\theta})^\top w_i\mathbf{r}_i(\boldsymbol{\theta}) \end{equation*}

Cauchy Loss

\begin{align*} f(\boldsymbol{\theta}) = \sum_{i=1}^{N} \rho\!\left(\|\mathbf{r}_i (\boldsymbol{\theta})\|\right), && \rho(r) = \frac{\alpha^2}{2}\log\!\left(1 + \frac{r^2}{\alpha^2}\right) \end{align*}
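As a rough illustration, the objectives listed here can be evaluated with NumPy once the stacked residual vector is available. The function names are hypothetical, and the weighted and Cauchy cases assume scalar residuals ( $m_i=1$ ) so that each residual gets one weight.

```python
import numpy as np

def least_squares(r):
    # f = 1/2 * ||r||^2
    return 0.5 * r @ r

def least_squares_l1(r, theta, lam):
    # Adds lam * ||theta||_1
    return 0.5 * r @ r + lam * np.sum(np.abs(theta))

def least_squares_l2(r, theta, lam):
    # Adds lam/2 * ||theta||^2
    return 0.5 * r @ r + 0.5 * lam * theta @ theta

def weighted_least_squares(r, w):
    # Assumes W = diag(w), one weight per scalar residual
    return 0.5 * np.sum(w * r**2)

def cauchy_loss(r, alpha):
    # rho(r) = alpha^2 / 2 * log(1 + r^2 / alpha^2), summed over scalar residuals
    return np.sum(0.5 * alpha**2 * np.log1p(r**2 / alpha**2))

# Illustrative values only
r = np.array([3.0, -4.0])
theta = np.array([1.0, -2.0])
print(least_squares(r), cauchy_loss(r, alpha=1.0))
```

Because $\log(1+u)<u$ for $u>0$ , the Cauchy loss of a nonzero residual is always smaller than its least-squares loss, which is why it is less sensitive to outliers.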

The following are some standard results on optimality conditions. Suppose $f$ is twice continuously differentiable, and define the gradient (first-order derivative) and Hessian (second-order derivative) as

\begin{align*} \nabla f(\boldsymbol{\theta}) \coloneqq \begin{bmatrix} \dfrac{\partial f}{\partial \theta_1} \\[10pt] \vdots \\[6pt] \dfrac{\partial f}{\partial \theta_n} \end{bmatrix}\in\mathbb{R}^n &&\text{and}&& \mathbf{H}(\boldsymbol{\theta}) \coloneqq \begin{bmatrix} \dfrac{\partial^2 f}{\partial \theta_1^2} & \dfrac{\partial^2 f}{\partial \theta_1 \partial \theta_2} & \cdots & \dfrac{\partial^2 f}{\partial \theta_1 \partial \theta_n} \\[12pt] \dfrac{\partial^2 f}{\partial \theta_2 \partial \theta_1} & \dfrac{\partial^2 f}{\partial \theta_2^2} & \cdots & \dfrac{\partial^2 f}{\partial \theta_2 \partial \theta_n} \\[12pt] \vdots & \vdots & \ddots & \vdots \\[6pt] \dfrac{\partial^2 f}{\partial \theta_n \partial \theta_1} & \dfrac{\partial^2 f}{\partial \theta_n \partial \theta_2} & \cdots & \dfrac{\partial^2 f}{\partial \theta_n^2} \end{bmatrix}\in\mathbb{S}^n \end{align*}

Descent direction: If there is a vector $\mathbf{d}\in\mathbb{R}^n$ such that $\nabla f(\boldsymbol{\theta}_0)^\top\mathbf{d}<0$ , then for all sufficiently small $\lambda>0$ , $f(\boldsymbol{\theta}_0+\lambda\mathbf{d})<f(\boldsymbol{\theta}_0)$ . Such a $\mathbf{d}$ is called a descent direction of $f$ at $\boldsymbol{\theta}_0$ .
Intuitively, a negative dot product means that $\mathbf{d}$ makes an angle of more than $90^\circ$ with the gradient, so it points at least partly opposite to it. The negative gradient $-\nabla f(\boldsymbol{\theta}_0)$ is one such direction whenever the gradient is nonzero.
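The descent-direction property can be checked numerically. The sketch below uses a simple quadratic chosen only for illustration (it does not appear elsewhere in this post) and verifies that small steps along the negative gradient reduce $f$ .

```python
import numpy as np

def f(theta):
    # Illustrative convex quadratic, not taken from the post
    return 0.5 * (theta[0]**2 + 10.0 * theta[1]**2)

def grad_f(theta):
    return np.array([theta[0], 10.0 * theta[1]])

theta0 = np.array([1.0, 1.0])
g = grad_f(theta0)
d = -g            # negative gradient, so g @ d = -||g||^2 < 0
slope = g @ d     # directional derivative of f along d

# f should decrease for each of these small step sizes
steps = [1e-1, 1e-2, 1e-3]
decreases = [f(theta0 + lam * d) < f(theta0) for lam in steps]
print(slope, decreases)
```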

First-order necessary condition: If $\boldsymbol{\theta}^*$ is a local minimum, then $\nabla f(\boldsymbol{\theta}^*)=0$ .

Second-order necessary condition: If $\boldsymbol{\theta}^*$ is a local minimum, then $\nabla f(\boldsymbol{\theta}^*)=0$ and $\mathbf{H}(\boldsymbol{\theta}^*)\succeq 0$ (positive semi-definite).

Second-order sufficient condition: If $\nabla f(\boldsymbol{\theta}^*)=0$ and $\mathbf{H}(\boldsymbol{\theta}^*)\succ 0$ (positive definite), then $\boldsymbol{\theta}^*$ is a local minimum.

Convex function: $f$ is convex if and only if $\mathbf{H}(\boldsymbol{\theta})\succeq 0$ for all $\boldsymbol{\theta}$ .

Global minimum: Suppose $f$ is convex. Then $\boldsymbol{\theta}^*$ is a global minimum if and only if $\nabla f(\boldsymbol{\theta}^*)=0$ .
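As a small worked check of these conditions, consider the quadratic $f(\boldsymbol{\theta})=\frac{1}{2}\boldsymbol{\theta}^\top\mathbf{H}\boldsymbol{\theta}-\mathbf{c}^\top\boldsymbol{\theta}$ , whose gradient is $\mathbf{H}\boldsymbol{\theta}-\mathbf{c}$ and whose Hessian is the constant matrix $\mathbf{H}$ . The particular $\mathbf{H}$ and $\mathbf{c}$ below are made up for illustration.

```python
import numpy as np

# Illustrative quadratic: f(theta) = 1/2 theta^T H theta - c^T theta
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])
c = np.array([3.0, 0.0])

def grad(theta):
    return H @ theta - c

# Stationary point: solve grad(theta) = 0
theta_star = np.linalg.solve(H, c)

# Positive definiteness via the eigenvalues of the symmetric Hessian
eigvals = np.linalg.eigvalsh(H)
is_pd = bool(np.all(eigvals > 0))

print(theta_star, grad(theta_star), eigvals, is_pd)
```

Since the Hessian is positive definite everywhere, $f$ is convex and the stationary point is the global minimum.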

Linear Least Squares

Let $\mathbf{A}\in\mathbb{R}^{m\times n}$ be a matrix and $\mathbf{b}\in\mathbb{R}^m$ be a vector, where $m>n$ and $\mathrm{rank}(\mathbf{A})=n$ . Because the linear system $\mathbf{A}\mathbf{x}=\mathbf{b}$ is overdetermined, it generally has no exact solution. The Linear Least Squares (LLS) problem is to find the $\mathbf{x}\in\mathbb{R}^n$ that satisfies the system as closely as possible in the least-squares sense. Define the objective function as

\begin{equation*} f(\mathbf{x})=\frac{1}{2}\left\|\mathbf{Ax-b}\right\|^2 \end{equation*}

and the goal is to find $\mathbf{x}^{\star}=\argmin_{\mathbf{x}\in\mathbb{R}^n}f(\mathbf{x})$ .

First, we can write out the gradient and Hessian matrix of $f$ , which are

\begin{align*} \nabla f(\mathbf{x})=\mathbf{A^\top A}\mathbf{x}-\mathbf{A^\top b} && \mathbf{H}(\mathbf{x})=\mathbf{A^\top A} \end{align*}

Derivation
Simply expand the objective function,
$\begin{align*} f(\mathbf{x})&=\frac{1}{2}\left(\mathbf{Ax-b}\right)^\top\left(\mathbf{Ax-b}\right)\\ &=\frac{1}{2}\left(\mathbf{x}^\top\mathbf{A}^\top-\mathbf{b}^\top\right)\left(\mathbf{Ax-b}\right)\\ &=\frac{1}{2}\left(\mathbf{x^\top A^\top Ax}-\mathbf{x^\top A^\top b}-\mathbf{b^\top Ax}+\mathbf{b^\top b}\right)\\ &=\frac{1}{2}\mathbf{x^\top A^\top Ax}-\mathbf{x^\top A^\top b}+\frac{1}{2}\mathbf{b^\top b} \end{align*}$
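Note that the last step holds because $\mathbf{b^\top Ax}\in\mathbb{R}$ , so $\mathbf{b^\top Ax=(b^\top Ax)^\top=x^\top A^\top b}$ . Differentiating term by term with respect to $\mathbf{x}$ , and using the symmetry of $\mathbf{A^\top A}$ , gives $\nabla f(\mathbf{x})=\mathbf{A^\top A}\mathbf{x}-\mathbf{A^\top b}$ . Differentiating once more gives $\mathbf{H}(\mathbf{x})=\mathbf{A^\top A}$ .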
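A quick way to sanity-check the gradient formula is to compare it with central finite differences. The sketch below does this on random data. The sizes, seed, and step `eps` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))   # m = 6, n = 3
b = rng.standard_normal(6)
x = rng.standard_normal(3)

def f(x):
    # f(x) = 1/2 * ||A x - b||^2
    return 0.5 * np.sum((A @ x - b)**2)

# Analytic gradient from the derivation
grad_analytic = A.T @ A @ x - A.T @ b

# Central finite differences along each coordinate direction
eps = 1e-6
grad_numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(3)
])

print(np.max(np.abs(grad_analytic - grad_numeric)))
```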

Further notice that because $\mathrm{rank}(\mathbf{A})=n$ , the Hessian matrix is positive definite (and thus invertible), i.e., $\mathbf{A^\top A}\succ 0$ . Hence $f$ is convex, and $\mathbf{x}^\star$ is the global minimum if and only if $\nabla f(\mathbf{x}^\star)=0$ ,

\begin{align*} \mathbf{A^\top A x}^\star-\mathbf{A^\top b}=0 && \implies &&\boxed{\mathbf{x^\star}=\left(\mathbf{A^\top A}\right)^{-1}\mathbf{A^\top b}} \end{align*}

Although the above equation gives a closed-form formula for the global minimum $\mathbf{x}^\star$ , explicitly computing the inverse of $\mathbf{A^\top A}$ is more expensive and less numerically stable than necessary. Instead, we solve the linear system of equations

\begin{equation} \left(\mathbf{A^\top A}\right)\mathbf{x}^\star = \mathbf{A^\top b} \end{equation}

which is known as the normal equations.

Therefore, solving an LLS problem is basically solving a linear system of equations. Typical methods include Cholesky decomposition and QR factorization, rather than directly inverting $\mathbf{A^\top A}$ .
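For instance, the normal equations can be solved with `np.linalg.solve` and compared against NumPy's built-in least-squares routine `np.linalg.lstsq`. The random data below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))   # tall matrix, full column rank with probability 1
b = rng.standard_normal(20)

# Solve the normal equations (A^T A) x = A^T b without forming an inverse
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Reference solution from NumPy's least-squares solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# The gradient A^T (A x - b) should vanish at the minimizer
grad = A.T @ (A @ x_normal - b)
print(x_normal, x_lstsq, np.linalg.norm(grad))
```

One caveat worth keeping in mind is that the condition number of $\mathbf{A^\top A}$ is the square of that of $\mathbf{A}$ , so for ill-conditioned problems a method that works on $\mathbf{A}$ directly is usually preferable.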

Cholesky Solver

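The Cholesky decomposition of a symmetric positive-definite matrix $\mathbf{A}\in\mathbb{R}^{n\times n}$ is a decomposition into the product of a lower triangular matrix and its transpose (an upper triangular matrix), i.e., $\mathbf{A}=\mathbf{LL^\top}$ . In this subsection and the next, $\mathbf{A}$ and $\mathbf{b}$ denote a generic square system. For the normal equations they play the roles of $\mathbf{A^\top A}$ and $\mathbf{A^\top b}$ . The matrices are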

\begin{align*} \mathbf{A}=\begin{bmatrix}a_{11} & a_{12} & \cdots & a_{1n}\\a_{21} & a_{22} & \cdots & a_{2n}\\\vdots & \vdots & \ddots & \vdots\\a_{n1} & a_{n2} & \cdots & a_{nn}\end{bmatrix} &&\mathbf{L}=\begin{bmatrix}l_{11} & 0 & \cdots & 0\\l_{21} & l_{22} & \cdots & 0\\\vdots & \vdots & \ddots & \vdots\\l_{n1} & l_{n2} & \cdots & l_{nn}\end{bmatrix} &&\mathbf{L^\top}=\begin{bmatrix}l_{11} & l_{21} & \cdots & l_{n1}\\0 & l_{22} & \cdots & l_{n2}\\\vdots & \vdots & \ddots & \vdots\\0 & 0 & \cdots & l_{nn}\end{bmatrix} \end{align*}

The elements of $\mathbf{L}$ can be obtained as follows

\begin{align*}l_{ii} &= \sqrt{a_{ii}-\sum_{k=1}^{i-1} l_{ik}^2}, \qquad i=1,\dots,n\\l_{ji} &= \frac{1}{l_{ii}}\left(a_{ji}-\sum_{k=1}^{i-1} l_{jk}l_{ik}\right), \qquad j=i+1,\dots,n\end{align*}
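The two formulas translate almost line by line into code. The following is a minimal, unoptimized sketch (the function name `cholesky_lower` is made up here), checked against `np.linalg.cholesky` on a small symmetric positive-definite matrix.

```python
import numpy as np

def cholesky_lower(A):
    """Return lower-triangular L with A = L @ L.T (A symmetric positive definite)."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for i in range(n):
        # Diagonal entry: l_ii = sqrt(a_ii - sum_k l_ik^2)
        L[i, i] = np.sqrt(A[i, i] - np.sum(L[i, :i]**2))
        # Entries below the diagonal in column i
        for j in range(i + 1, n):
            L[j, i] = (A[j, i] - np.sum(L[j, :i] * L[i, :i])) / L[i, i]
    return L

A = np.array([[4.0, 12.0, -16.0],
              [12.0, 37.0, -43.0],
              [-16.0, -43.0, 98.0]])
L = cholesky_lower(A)
print(L)
```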

For a linear system of equations, $\mathbf{Ax}=\mathbf{b}$ , apply Cholesky decomposition and get $\mathbf{LL^\top x=b}$ . Then, we first let $\mathbf{y}=\mathbf{L^\top x}$ , and solve $\mathbf{Ly=b}$ for $\mathbf{y}$ using forward substitution ( $y_1\to y_2\to\cdots \to y_n$ ). Specifically,

\begin{align*} \mathbf{Ly=b} && \implies && \begin{bmatrix}l_{11} & 0 & \cdots & 0\\l_{21} & l_{22} & \cdots & 0\\\vdots & \vdots & \ddots & \vdots\\l_{n1} & l_{n2} & \cdots & l_{nn}\end{bmatrix} \begin{bmatrix} y_1\\y_2\\ \vdots \\y_n \end{bmatrix}= \begin{bmatrix} b_1\\ b_2\\ \vdots \\b_n \end{bmatrix} \end{align*}

Note that $y_1$ can be easily found as $y_1 = b_1/l_{11}$ , and substituting $y_1$ into the second equation solves $y_2$ . In general,

\begin{align*} y_i &= \frac{1}{l_{ii}}\left(b_i - \sum_{j=1}^{i-1} l_{ij}y_j\right),\qquad i=1,\dots,n \end{align*}

Finally, solve $\mathbf{L^\top x}=\mathbf{y}$ for $\mathbf{x}$ using backward substitution ( $x_n\to x_{n-1}\to\cdots\to x_1$ ). Specifically,

\begin{align*} \mathbf{L^\top x=y} && \implies && \begin{bmatrix} l_{11} & l_{21} & \cdots & l_{n1}\\ 0 & l_{22} & \cdots & l_{n2}\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & l_{nn} \end{bmatrix} \begin{bmatrix} x_1\\x_2\\ \vdots \\x_n \end{bmatrix} = \begin{bmatrix} y_1\\ y_2\\ \vdots \\y_n \end{bmatrix} \end{align*}

Now, $x_n$ can be easily found as $x_n=y_n/l_{nn}$ , and substituting $x_n$ into the second to last equation solves $x_{n-1}$ . In general,

\begin{align*}x_i &= \frac{1}{l_{ii}}\left(y_i - \sum_{j=i+1}^{n} l_{ji}x_j\right),\qquad i=n,\dots,1\end{align*}

In practice, we can use numpy.linalg.cholesky or similar packages to perform Cholesky decomposition. The time complexity is $O(n^3)$ .
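Putting the pieces together, a Cholesky-based solver might look like the sketch below. The helper names are hypothetical. The substitutions are written out explicitly to mirror the formulas, whereas production code would normally call an optimized triangular solver.

```python
import numpy as np

def forward_substitution(L, b):
    # Solve L y = b for lower-triangular L, from y_1 to y_n
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def backward_substitution(U, y):
    # Solve U x = y for upper-triangular U, from x_n to x_1
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

def cholesky_solve(A, b):
    # A = L L^T, then L y = b followed by L^T x = y
    L = np.linalg.cholesky(A)
    y = forward_substitution(L, b)
    return backward_substitution(L.T, y)

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
b = np.array([2.0, 1.0])
x = cholesky_solve(A, b)
print(x)
```

To solve an LLS problem this way, pass `A.T @ A` and `A.T @ b` as the two arguments.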

QR Solver

The QR factorization of a matrix $\mathbf{A}\in\mathbb{R}^{n\times n}$ is a decomposition into an orthogonal matrix and an upper triangular matrix, i.e., $\mathbf{A}=\mathbf{QR}$ , where

\begin{align*} \mathbf{Q}=[q_{ij}], \,\,\mathbf{Q^\top Q}=\mathbf{I}_n && \mathbf{R}=\begin{bmatrix}r_{11} & r_{12} & \cdots & r_{1n}\\0 & r_{22} & \cdots & r_{2n}\\\vdots & \vdots & \ddots & \vdots\\0 & 0 & \cdots & r_{nn}\end{bmatrix} \end{align*}

where the columns of $\mathbf{Q}$ can be found via Gram-Schmidt orthogonalization of the columns of $\mathbf{A}$ .

For a linear system of equations, $\mathbf{Ax}=\mathbf{b}$ , apply QR factorization and get $\mathbf{QRx=b}$ . Multiply both sides by $\mathbf{Q}^\top$ and get $\mathbf{Rx}=\mathbf{Q^\top b}$ . Letting $\mathbf{y}=\mathbf{Q^\top b}$ , we can solve $\mathbf{Rx=y}$ for $\mathbf{x}$ by backward substitution, i.e.,

\begin{align*}x_i &= \frac{1}{r_{ii}}\left(y_i - \sum_{j=i+1}^{n} r_{ij}x_j\right), \qquad i=n,\dots,1\end{align*}
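A QR-based solver can be sketched in the same spirit using `np.linalg.qr` (which relies on LAPACK rather than Gram-Schmidt). The function name `qr_solve` and the test matrices are assumptions for illustration. With the default reduced factorization, the same routine also handles a tall $\mathbf{A}$ directly, which yields the least-squares solution without forming $\mathbf{A^\top A}$ .

```python
import numpy as np

def qr_solve(A, b):
    # Reduced QR: Q has orthonormal columns, R is n x n upper triangular
    Q, R = np.linalg.qr(A)
    y = Q.T @ b
    n = R.shape[1]
    x = np.zeros(n)
    # Backward substitution on R x = y, from x_n to x_1
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

# Square system
A_sq = np.array([[2.0, 1.0, 0.0],
                 [1.0, 3.0, 1.0],
                 [0.0, 1.0, 4.0]])
b_sq = np.array([1.0, 2.0, 3.0])
x_sq = qr_solve(A_sq, b_sq)

# Tall system (least squares), random illustrative data
rng = np.random.default_rng(2)
A_tall = rng.standard_normal((10, 3))
b_tall = rng.standard_normal(10)
x_tall = qr_solve(A_tall, b_tall)
print(x_sq, x_tall)
```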

Non-linear Least Squares

In general, let $\mathbf{r}:\mathbb{R}^n\to \mathbb{R}^m,m\geq n$ be a smooth residual function. We can define an objective function $f:\mathbb{R}^n\to\mathbb{R}$ as

\begin{equation*} f(\mathbf{x})=\frac{1}{2}\left\|\mathbf{r}(\mathbf{x})\right\|^2=\frac{1}{2}\sum_{i=1}^{m}r_i^2(\mathbf{x}) \end{equation*}

The goal is still to find the minimizer of the function $f$ , i.e., $\mathbf{x}^{\star}=\argmin_{\mathbf{x}\in\mathbb{R}^n}f(\mathbf{x})$ . Unfortunately, there is in general no closed-form formula for $\mathbf{x}^\star$ as there was for the LLS problem. Common methods start from an initial guess and iteratively approach the optimum using the notion of a descent direction introduced earlier.

Specifically, we start with an initial guess $\mathbf{x}^0$ and, at each iterate $\mathbf{x}^k$ , linearize around $\mathbf{x}^k$ and find a descent direction $\mathbf{d}$ . We then update the guess by moving along $\mathbf{d}$ , i.e., $\mathbf{x}^{k+1}=\mathbf{x}^k+\mathbf{d}$ , expecting the objective to decrease, i.e., $f(\mathbf{x}^{k+1})<f(\mathbf{x}^k)$ . Hopefully, this process converges to a local minimum. Common methods include Gauss-Newton, Levenberg-Marquardt, and gradient descent, which differ in how they compute $\mathbf{d}$ .

Gauss-Newton Method

The steps of the Gauss-Newton method are as follows:

Start from an initial guess $\mathbf{x}^0$ , iterate until convergence $k=0, 1, 2, \dots$

Linearize the residual at the current guess $\mathbf{x}^k$ using a first-order Taylor expansion,
$\mathbf{r}(\mathbf{x}^k+\mathbf{d})\approx\mathbf{r}(\mathbf{x}^k)+\mathbf{J}(\mathbf{x}^k)\mathbf{d}$
where
$\begin{equation*} \mathbf{J}(\mathbf{x}) := \frac{\partial \mathbf{r(x)}}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial r_1}{\partial x_1} & \dfrac{\partial r_1}{\partial x_2} & \cdots & \dfrac{\partial r_1}{\partial x_n}\\[12pt] \dfrac{\partial r_2}{\partial x_1} & \dfrac{\partial r_2}{\partial x_2} & \cdots & \dfrac{\partial r_2}{\partial x_n} \\[6pt] \vdots & \vdots & \ddots & \vdots \\[6pt] \dfrac{\partial r_m}{\partial x_1} & \dfrac{\partial r_m}{\partial x_2} & \cdots & \dfrac{\partial r_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n} \end{equation*}$
is the Jacobian matrix. Let $\mathbf{J}_k\coloneqq \mathbf{J}(\mathbf{x}^k)$ .

Solve the resulting linear least squares problem for the step, $\mathbf{d}^{\star}=\argmin_\mathbf{d}\left\|\mathbf{r}(\mathbf{x}^k)+\mathbf{J}_k\mathbf{d}\right\|^2$ , by solving its normal equations
$\left(\mathbf{J}_k^\top \mathbf{J}_k\right) \mathbf{d}=-\mathbf{J}_k^\top \mathbf{r}\left(\mathbf{x}^k\right)$
As discussed before, this is a linear system of equations, and should be solved by using Cholesky or QR decomposition rather than inverting $\mathbf{J}_k^\top\mathbf{J}_k$ .

Update the current guess, $\mathbf{x}^{k+1}=\mathbf{x}^k+\mathbf{d}$ .

Implementation example:

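import numpy as np

def gauss_newton(x0, max_iter, tol):
    # residual(x) and jacobian(x) are assumed to be defined for the problem at hand
    x = x0.copy()
    for _ in range(max_iter):
        r = residual(x)  # residual vector, shape (m,)
        J = jacobian(x)  # Jacobian matrix, shape (m, n)
        d = np.linalg.solve(J.T @ J, -J.T @ r)  # Solve (J^T J) d = -J^T r
        x = x + d
        if np.linalg.norm(d) < tol:  # stop once the step is negligible
            break
    return x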

Gradient Descent

The steps of the gradient descent method are as follows:

Start from an initial guess $\mathbf{x}^0$ , iterate until convergence $k=0, 1, 2, \dots$

Compute the gradient of the objective function
$\nabla f(\mathbf{x}^k)=\mathbf{J}_k^\top \mathbf{r}(\mathbf{x}^k)$
- Derivation
  Note that
  $f(\mathbf{x})=\frac{1}{2}\left\|\mathbf{r}(\mathbf{x})\right\|^2=\frac{1}{2}\sum_{i=1}^{m}r_i^2\left(\mathbf{x}\right)$
  and, by the chain rule, the $j$ -th entry of $\nabla f$ is
  $\frac{\partial f}{\partial x_j}=\sum_{i=1}^{m}r_i\left(\mathbf{x}\right)\frac{\partial r_i}{\partial x_j}=\mathbf{J}_{:,j}^\top\mathbf{r}(\mathbf{x})$
  where $\mathbf{J}_{:,j}$ denotes the $j$ -th column of the Jacobian $\mathbf{J}(\mathbf{x})$ . Stacking all the entries of $\nabla f$ and evaluating at $\mathbf{x}^k$ gives the result $\nabla f(\mathbf{x}^k)=\mathbf{J}_k^\top \mathbf{r}(\mathbf{x}^k)$ .

Let the step $\mathbf{d}$ be the negative of the gradient, scaled by a factor called the learning rate $\eta$ , i.e., $\mathbf{d}=-\eta \mathbf{J}_k^\top \mathbf{r}(\mathbf{x}^k)$ .

Update the current guess, $\mathbf{x}^{k+1}=\mathbf{x}^k+\mathbf{d}$ .

Implementation example:

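import numpy as np

def gradient_descent(x0, max_iter, tol, eta):
    # residual(x) and jacobian(x) are assumed to be defined for the problem at hand
    x = x0.copy()
    for _ in range(max_iter):
        r = residual(x)  # residual vector, shape (m,)
        J = jacobian(x)  # Jacobian matrix, shape (m, n)
        d = -eta * J.T @ r  # Step along the negative gradient, scaled by eta
        x = x + d
        if np.linalg.norm(d) < tol:  # stop once the step is negligible
            break
    return x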

Levenberg-Marquardt Method

The steps of the Levenberg-Marquardt method are as follows:

Start from an initial guess $\mathbf{x}^0$ , iterate until convergence $k=0, 1, 2, \dots$

Linearize the residual at the current guess, $\mathbf{r}\left(\mathbf{x}^k+\mathbf{d}\right)\approx\mathbf{r}\left(\mathbf{x}^k\right)+\mathbf{J}_k\mathbf{d}$ .

Solve an $\ell_2$ -regularized linear least squares problem for the step, $\mathbf{d}^\star=\argmin_{\mathbf{d}}\left\|\mathbf{r}\left(\mathbf{x}^k\right)+\mathbf{J}_k\mathbf{d}\right\|^2+\lambda \left\|\mathbf{d}\right\|^2$ , whose solution satisfies
$\left(\mathbf{J}_k^\top \mathbf{J}_k+\lambda \mathbf{I}\right) \mathbf{d}=-\mathbf{J}_k^\top \mathbf{r}\left(\mathbf{x}^k\right)$
- Derivation
  Denote $\mathbf{r}_k\coloneqq \mathbf{r}\left(\mathbf{x}^k\right)$ , the new objective function is
  $\begin{align*} g(\mathbf{d}) &= (\mathbf{r}_k + \mathbf{J}_k \mathbf{d})^\top(\mathbf{r}_k + \mathbf{J}_k \mathbf{d}) + \lambda\,\mathbf{d}^\top\mathbf{d}\\ &= \mathbf{r}_k^\top\mathbf{r}_k + \mathbf{r}_k^\top\mathbf{J}_k\mathbf{d} + \mathbf{d}^\top\mathbf{J}_k^\top\mathbf{r}_k + \mathbf{d}^\top\mathbf{J}_k^\top\mathbf{J}_k\mathbf{d} + \lambda\mathbf{d}^\top\mathbf{d}\\ &= \mathbf{r}_k^\top\mathbf{r}_k + 2\,\mathbf{r}_k^\top\mathbf{J}_k\mathbf{d} + \mathbf{d}^\top\mathbf{J}_k^\top\mathbf{J}_k\mathbf{d} + \lambda\,\mathbf{d}^\top\mathbf{d} \end{align*}$
  Take the gradient of $g$ , and let $\nabla g(\mathbf{d})=\mathbf{0}$
  $\begin{align*} \nabla g(\mathbf{d}) = 2\,\mathbf{J}_k^\top\mathbf{r}_k + 2\,\mathbf{J}_k^\top\mathbf{J}_k\,\mathbf{d} + 2\lambda\,\mathbf{d} = \mathbf{0} && \implies && \boxed{(\mathbf{J}_k^\top\mathbf{J}_k + \lambda\mathbf{I})\mathbf{d} = -\mathbf{J}_k^\top\mathbf{r}(\mathbf{x}^k)} \end{align*}$

Update the current guess, $\mathbf{x}^{k+1}=\mathbf{x}^k+\mathbf{d}$ .

Implementation example:

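import numpy as np

def levenberg_marquardt(x0, max_iter, tol, lam):
    # residual(x) and jacobian(x) are assumed to be defined for the problem at hand
    x = x0.copy()
    for _ in range(max_iter):
        r = residual(x)  # residual vector, shape (m,)
        J = jacobian(x)  # Jacobian matrix, shape (m, n)
        A = J.T @ J + lam * np.eye(len(x))  # damped normal matrix, lam held fixed
        d = np.linalg.solve(A, -J.T @ r)  # Solve (J^T J + λI) d = -J^T r
        x = x + d
        if np.linalg.norm(d) < tol:  # stop once the step is negligible
            break
    return x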

Note that the Levenberg-Marquardt method is an interpolation between Gauss-Newton and gradient descent.

As $\lambda\to 0$ , it reduces to Gauss-Newton: $\left(\mathbf{J}_k^\top \mathbf{J}_k+\lambda \mathbf{I}\right) \to \mathbf{J}_k^\top \mathbf{J}_k$ .

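As $\lambda \to \infty$ , it approaches gradient descent with a small learning rate $\eta=1/\lambda$ : $\left(\mathbf{J}_k^\top \mathbf{J}_k+\lambda \mathbf{I}\right) \to \lambda \mathbf{I}$ , so $\mathbf{d}\approx-\frac{1}{\lambda}\mathbf{J}_k^\top \mathbf{r}(\mathbf{x}^k)$ .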
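Both limits can be observed numerically. In the sketch below, the Jacobian and residual are arbitrary illustrative values and `lm_step` is a hypothetical helper. A tiny $\lambda$ reproduces the Gauss-Newton step, and a huge $\lambda$ reproduces a gradient step of length scale $1/\lambda$ .

```python
import numpy as np

def lm_step(J, r, lam):
    # Solve (J^T J + lam * I) d = -J^T r
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + lam * np.eye(n), -J.T @ r)

# Illustrative Jacobian and residual
J = np.array([[-1.0, 0.0],
              [24.0, 10.0]])
r = np.array([2.2, -4.4])

# Gauss-Newton step and LM step with a tiny damping factor
d_gn = np.linalg.solve(J.T @ J, -J.T @ r)
d_small = lm_step(J, r, 1e-10)

# LM step with a huge damping factor vs. gradient step with eta = 1 / lam
lam_big = 1e8
d_big = lm_step(J, r, lam_big)
d_gd = -(1.0 / lam_big) * (J.T @ r)

print(d_gn, d_small)
print(d_big, d_gd)
```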

Example

Now we compare these three methods using an example. Consider the following residual and objective, where the objective is the classic Rosenbrock function scaled by $\frac{1}{2}$ :

\begin{align*} \mathbf{r}(\mathbf{x})=\begin{bmatrix} 1-x_1\\ 10(x_2-x_1^2) \end{bmatrix}, && f(\mathbf{x})=\frac{1}{2}\left[(1-x_1)^2+100(x_2-x_1^2)^2\right] \end{align*}

The Jacobian is

\begin{equation*} \mathbf{J}(\mathbf{x})=\begin{bmatrix} -1 & 0\\ -20x_1 & 10 \end{bmatrix} \end{equation*}
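Hand-derived Jacobians are a common source of bugs, so it can be worth comparing them with finite differences before running any solver. In the sketch below, `residual` and `jacobian` implement the formulas of this example, while `numerical_jacobian` and the test point are assumptions added for the check.

```python
import numpy as np

def residual(x):
    # r(x) = [1 - x1, 10 (x2 - x1^2)]
    return np.array([1.0 - x[0], 10.0 * (x[1] - x[0]**2)])

def jacobian(x):
    # Analytic Jacobian of the residual
    return np.array([[-1.0, 0.0],
                     [-20.0 * x[0], 10.0]])

def numerical_jacobian(fun, x, eps=1e-6):
    # Central differences, one column per input coordinate
    cols = [(fun(x + eps * e) - fun(x - eps * e)) / (2 * eps)
            for e in np.eye(len(x))]
    return np.column_stack(cols)

x = np.array([-1.2, 1.0])   # arbitrary test point
J_num = numerical_jacobian(residual, x)
print(jacobian(x))
print(J_num)
```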

We can visualize the steps taken by each of the methods.

The convergence rates of these three methods are illustrated in the following figure.

A quantitative summary (maximum number of iterations set to 1000 and tolerance to 1e-12) is:

  Gauss-Newton:               3 iterations → f = 0.00e+00
  Levenberg-Marquardt:      144 iterations → f = 2.50e-24
  Gradient Descent:        1000 iterations → f = 2.27e-01
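A comparison of this kind can be set up along the following lines. The initial guess, learning rate `eta`, and damping factor `lam` are not stated in this post, so the values below are assumptions and the printed iteration counts and objective values will generally differ from the summary. The `run` helper is also hypothetical.

```python
import numpy as np

def residual(x):
    return np.array([1.0 - x[0], 10.0 * (x[1] - x[0]**2)])

def jacobian(x):
    return np.array([[-1.0, 0.0],
                     [-20.0 * x[0], 10.0]])

def objective(x):
    r = residual(x)
    return 0.5 * r @ r

def run(step, x0, max_iter=1000, tol=1e-12):
    # Generic iteration: x <- x + d, where d = step(r, J)
    x = np.asarray(x0, dtype=float).copy()
    for k in range(1, max_iter + 1):
        d = step(residual(x), jacobian(x))
        x = x + d
        if np.linalg.norm(d) < tol:
            break
    return x, k

eta, lam = 1e-3, 1.0           # assumed hyperparameters
x0 = np.array([-1.2, 1.0])     # assumed initial guess

steps = {
    "Gauss-Newton": lambda r, J: np.linalg.solve(J.T @ J, -J.T @ r),
    "Levenberg-Marquardt": lambda r, J: np.linalg.solve(
        J.T @ J + lam * np.eye(2), -J.T @ r),
    "Gradient Descent": lambda r, J: -eta * (J.T @ r),
}

results = {name: run(step, x0) for name, step in steps.items()}
for name, (x, k) in results.items():
    print(f"{name:>20}: {k:4d} iterations, f = {objective(x):.2e}")
```

Gauss-Newton does particularly well on this problem because the Jacobian is square and invertible, so each step solves the linearized system $\mathbf{r}+\mathbf{J}\mathbf{d}=\mathbf{0}$ exactly and the residual is linear in $x_2$ .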


Guanyu Xu
