require(fitdistrplus)
require(actuar)
require(knitr)

1 Quick overview of main optimization methods

For a good introduction to optimization, we refer to books such as Nocedal & Wright (2006) or Bonnans, Gilbert, Lemaréchal & Sagastizábal (2006). We consider the problem \(\min_x f(x)\) for \(x\in\mathbb{R}^n\).

1.1 Derivative-free optimization methods

The Nelder-Mead method is one of the best-known derivative-free methods, using only values of \(f\) to search for the minimum. It consists in building a simplex of \(n+1\) points and moving/shrinking this simplex in a good direction.

The Nelder-Mead method is available in optim. By default, in optim, \(\alpha=1\), \(\beta=1/2\), \(\gamma=2\) and \(\sigma=1/2\).
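As a minimal illustration (not from the original text), Nelder-Mead can be selected in optim; here we minimize the Rosenbrock function, whose minimum is at \((1,1)\):

```r
# Rosenbrock banana function: global minimum at (1, 1)
fr <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
res <- optim(par = c(-1.2, 1), fn = fr, method = "Nelder-Mead")
res$par          # approximately c(1, 1)
res$convergence  # 0 indicates successful convergence
```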

1.2 Hessian-free optimization methods

For a smooth non-linear function, the following approach is generally used: a local method combined with a line search, working on the scheme \(x_{k+1} =x_k + t_k d_{k}\), where the local method specifies the direction \(d_k\) and the line search specifies the step size \(t_k \in \mathbb{R}\).

1.2.1 Computing the direction \(d_k\)

A desirable property for \(d_k\) is that \(d_k\) ensures a descent \(f(x_{k+1}) < f(x_{k})\). Newton methods are such that \(d_k\) minimizes a local quadratic approximation of \(f\) based on a Taylor expansion, that is \(q_f(d) = f(x_k) + g(x_k)^Td +\frac{1}{2} d^T H(x_k) d\) where \(g\) denotes the gradient and \(H\) denotes the Hessian.

The Newton method consists in using the exact solution of the local minimization problem, \(d_k = - H(x_k)^{-1} g(x_k)\).
In practice, other methods are preferred (at least to ensure positive definiteness). Quasi-Newton methods approximate the Hessian by a matrix \(H_k\) computed as a function of \(H_{k-1}\), \(x_k\), \(f(x_k)\), and then \(d_k\) solves the system \(H_k d = - g(x_k)\). Some implementations may instead directly approximate the inverse of the Hessian by a matrix \(W_k\) in order to compute \(d_k = -W_k g(x_k)\). Using the Sherman-Morrison-Woodbury formula, we can switch between \(W_k\) and \(H_k\).

To determine \(H_k\) (or \(W_k\)), it must first verify the secant equation \(H_k s_k = y_k\), or equivalently \(s_k = W_k y_k\), where \(y_k = g_{k+1}-g_k\) and \(s_k=x_{k+1}-x_k\). To pin down the remaining terms, we generally impose symmetry and a minimum-distance condition. We say we have a rank-2 update if \(H_k = H_{k-1} + a u u^T + b v v^T\) and a rank-1 update if \(H_k = H_{k-1} + a u u^T\). Rank-\(n\) updates are justified by the spectral decomposition theorem.

There are two classical rank-2 updates which are symmetric and preserve positive definiteness: the DFP and BFGS schemes.
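For completeness, these two updates can be written in their standard form (not stated in the original text; BFGS updates the Hessian approximation directly, while DFP is most naturally written on the inverse approximation): \[ H_k^{BFGS} = H_{k-1} - \frac{H_{k-1} s\, s^T H_{k-1}}{s^T H_{k-1} s} + \frac{y\, y^T}{y^T s}, \qquad W_k^{DFP} = W_{k-1} - \frac{W_{k-1} y\, y^T W_{k-1}}{y^T W_{k-1} y} + \frac{s\, s^T}{y^T s}, \] where \(s\) and \(y\) denote the latest step and gradient differences.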

In R, the so-called BFGS scheme is implemented in optim; the DFP scheme is not available in base R.
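For instance (an illustrative sketch), BFGS with a user-supplied analytic gradient on the Rosenbrock function:

```r
# Rosenbrock function and its exact gradient
fr <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
grr <- function(x) c(-400 * x[1] * (x[2] - x[1]^2) - 2 * (1 - x[1]),
                     200 * (x[2] - x[1]^2))
res <- optim(par = c(-1.2, 1), fn = fr, gr = grr, method = "BFGS")
res$par     # approximately c(1, 1)
res$counts  # number of calls to fr and to grr
```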

Another possible method, which originally arose from quadratic problems, is the nonlinear conjugate gradient. It consists in computing directions \((d_0, \dots, d_k)\) that are conjugate with respect to a matrix close to the true Hessian \(H(x_k)\). Directions are computed iteratively by \(d_k = -g(x_k) + \beta_k d_{k-1}\) for \(k>1\), once initiated by \(d_1 = -g(x_1)\). The coefficients \(\beta_k\) are updated according to a scheme, e.g. the Fletcher-Reeves update \(\beta_k = \frac{g_k^T g_k}{g_{k-1}^T g_{k-1}}\) or the Polak-Ribière update \(\beta_k = \frac{g_k^T (g_k - g_{k-1})}{g_{k-1}^T g_{k-1}}\), where \(g_k = g(x_k)\).

There also exist three-term formulas for computing the direction, \(d_k = -g(x_k) + \beta_k d_{k-1}+\gamma_{k} d_t\) for \(t<k\). A possible scheme is the Beale-Sorenson update, defined as \(\beta_k = \frac{ g_k^T (g_k-g_{k-1} )}{d^T_{k-1}(g_{k}- g_{k-1})}\) and \(\gamma_k = \frac{ g_k^T (g_{t+1}-g_{t} )}{d^T_{t}(g_{t+1}- g_{t})}\) if \(k>t+1\), and \(\gamma_k=0\) otherwise. See Yuan (2006) for other well-known schemes such as Hestenes-Stiefel, Dixon or Conjugate-Descent. These three updates of the (non-linear) conjugate gradient are available in optim.
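In optim, the conjugate-gradient update is selected through the type component of control (1 for Fletcher-Reeves, 2 for Polak-Ribière, 3 for Beale-Sorenson); a minimal sketch on a simple quadratic:

```r
# Minimize a simple quadratic (minimum at (2, 2)) with the three CG updates
f <- function(x) sum((x - 2)^2)
for (type in 1:3) {
  res <- optim(par = c(0, 0), fn = f, method = "CG", control = list(type = type))
  cat("type", type, "par:", round(res$par, 4), "\n")
}
```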

1.2.2 Computing the stepsize \(t_k\)

Let \(\phi_k(t) = f(x_k + t d_k)\) for a given direction/iterate pair \((d_k, x_k)\). We need conditions under which a stepsize \(t_k\) is satisfactory. In the literature, we consider the descent condition \(\phi_k'(0) < 0\) and the Armijo condition \(\phi_k(t) \leq \phi_k(0) + t c_1 \phi_k'(0)\), which ensures a sufficient decrease of \(f\). Nocedal & Wright (2006) present a backtracking (or geometric) approach satisfying the Armijo condition and a minimal-progress condition, i.e. the Goldstein and Price conditions.
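The backtracking approach can be sketched as follows (illustrative code, not the actual implementation inside optim): start from a trial stepsize and shrink it geometrically until the Armijo condition holds.

```r
# Backtracking line search satisfying the Armijo condition
# f: objective, g: gradient, x: current iterate, d: descent direction
# c1: Armijo constant, rho: geometric reduction factor (illustrative values)
backtrack <- function(f, g, x, d, c1 = 1e-4, rho = 0.5, t = 1) {
  slope <- sum(g(x) * d)          # phi'(0) = g(x)^T d, negative for a descent direction
  while (f(x + t * d) > f(x) + c1 * t * slope)
    t <- rho * t                  # shrink the step geometrically
  t
}
# usage: quadratic objective with a steepest-descent direction
f <- function(x) sum(x^2)
g <- function(x) 2 * x
backtrack(f, g, x = c(1, 1), d = -g(c(1, 1)))
```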

This backtracking linesearch is available in optim.

1.3 Benchmark

To simplify the benchmark of optimization methods, we create a fitbench function that applies the desired estimation method with each optimization method. This function is currently not exported.

fitbench <- function(data, distr, method, grad=NULL, control=list(trace=0, REPORT=1, maxit=1000), lower=-Inf, upper=+Inf, ...) 
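Since the actual fitbench relies on fitdistrplus internals, here is a hypothetical, self-contained sketch of the idea: loop over optim methods on a given objective, with or without an analytic gradient, and collect the fitted parameters and counts (fitbench_sketch and its arguments are illustrative names, not the real API).

```r
# Hypothetical simplified benchmark over optim methods
fitbench_sketch <- function(fn, start, grad = NULL,
                            methods = c("Nelder-Mead", "BFGS", "CG")) {
  sapply(methods, function(meth) {
    res <- optim(par = start, fn = fn, gr = grad, method = meth)
    c(res$par, value = res$value, fn.count = res$counts[["function"]])
  })
}
# usage: benchmark on a simple quadratic with minimum at (1, 2)
fitbench_sketch(function(p) (p[1] - 1)^2 + (p[2] - 2)^2, start = c(0, 0))
```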

2 Log-likelihood function and its gradient for beta distribution

2.1 Theoretical value

The density of the beta distribution is given by \[ f(x; \delta_1,\delta_2) = \frac{x^{\delta_1-1}(1-x)^{\delta_2-1}}{\beta(\delta_1,\delta_2)}, \] where \(\beta\) denotes the beta function, see the NIST Handbook of mathematical functions. We recall that \(\beta(a,b)=\Gamma(a)\Gamma(b)/\Gamma(a+b)\). Hence the log-likelihood for a set of observations \((x_1,\dots,x_n)\) is \[ \log L(\delta_1,\delta_2) = (\delta_1-1)\sum_{i=1}^n\log(x_i)+ (\delta_2-1)\sum_{i=1}^n\log(1-x_i)- n \log(\beta(\delta_1,\delta_2)). \] The gradient with respect to \(\delta_1\) and \(\delta_2\) is \[ \nabla \log L(\delta_1,\delta_2) = \left(\begin{matrix} \sum\limits_{i=1}^n\ln(x_i) - n\psi(\delta_1)+n\psi( \delta_1+\delta_2) \\ \sum\limits_{i=1}^n\ln(1-x_i)- n\psi(\delta_2)+n\psi( \delta_1+\delta_2) \end{matrix}\right), \] where \(\psi(x)=\Gamma'(x)/\Gamma(x)\) is the digamma function, see the NIST Handbook of mathematical functions.

2.2 R implementation

As in the fitdistrplus package, we minimize the opposite of the log-likelihood; accordingly, we implement the opposite of the gradient in grlnlbeta.

lnL <- function(par, fix.arg, obs, ddistnam) 
  fitdistrplus:::loglikelihood(par, fix.arg, obs, ddistnam) 
grlnlbeta <- fitdistrplus:::grlnlbeta
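As a sanity check (illustrative code; grbeta re-implements the gradient formulas of the previous section rather than calling the internal grlnlbeta), the theoretical gradient can be compared with a central finite-difference approximation of the opposite log-likelihood:

```r
# Opposite of the beta log-likelihood gradient, from the formulas above
grbeta <- function(par, obs) {
  n <- length(obs)
  a <- par[1]; b <- par[2]
  -c(sum(log(obs))     - n * digamma(a) + n * digamma(a + b),
     sum(log(1 - obs)) - n * digamma(b) + n * digamma(a + b))
}
# opposite log-likelihood and its central finite-difference gradient at (2, 1.5)
nllh <- function(par, obs) -sum(dbeta(obs, par[1], par[2], log = TRUE))
set.seed(1)
obs <- rbeta(50, 2, 1.5)
eps <- 1e-6
fd <- c((nllh(c(2 + eps, 1.5), obs) - nllh(c(2 - eps, 1.5), obs)) / (2 * eps),
        (nllh(c(2, 1.5 + eps), obs) - nllh(c(2, 1.5 - eps), obs)) / (2 * eps))
max(abs(fd - grbeta(c(2, 1.5), obs)))  # should be tiny
```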

3 Numerical application to beta distribution

3.1 Random generation of a sample

#(1) beta distribution
n <- 200
set.seed(12345)
x <- rbeta(n, 3, 3/4)
grlnlbeta(c(3, 4), x) #test
## [1] -132.5493  317.2893
hist(x, prob=TRUE)
lines(density(x), col="red")
curve(dbeta(x, 3, 3/4), col="green", add=TRUE)
legend("topleft", lty=1, col=c("red","green"), legend=c("empirical", "theoretical"))

3.2 Fit Beta distribution

Define control parameters.

ctr <- list(trace=0, REPORT=1, maxit=1000)

Call mledist with the default optimization function (optim implemented in stats package) with and without the gradient for the different optimization methods.

unconstropt <- fitbench(x, "beta", "mle", grad=grlnlbeta)

In the case of constrained optimization, mledist permits the direct use of the constrOptim function (also implemented in the stats package), which allows linear inequality constraints via a logarithmic barrier.

We use an exp/log transformation of the shape parameters \(\delta_1\) and \(\delta_2\) to ensure that they remain strictly positive.

dbeta2 <- function(x, shape1, shape2, log)
  dbeta(x, exp(shape1), exp(shape2), log=log)
#take the log of the starting values
startarg <- lapply(fitdistrplus:::start.arg.default(x, "beta"), log)
#redefine the gradient for the new parametrization
grbetaexp <- function(par, obs, ...) 
    grlnlbeta(exp(par), obs) * exp(par)
    

expopt <- fitbench(x, distr="beta2", method="mle", grad=grbetaexp, start=startarg) 
#get back to original parametrization
expopt[c("fitted shape1", "fitted shape2"), ] <- exp(expopt[c("fitted shape1", "fitted shape2"), ])

Then we extract the values of the fitted parameters, the value of the corresponding log-likelihood, and the number of calls to the function to minimize and to its gradient (whether the theoretical gradient or a numerical approximation).

3.3 Results of numerical investigation

Results are displayed in the following tables.

Unconstrained optimization:

|                 |      BFGS |        NM |      CGFR |     CGPR |      CGBS |    G-BFGS |    G-CGFR |    G-CGPR |    G-CGBS |
|-----------------|----------:|----------:|----------:|---------:|----------:|----------:|----------:|----------:|----------:|
| fitted shape1   | 2.6645458 |  2.664132 | 2.6645398 | 2.664538 | 2.6645416 | 2.6645434 | 2.6645433 | 2.6645432 | 2.6645433 |
| fitted shape2   | 0.7310305 |  0.730979 | 0.7310292 | 0.731029 | 0.7310294 | 0.7310297 | 0.7310297 | 0.7310297 | 0.7310297 |
| fitted loglik   | 114.1654870 | 114.165486 | 114.1654870 | 114.165487 | 114.1654870 | 114.1654870 | 114.1654870 | 114.1654870 | 114.1654870 |
| func. eval. nb. |        23 |        47 |       240 |      263 |       183 |        20 |       249 |       225 |       138 |
| grad. eval. nb. |         5 |        NA |        57 |       69 |        47 |         5 |        71 |        69 |        43 |
| time (sec)      |     0.010 |     0.010 |     0.068 |    0.079 |     0.055 |     0.026 |     0.233 |     0.218 |     0.144 |

With the exponential transformation of the parameters:

|                 |      BFGS |        NM |      CGFR |     CGPR |      CGBS |    G-BFGS |    G-CGFR |    G-CGPR |    G-CGBS |
|-----------------|----------:|----------:|----------:|---------:|----------:|----------:|----------:|----------:|----------:|
| fitted shape1   | 2.6645422 | 2.6642412 | 2.6645420 | 2.6645418 | 2.6645433 | 2.6645437 | 2.6645434 | 2.6645434 | 2.6645434 |
| fitted shape2   | 0.7310296 | 0.7310035 | 0.7310293 | 0.7310294 | 0.7310296 | 0.7310298 | 0.7310297 | 0.7310297 | 0.7310297 |
| fitted loglik   | 114.1654870 | 114.1654863 | 114.1654870 | 114.1654870 | 114.1654870 | 114.1654870 | 114.1654870 | 114.1654870 | 114.1654870 |
| func. eval. nb. |        18 |        41 |       131 |      116 |       134 |        20 |       175 |       125 |       112 |
| grad. eval. nb. |         5 |        NA |        27 |       29 |        35 |         5 |        39 |        41 |        35 |
| time (sec)      |     0.008 |     0.008 |     0.037 |    0.036 |     0.042 |     0.026 |     0.140 |     0.138 |     0.150 |

Using llsurface, we plot the log-likelihood surface around the true value (green) and the fitted parameters (red).

llsurface(min.arg=c(0.1, 0.1), max.arg=c(7, 3), 
          plot.arg=c("shape1", "shape2"), nlev=25,
          plot.np=50, data=x, distr="beta")
points(unconstropt[1,"BFGS"], unconstropt[2,"BFGS"], pch="+", col="red")
points(3, 3/4, pch="x", col="green")

We can simulate bootstrap replicates using the bootdist function.

b1 <- bootdist(fitdist(x, "beta", method="mle", optim.method="BFGS"), niter=100, parallel="snow", ncpus=2)
summary(b1)
## Parametric bootstrap medians and 95% percentile CI 
##           Median      2.5%     97.5%
## shape1 2.7314218 2.2724135 3.2834089
## shape2 0.7498259 0.6518008 0.8884238
plot(b1, enhance=TRUE, trueval=c(3, 3/4))

3.4 Conclusion

Based on the beta example, we observe that all methods converge to the same point. This is reassuring.

unconstropt[1:2,"BFGS"]
## fitted shape1 fitted shape2 
##     2.6645458     0.7310305

However, the number of function evaluations (and gradient evaluations) differs widely from one method to another. Furthermore, specifying the true gradient of the log-likelihood does not help the fitting procedure at all and generally slows down convergence. The best method evaluates the log-likelihood only 18 times.

resall <- cbind(unconstropt, expopt)
cmin <- min(resall["func. eval. nb.", ], na.rm=TRUE)
colnames(resall)[resall["func. eval. nb.", ] == cmin]
## [1] "BFGS"

That is, BFGS with the exponential transformation of the parameters. Since the exponential function is differentiable, the asymptotic properties are still preserved (by the delta method), but for finite samples this may produce a small bias. The second-best method is BFGS without gradient, constraint or exponential transformation. The difference between BFGS with and without the exponential transformation in terms of fitted values is negligible.

4 Zero-inflated negative binomial

TODO: numerical application to an actuarial dataset.