# Penalized regression using only first-semester calculus

### Summary

Modern data analyses often involve too many regressors and not enough observations. In these situations, the OLS regression estimator is highly variable and can lead to poor predictions. A popular alternative estimator is the Lasso regression estimator. In this post I explain how the lasso regression estimator may be computed by iterating the following two lines of code:

```
u<- solve( Q * v%*%t(v) + lambda*diag(p)/2 )%*%( l*v )
v<- solve( Q * u%*%t(u) + lambda*diag(p)/2 )%*%( l*u )
```

Understanding why this iterative algorithm works requires only a basic understanding of linear regression and first-semester calculus. Specifically, the only fact from optimization you really need to know is that

The full article describing this idea and algorithm is available at arXiv:1611.00040. Below is a synopsis and brief example.

### Penalized linear regression

#### Least squares estimation

If you are familiar with the linear regression model you probably know that the OLS regression estimate $\hat \beta_{\text{ols}}$ is the minimizer of the residual sum of squares

It is convenient to rewrite the residual sum of squares as

Since the difference between the RSS at any two values of $\beta$ is not affected by the value of $||y||^2$, we can write

Using calculus or orthogonality considerations, you can show $\hat \beta_{\text{ols}}$ will satisfy $2 Q \hat\beta_{\text{ols}} = 2 l$. If $Q$ is invertible, then

#### Ridge regression

As described above, the OLS estimate has fallen out
of fashion, and people nowadays prefer ~~Bayesian~~
penalized estimates given by

where $f(\beta)$ is some penalty function. One popular penalty function is an $L_2$ or “ridge” penalty, given by $f(\beta) = \lambda\beta^\top \beta$ for some $\lambda>0$. This penalty can also be written as $f(\beta) = \lambda \sum |\beta_j|^2$. For this ridge penalty, the penalized regression estimate is

where the last line follows by using the same logic as used to obtain the OLS estimator, with $Q$ replaced by $Q+\lambda I$. The resulting estimator is called a ridge regression estimator, and happens to be equal to the posterior mean estimator of $\beta$ under a $N(0,1/\lambda)$ prior distribution for the elements of $\beta$.

#### Lasso regression

Another popular penalty on $\beta$ is the $L_1$ or lasso penalty, given by $f(\beta) =\lambda \sum |\beta_j|$. A lasso estimate is given by

Unfortunately there is no closed-form expression for the lasso estimator. To compute it, you need to do one of the following:

- Learn convex optimization and then write an algorithm;
- Use a canned algorithm that you don’t understand;
- Use a trick that only requires knowing first-semester undergrad calculus.

The third option is explained below.

### The Hadamard product parametrization of the $L_1$ penalty

#### The math

The calculus you need to know to understand the trick is as follows: Let $h(x) = x + a/x$ for $x>0$ and some fixed $a>0$. Here is a plot of this function for $a=2$:

This looks convex. Let’s take a derivative and see where it is zero:

So the only critical point is $x= \sqrt{a}$. The picture suggests that this is the function’s minimum, but that’s not enough for full credit on the calculus quiz. Let’s take another derivative:

That’s positive, so indeed $x=\sqrt{a}$ is the minimum of this function. So we have

#### The reparametrization

Now suppose we have a single (scalar) $\beta$ and want to minimize some function $\tilde f(\beta) = f(\beta) + \lambda |\beta|$. Here is one way to do it: Write $\beta=uv$ and find values of $u$ and $v$ that minimize the function

[ \tilde g(u,v) = f(uv) + \lambda u^2/2 + \lambda v^2/2.]

Let’s see why this works:

The first line follows by letting $\beta= uv$. The third line follows from the calculus result above.

#### The generalization

Now let $\beta$ be a vector, and reparametrize as $\beta=u\circ v$ where “$\circ$” is the Hadamard (elementwise) product of the vectors $u$ and $v$. Applying the same logic as above, it follows that

[ \min_\beta f(\beta) + \lambda \sum |\beta_j| = \min_{u,v} f(uv) + \lambda u^\top u /2 + \lambda v^\top v/2 ]

Why might this be helpful? The function to optimize on the left-hand side is convex (if $f$ is) but not differentiable. You need to take more math classes if you want to understand how to optimize this function directly. Alternatively, the function to optimize on the left hand side is differentiable, and can be optimized by iteratively minimizing in $u$ and $v$.

#### Lasso estimates via alternating ridge regressions

Let’s return to the $L_1$ penalized linear regression problem:

As we discussed above, $\hat \beta = \hat u\circ \hat v$ where [ (\hat u,\hat v) = \arg \min_{u,v} \ (u\circ v)^\top Q (u\circ v) - 2 (u\circ v )^\top l + \tfrac{\lambda}{2}u^\top u + \tfrac{\lambda}{2}v^\top v . ] The optimal $u$ for fixed $v$ is

The third line follows from the second by noting that the second line is equivalent to the ridge regression criterion where $u$ is the parameter. A similar result holds for the optimal value of $v$ given $u$. An alternating ridge regression algorithm for finding the lasso estimate $\hat \beta = \hat u\circ \hat v$ is therefore to iterate the following until convergence.

- Set $u =(Q \circ v v^\top + I \lambda/2 )^{-1} (v\circ l)$.
- Set $v =(Q \circ u u^\top + I \lambda/2 )^{-1} (u\circ l)$.

### Numerical example

Let’s try this out with a numerical example - an analysis of some data on diabetes progression. We’ll hold out the first 100 observations to use as a test set, with which we will evaluate the predictive performance of the estimators we obtain.

```
load(url("http://www2.stat.duke.edu/~pdh10/Datasets/yX_diabetes"))
y<-yX_diabetes[-(1:100),1] ; ytest<-yX_diabetes[1:100,1]
X<-yX_diabetes[-(1:100),-1] ; Xtest<-yX_diabetes[1:100,-1]
dim(X)
```

```
## [1] 342 64
```

```
colnames(X)
```

```
## [1] "age" "sex" "bmi" "map" "tc" "ldl" "hdl"
## [8] "tch" "ltg" "glu" "age^2" "bmi^2" "map^2" "tc^2"
## [15] "ldl^2" "hdl^2" "tch^2" "ltg^2" "glu^2" "age:sex" "age:bmi"
## [22] "age:map" "age:tc" "age:ldl" "age:hdl" "age:tch" "age:ltg" "age:glu"
## [29] "sex:bmi" "sex:map" "sex:tc" "sex:ldl" "sex:hdl" "sex:tch" "sex:ltg"
## [36] "sex:glu" "bmi:map" "bmi:tc" "bmi:ldl" "bmi:hdl" "bmi:tch" "bmi:ltg"
## [43] "bmi:glu" "map:tc" "map:ldl" "map:hdl" "map:tch" "map:ltg" "map:glu"
## [50] "tc:ldl" "tc:hdl" "tc:tch" "tc:ltg" "tc:glu" "ldl:hdl" "ldl:tch"
## [57] "ldl:ltg" "ldl:glu" "hdl:tch" "hdl:ltg" "hdl:glu" "tch:ltg" "tch:glu"
## [64] "ltg:glu"
```

First, the OLS estimates:

```
mean( (ytest - Xtest%*%beta_ols)^2 )
```

```
## [1] 0.5966967
```

A lot of the estimated coefficients are close to zero, but of course not quite zero. We could just infer that these estimated effects are “small”, but that doesn’t sound very sophisticated. Instead, let’s use the Hadamard product parametrization to obtain sparse lasso estimates. Here is the code to set up the algorithm:

```
Q<-crossprod(X)
l<-crossprod(X,y)
Il<-diag(ncol(X))*lambda
v<-sqrt(abs(fit_ols$coef))
```

The object `Il`

is just a diagonal matrix
times the penalty parameter $\lambda$. What is
the value of $\lambda$ and where did it come from?
The value is 14.26, and this is
a moment-based empirical Bayes estimate.
Here is the optimization algorithm:

```
for(s in 1:100)
{
u<- solve( Q * v%*%t(v) + Il/2 )%*%( l*v )
v<- solve( Q * u%*%t(u) + Il/2 )%*%( l*u )
}
beta_l1p<-u*v
```

(This code can be sped-up by using Cholesky factorizations - see the code on my website for details).

Here are the resulting lasso estimates. As can be seen, many of the estimated coefficients are zero.

The resulting sparse estimate provide slightly improved predictive performance, as compared to the OLS estimate:

```
mean( (ytest - Xtest%*%beta_l1p)^2 )
```

```
## [1] 0.5337473
```

Still, there are a lot of coefficient estimates that are small, but not quite zero. Can’t we zap these away too?

We’ve seen how writing $\beta = u\circ v$ leads to penalized (lasso) estimates of $\beta$. Maybe to get more penalization we could write $\beta$ as a product of more things:

```
v<-w<-x<-(abs(beta_ols))^(.25)
for(s in 1:100)
{
u<- solve( Q * (v*w*x)%*%t(v*w*x) + Il/4 )%*%( l*v*w*x )
v<- solve( Q * (u*w*x)%*%t(u*w*x) + Il/4 )%*%( l*u*w*x )
w<- solve( Q * (u*v*x)%*%t(u*v*x) + Il/4 )%*%( l*u*v*x )
x<- solve( Q * (u*v*w)%*%t(u*v*w) + Il/4 )%*%( l*u*v*w )
}
beta_lhp<-u*v*w*x
```

Indeed, this has led to a “stronger” penalty in the sense that now we have mostly coefficients that are zero, and a few coefficients that are reasonably far from zero. This is very aesthetically pleasing, and even better, leads to improved predictive performance:

```
mean( (ytest - Xtest%*%beta_lhp)^2 )
```

```
## [1] 0.5187066
```

What is this crazy estimate, that was obtained by writing
$\beta$ as the Hadamard product of four quantities?
With a similar calculus trick that was used above,
you can show that the resulting estimate
`beta_lhp`

is a local minimizer of the $L_{1/2}$-penalized
residual sum of squares:

[ \hat \beta_{1/2} = \arg \min ||y-X\beta||^2 + \lambda ||\beta||_{1/2}. ]

This penalty is non-convex, and allows for more shrinkage of the small coefficients without biasing the larger coefficients as much. I also should have noted above that the empirical Bayes estimate of $\lambda$ that yielded these $L_{1/2}$-penalized estimates was 10.17, which is different than the value used for the lasso estimates.

More on the correspondence between $L_q$ penalties and the Hadamard product parametrization can be found in my article arXiv:1611.00040. The article also includes

- a comparison to other optimization methods;
- an algorithm for penalized logistic regression;
- a new penalty for spatially structured sparsity.