# Who needs gradient descent? Getting to grips with Nevergrad.

## Introduction

Nevergrad is an interesting and quite wonderful Python library that I’ve grown
fond of over the last few months. It provides tooling to run numerical
optimization *without* having to rely on computing the gradient of the function
you’re optimizing (in stark contrast to the current algorithm *du jour* of
the deep learning world, gradient
descent). Think
evolutionary algorithms, particle
swarm optimization
and so on.

I like a few things about `nevergrad`

, and it’s since become my go-to tool for ad-hoc
optimization problems, mostly due to how easy it has been to drop into
my code. So, today, I’d like to talk about `nevergrad`

for a bit.

## What I really like about Nevergrad

My appreciation for `nevergrad`

boils down to two things:

- Easy of use
- Flexibility

As you’ll see in the example latter on, `nevergrad`

is surprisingly quick to
integrate into your existing code. From an end-user’s point of view, it’s almost
impossible to overstate how important it is to minimize the time between “zero”
and “Hello world!". To begin optimizing with `nevergrad`

, it takes only a couple
lines of code. I *love* this aspect. This emphasis on designing the API so that
you can be productive immediately is something I increasingly appreciate in
software that chooses to optimize developer productivity. Full marks.

Despite this, `nevergrad`

is also relatively flexible. Utilizing “gradient-free”
optimization means you don’t need to worry about having your code be
differentiable by some opaque backend. This allows you to wrap *any* function
you wish, with no additional modification required on your part. As long as you
output some kind of computed loss that you want minimized, you can use `nevergrad`

. No
matter the control flow. This makes it especially handy when you’re trying to
figure out hyperparameters for a certain process – you don’t need to modify
your code to make it compatible. Just wrap `nevergrad`

around it and away you
go. For example, I once
wrapped a process that involved training an ML model, followed by running an ODE
numerical solver, before finally calculating a set of metrics, each with their
own set of tunable parameters. I was able to find a close-to-optimal solution
using `nevergrad`

without having to modify any of the process code.

## A basic example

As always, an example is (usually) best. Imagine you want to find the minimum point of the following parabola:

which has the following expression: $$ y = (x - 2)^2. $$

With `nevergrad`

, all we need to do is this:

```
import nevergrad as ng
def square_function(x):
return (x-2)**2
x_arg = ng.p.Scalar() # Define our argument to the function
instru = ng.p.Instrumentation(x_arg) # Package it as "instrumentation"
optimizer = ng.optimizers.OnePlusOne(parametrization=instru, budget=100) # Choose an optimizer, passing through our instrumentation
recommendation = optimizer.minimize(square_function) # Find the minima!
print(recommendation.value)
# >>> ((1.9998271159560648,), {})
```

In other words,

- Define our function.
- Define our input arguments to the function This happens to be the explicit way of defining our instrumentation. There is convenient short-hand as well. (there are lots to choose from, including integer values and choices!)
- Choose our optimization algorithm.
- Optimize.

Nice!

## A more practical example: hyperparameter optimization

A more practical example (and one where I originally discovered the value of
`nevergrad`

) is for hyperparameter optimization.

While working on my PhD I encountered the following problem. I wanted to predict a smooth curve (that I had measured) by adding together a number of scaled smooth kernels. And I would scale the kernels according to some input data.

So, the general process was “scale” each of the kernels according to data, then add them together to create the final smooth curve. The code looks something like this:

```
X, y = create_training_data(samples)
X = make_features(X_train)
kernels = make_kernels(num_kernels, size, width)
model = train_model(X, y, kernels)
y_hat = model.predict(X)
score = mean_squared_error(y, y_hat)
```

The model would give me the weights of each kernel, but:

- How many kernels should there be?
- What size should each kernel start as?
- What width should each kernel be?

Choosing the right (or wrong) values for the above parameters (number of kernels, their size and their width) would dramatically effect the accuracy of the final smooth curve.

I *could* try every possible combination of values, and choose the one that
works the best, but this is an incredibly time consuming process. I have
thousands of these smooth curves I want to predict, and training the model takes
too long to make this an attractive option. So let’s use `nevergrad`

to find a
optimal set of hyperparameters instead.

First, we need to wrap our above code in a function that we can optimize over:

```
def calculate_model_score(num_kernels, size, width):
X, y = create_training_data(samples)
X = make_features(X_train)
kernels = make_kernels(num_kernels, size, width)
model = train_model(X, y, kernels)
y_hat = model.predict(X)
score = mean_squared_error(y, y_hat)
```

And do our instrumentation and optimization as before: Notice here that we also further customize our instrumentation by setting upper and lower limits, and make sure the number of kernels is an integer.

```
num_kernels = ng.p.Scalar(lower=10, upper=60).set_integer_casting()
size = ng.p.Scalar(lower=0.001, upper=10)
shape = ng.p.Scalar(lower=0.0001, upper=0.01)
instru = ng.p.Instrumentation(num_kernels, size, shape)
optimizer = ng.optimizers.OnePlusPne(parametrization=instru, budget=300)
recommendation=optimizer.minimize(calculate_model_score)
print(recommendation.value[0])
# >>> (53, 1.8828894598055905, 0.004852569379631386)
```

And there we have it, 53 kernels with a size of 1.882 and width of 0.0048 gives us our best result. Note:

- We didn’t have to modify any of the logic in our code (just wrap it in a self-containing function)
- We had to add ~7 independent lines of code (less, if you want to be more terse).

It’s refreshing to have a tool this useful integrate so easily and seamlessly
into your existing code, and this is often why I find myself reaching
for `nevergrad`

instead of refactoring code to work with something else, like
`jax`

. Sometimes some quick-and-dirty optimization using an evolutionary
algorithm can be good enough (not to mention *fast*). Sure, this particular
example isn’t the ideal candidate for gradient descent, but refactoring so that
these kernel parameters become *model parameters*, and then optimizing using
`jax`

and gradient descent *is* something that I’ve done before. It worked too
– but it took a lot longer than simply plugging in `nevergrad`

.

Oh, I’m we’re barely scratching the surface of what `nevergrad`

can do,
particularly when interacting with ML
workflows.

## Closing thoughts

And that’s `nevergrad`

, the optimization library that doesn’t use gradient
descent. It’s easy to use (and easy to integrate), very flexible and has a
decent API. I happen to like it a fair deal, and expect to use it more
frequently in the future. If it sounds like something you’re interested, I can
highly recommend it. Check out the
documentation, if you
want to read more.

Till next time,

Michael.