Who needs gradient descent? Getting to grips with Nevergrad.



Nevergrad is an interesting and quite wonderful Python library that I’ve grown fond of over the last few months. It provides tooling to run numerical optimization without having to rely on computing the gradient of the function you’re optimizing (in stark contrast to the current algorithm du jour of the deep learning world, gradient descent). Think evolutionary algorithms, particle swarm optimization and so on.

I like a few things about nevergrad, and it’s since become my go-to tool for ad-hoc optimization problems, mostly due to how easy it has been to drop into my code. So, today, I’d like to talk about nevergrad for a bit.

What I really like about Nevergrad

My appreciation for nevergrad boils down to two things:

  1. Ease of use
  2. Flexibility

As you’ll see in the example later on, nevergrad is surprisingly quick to integrate into your existing code. From an end-user’s point of view, it’s almost impossible to overstate how important it is to minimize the time between “zero” and “Hello world!”. To begin optimizing with nevergrad, it takes only a couple of lines of code. I love this aspect. This emphasis on designing the API so that you can be productive immediately is something I increasingly appreciate in software that chooses to optimize developer productivity. Full marks.

Despite this, nevergrad is also relatively flexible. Utilizing “gradient-free” optimization means you don’t need to worry about having your code be differentiable by some opaque backend. This allows you to wrap any function you wish, with no additional modification required on your part. As long as you output some kind of computed loss that you want minimized, you can use nevergrad. No matter the control flow. This makes it especially handy when you’re trying to figure out hyperparameters for a certain process – you don’t need to modify your code to make it compatible. Just wrap nevergrad around it and away you go. For example, I once wrapped a process that involved training an ML model, followed by running an ODE numerical solver, before finally calculating a set of metrics, each with their own set of tunable parameters. I was able to find a close-to-optimal solution using nevergrad without having to modify any of the process code.
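To make “gradient-free” concrete, here is a minimal sketch of the kind of loop such an optimizer runs: a (1+1) evolution strategy written in plain Python. It illustrates the general idea only and is not nevergrad’s implementation; `bumpy_loss`, `one_plus_one` and the step-size factors are names and values made up for this example.

```python
import random


def bumpy_loss(x):
    """A loss with a branch and rounding, so it has no useful gradient."""
    if x < 0:
        return 10.0 + abs(x)       # penalty region for negative inputs
    return round((x - 2) ** 2, 3)  # rounding makes the surface piecewise constant


def one_plus_one(loss, x0, sigma=1.0, budget=200, seed=0):
    """Keep one parent, try one mutated child per step, keep whichever is better."""
    rng = random.Random(seed)
    best_x, best_loss = x0, loss(x0)
    for _ in range(budget):
        candidate = best_x + rng.gauss(0.0, sigma)  # random mutation of the parent
        candidate_loss = loss(candidate)
        if candidate_loss <= best_loss:
            best_x, best_loss = candidate, candidate_loss
            sigma *= 1.5  # success: take bolder steps
        else:
            sigma *= 0.9  # failure: take more careful steps
    return best_x, best_loss


best_x, best_loss = one_plus_one(bumpy_loss, x0=-5.0)
print(best_x, best_loss)
```

The loop only ever calls `loss` and compares two numbers, which is why the branch and the rounding inside `bumpy_loss` cause no trouble.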

A basic example

As always, an example is (usually) best. Imagine you want to find the minimum point of the following parabola:


which has the following expression: $$ y = (x - 2)^2. $$
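This one is easy enough to solve by hand, which gives us a known answer to check the optimizer against. Setting the derivative to zero, $$ \frac{dy}{dx} = 2(x - 2) = 0 \quad\Rightarrow\quad x = 2, $$ and since $$ \frac{d^2y}{dx^2} = 2 > 0, $$ the point at x = 2 (where y = 0) is the minimum.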

With nevergrad, all we need to do is this:

import nevergrad as ng

def square_function(x):
    return (x-2)**2

x_arg = ng.p.Scalar()  # Define our argument to the function
instru = ng.p.Instrumentation(x_arg)  # Package it as "instrumentation"

optimizer = ng.optimizers.OnePlusOne(parametrization=instru, budget=100)  # Choose an optimizer, passing through our instrumentation
recommendation = optimizer.minimize(square_function)  # Find the minimum!
print(recommendation.value)  # (positional args, keyword args) of the best point found

# >>> ((1.9998271159560648,), {})

In other words,

  1. Define our function.
  2. Define our input arguments to the function (there are lots to choose from, including integer values and choices!). This happens to be the explicit way of defining our instrumentation; there is a convenient shorthand as well.
  3. Choose our optimization algorithm.
  4. Optimize.


A more practical example: hyperparameter optimization

A more practical example (and one where I originally discovered the value of nevergrad) is for hyperparameter optimization.

While working on my PhD I encountered the following problem. I wanted to predict a smooth curve (that I had measured) by adding together a number of scaled smooth kernels. And I would scale the kernels according to some input data.


So, the general process was “scale” each of the kernels according to data, then add them together to create the final smooth curve. The code looks something like this:

X_train, y = create_training_data(samples)
X = make_features(X_train)

kernels = make_kernels(num_kernels, size, width)

model = train_model(X, y, kernels)
y_hat = model.predict(X)
score = mean_squared_error(y, y_hat)

The model would give me the weights of each kernel, but:

  1. How many kernels should there be?
  2. What size should each kernel start as?
  3. What width should each kernel be?

Choosing the right (or wrong) values for the above parameters (number of kernels, their size and their width) would dramatically affect the accuracy of the final smooth curve.

I could try every possible combination of values, and choose the one that works the best, but this is an incredibly time-consuming process. I have thousands of these smooth curves I want to predict, and training the model takes too long to make this an attractive option. So let’s use nevergrad to find an optimal set of hyperparameters instead.
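To put a rough number on “every possible combination”, here is a small sketch that counts the points in a modest grid. The bounds are the ones used for the search later in this post; the 20-point resolution for size and width is purely my assumption.

```python
from itertools import product

# Bounds match the search later in the post; the resolution is an assumption.
num_kernels_grid = range(10, 61)  # every integer from 10 to 60
size_grid = [0.001 + i * (10 - 0.001) / 19 for i in range(20)]
width_grid = [0.0001 + i * (0.01 - 0.0001) / 19 for i in range(20)]

grid = list(product(num_kernels_grid, size_grid, width_grid))
print(len(grid))  # 51 * 20 * 20 = 20400 model trainings per curve
```

Even at this coarse resolution that is 20,400 trainings for a single curve, against a budget of a few hundred function evaluations for the optimizer.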

First, we need to wrap our above code in a function that we can optimize over:

def calculate_model_score(num_kernels, size, width):
    X_train, y = create_training_data(samples)
    X = make_features(X_train)

    kernels = make_kernels(num_kernels, size, width)

    model = train_model(X, y, kernels)
    y_hat = model.predict(X)
    score = mean_squared_error(y, y_hat)
    return score  # the loss that nevergrad will minimize

And do our instrumentation and optimization as before. Notice that here we also further customize our instrumentation by setting upper and lower limits, and make sure the number of kernels is an integer.

num_kernels = ng.p.Scalar(lower=10, upper=60).set_integer_casting()
size = ng.p.Scalar(lower=0.001, upper=10)
width = ng.p.Scalar(lower=0.0001, upper=0.01)

instru = ng.p.Instrumentation(num_kernels, size, width)
optimizer = ng.optimizers.OnePlusOne(parametrization=instru, budget=300)
recommendation = optimizer.minimize(calculate_model_score)
print(recommendation.value[0])  # positional arguments only
# >>> (53, 1.8828894598055905, 0.004852569379631386)

And there we have it: 53 kernels with a size of about 1.88 and a width of about 0.0049 give us our best result. Note:

  1. We didn’t have to modify any of the logic in our code (just wrap it in a self-contained function).
  2. We had to add ~7 independent lines of code (fewer, if you want to be more terse).

It’s refreshing to have a tool this useful integrate so easily and seamlessly into your existing code, and this is often why I find myself reaching for nevergrad instead of refactoring code to work with something else, like jax. Sometimes some quick-and-dirty optimization using an evolutionary algorithm can be good enough (not to mention fast). Sure, this particular example isn’t the ideal candidate for gradient descent, but refactoring so that these kernel parameters become model parameters, and then optimizing using jax and gradient descent is something that I’ve done before. It worked too – but it took a lot longer than simply plugging in nevergrad.

Oh, and we’re barely scratching the surface of what nevergrad can do, particularly when interacting with ML workflows.

Closing thoughts

And that’s nevergrad, the optimization library that doesn’t use gradient descent. It’s easy to use (and easy to integrate), very flexible and has a decent API. I happen to like it a fair deal, and expect to use it more frequently in the future. If it sounds like something you’re interested in, I can highly recommend it. Check out the documentation if you want to read more.

Till next time,