Hi everyone. I'm the author. There's one more thing I wanted to add: a good reason you should try using some sort of hyperparameter search, even if you think it's a complete waste of time and compute, is reproducibility.
This probably applies more to open-source academic contributions, where you're trying to help your fellow practitioners recreate and use your models, as opposed to a corporate setting, where reproducibility would be the equivalent of getting fired.
Recently, I was trying to train a ResNet to beat the top Stanford DAWNBench entry (spoiler alert: I did, but by less than a second). Initially, I blindly tried manually tuning the learning rate, batch size, etc. without even reading the original model's guidelines.
After actually going through a blog post written by David C. Page (the guy with the top DAWNBench entry), I saw that he had tried varying the hyperparameters himself and that the ones set by default in the code were what he found to be optimal.
That saved me a lot of time and let me focus on other things like what hardware to use.
I think the lesson here is that if more researchers perform and publish the results of some basic hyperparameter optimization, it would really save the world a whole lot of epochs.
I enjoyed the article, and I know that writing these takes a nontrivial amount of time. That said, I think it would be wise to run these through a spell checker before publishing: it's a less-than-a-minute investment that pays off every time someone reads it.
> The heavier the ball, the quicker it falls. But if it’s too heavy, it can get stuck or overshoot the target.
This explanation of momentum is somewhere between misleading and wrong. Momentum is about inertia and acceleration, i.e., the ability to quickly change speed.
A misleading experiment: "the heavier the feather, the quicker it falls" is obviously true; steel feathers are useless. The same is true for balls (except in the exceptional situation of a perfect vacuum); it's just that drag and air currents don't influence balls all that much at low speeds.
I realize this is classic old-man-yells-at-cloud, but I don't understand why every online article these days, even the technical ones, needs to have a giant "amusing" gif every two paragraphs. Do people not pay attention otherwise?
I noped out of there after seeing those and the Terminator reference in the first paragraph. Maybe I'm not the target audience...
Seems like the vast majority of the DL articles that make it to the front of HN are just fluff. Nothing for DL practitioners, just 'hey look, I can import tensorflow'.
There are also several papers and blog posts diving into details and tradeoffs of different Bayesian optimization approaches and components here [0]. Example: Covariance Kernels for Avoiding Boundaries [1]
I always appreciate articles emphasizing the importance of hyperparameter optimization; thank you for writing this. The discussion of the learning rate is a nice additional point to mention, though I find it a bit misleading -- earlier in the discussion you mention a number of hyperparameters, but then the learning rate is studied in a vacuum. If other hyperparameters were varied along with the learning rate, I assume those graphics would look much more complicated.
Additionally, practical circumstances for hyperparameter tuning using Bayesian optimization often include complications: dealing with discrete hyperparameters, large parameter spaces being unreasonably costly or poorly modeled, accounting for uncertainty in your metric, balancing competing metrics, and black-box constraints. Obviously, one cannot mention everything in a blog post; I just wanted to bring up that outstanding researchers in Bayesian optimization are pushing forward on all of these topics.
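As a small illustration of the first of those complications, here is a minimal sketch of how a mixed continuous/discrete/categorical search space can be encoded, using scikit-optimize's space objects as one example library; the dimension names and ranges are made up:

```python
# Hypothetical mixed search space encoded with scikit-optimize's space objects.
# The names and ranges below are illustrative, not recommendations.
from skopt.space import Real, Integer, Categorical

search_space = [
    Real(1e-5, 1e-1, prior="log-uniform", name="learning_rate"),
    Integer(16, 512, name="batch_size"),
    Categorical(["adam", "sgd", "rmsprop"], name="optimizer"),
]
```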
Regardless, thank you for continuing to hammer home the value of hyperparameter optimization. If I may, a couple links, for anyone trying to learn more:
I'm using NNI[1] with decent success for hyperparameter optimization. It implements a number of different approaches, from a simple random search to a Tree-structured Parzen Estimator (TPE) and specialized algorithms for automatically designing networks.
It's very powerful and gives you a lot of freedom (it can minimize/maximize the output of fundamentally any Python program). The main drawback is that you are on your own to figure out which parameters go well together: for example, using an assessor to stop underperforming attempts early is great for random search, but devastating for TPE. You inevitably spend some time tuning your hyperparameter tuner. It's still a big win in terms of human effort, at the expense of doing a lot more computing.
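For anyone curious, a trial script for NNI can be about as small as the sketch below; the parameter names and the train_and_eval function are placeholders for whatever your own search space and training code define (the tuner/assessor choice lives in the experiment config, not here):

```python
# Minimal sketch of an NNI trial script. NNI hands the trial the
# hyperparameters chosen by the tuner via get_next_parameter(), and the
# trial reports its metric back so the tuner (and any assessor) can decide
# what to try, or stop, next.
import nni

def train_and_eval(lr, batch_size):
    # Placeholder: build the model, train it, return validation accuracy.
    return 0.0

if __name__ == "__main__":
    params = nni.get_next_parameter()   # e.g. {"lr": 0.01, "batch_size": 64}
    acc = train_and_eval(params["lr"], params["batch_size"])
    nni.report_final_result(acc)        # metric the tuner optimizes
```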
Bayesian parameter estimation typically trains an emulator to reproduce the objective function using a limited number of design points (order 10 per dimension). Once the emulator is trained, you could of course use a multi-dimensional minimization routine of your choice to find the best-fit point.
However, constructing and sampling the Bayesian posterior using MCMC methods has several advantages. Sometimes you have a local minimum which is essentially flat, so the optimal hyperparameter is unstable; you'll see this in the posterior distribution. Or you could have two parameters which are correlated, so it's their sum that's constrained, not their individual values. All this information provides important context when understanding your model's uncertainty.
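To make that workflow concrete, here is a self-contained toy sketch: fit a Gaussian-process emulator to a handful of design points, then sample a posterior built from the emulated loss with a plain Metropolis walker. The toy objective, box prior, and temperature are illustrative choices on my part, not a fixed recipe:

```python
# Emulator + MCMC sketch on a toy 2-D objective (illustrative only).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

# Design points: (hyperparameters, observed loss), roughly 10 per dimension.
X = rng.uniform(-3, 3, size=(20, 2))
y = np.sum(X**2, axis=1) + 0.1 * rng.normal(size=20)

emulator = GaussianProcessRegressor().fit(X, y)

def log_posterior(theta, temperature=0.5):
    # Flat prior on the box [-3, 3]^2; likelihood ~ exp(-emulated loss / T).
    if np.any(np.abs(theta) > 3):
        return -np.inf
    return -emulator.predict(theta.reshape(1, -1))[0] / temperature

# Plain Metropolis sampling. Flat minima or correlated parameters show up
# as broad or tilted clouds in these samples.
theta, samples = np.zeros(2), []
for _ in range(5000):
    proposal = theta + 0.3 * rng.normal(size=2)
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta.copy())
samples = np.array(samples)
```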
Thanks. Do I understand right that the Bayesian Gaussian-process things people do here use only the fully trained loss as input, i.e. just one number L(W), being minimised over W? As opposed to something more detailed about the model, viewed as generating probabilities perhaps, or having training history.
Big nearly-flat areas aren't really a new feature of hyperparameter problems... I guess the exact choice of algorithm would depend on how common they are, and maybe Nelder-Mead would be a poor choice. (And I'm not sure how easy it is to parallelise.)
This NVIDIA post goes into extending Bayesian optimization to multiple metrics [0]. It shows how you can use efficient optimization to find a good Pareto frontier [1].
DFO is derivative-free optimization. With multiple objectives, you try to find different solutions given different weightings of the objectives for the Pareto front, and pick one depending on the domain.
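As a toy illustration of that idea (the trial results below are made up), the non-dominated points that form the Pareto front can be extracted like this:

```python
# Extract the Pareto front from trial results when both metrics are
# "higher is better" (e.g. accuracy and throughput). Values are made up.
import numpy as np

points = np.array([
    [0.91, 120], [0.93, 80], [0.90, 200], [0.95, 40], [0.92, 150],
])

def pareto_front(pts):
    # A point is dominated if some other point is >= in every metric and
    # strictly better in at least one.
    keep = []
    for i, p in enumerate(pts):
        dominated = np.any(np.all(pts >= p, axis=1) & np.any(pts > p, axis=1))
        if not dominated:
            keep.append(i)
    return pts[keep]

print(pareto_front(points))
```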
It requires a massive amount of computing power; otherwise, theoretically, you should be able to explore different optimizations automatically. Even then, validation is still hard and time-consuming.
It sounds like an easy way to increase performance, but really, exploring the hyperparameter space is likely done more efficiently manually at first, and only automatically once you have figured out how to distribute the work.
I am working on a little python framework to efficiently distribute hyperparameter search on a Spark cluster. We haven't released the first version yet but will do so in the next two weeks. https://github.com/logicalclocks/maggy
A limitation of existing hyperparameter search algorithms is that they are typically stage- or generation-based. For example, if genetic algorithms are used for hyperparameter search, one has to wait for all models to finish in order to generate a new generation of candidate parameters from the best-performing individuals. However, some instances will have suboptimal parameters during a given iteration and will know quickly during training that they can stop early. Hence, the early-stopped machine can't be given a new set of parameters right away and instead sits idle.
Compared to stage-based algorithms like genetic optimization, maggy (the framework) will support asynchronous algorithms that are able to provide new candidate sets of parameters as soon as a worker finishes evaluating a combination, without waiting until all models in one stage finish. For this to be possible, we establish communication between the driver and executors in Spark. The driver collects performance metrics during training, which enables us to stop badly performing models early and reassign the executor a new, more promising set of parameters (a new trial) right away, instead of waiting for the stage to finish.
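To illustrate the asynchronous idea in isolation (this is not maggy's API, just a generic stand-alone Python sketch with random search standing in for the tuner), each worker is handed a fresh trial the moment it finishes, so nothing sits idle at a stage boundary:

```python
# Generic asynchronous trial scheduling: refill a worker as soon as it
# finishes, instead of waiting for a whole stage/generation to complete.
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def suggest_trial():
    # Placeholder "tuner": real systems would use TPE, a GA, etc.
    return {"lr": 10 ** random.uniform(-5, -1),
            "batch_size": random.choice([32, 64, 128])}

def run_trial(params):
    # Placeholder training job returning a score for the given parameters.
    return -abs(params["lr"] - 0.01)

n_workers, budget = 4, 20
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    pending = {pool.submit(run_trial, suggest_trial()) for _ in range(n_workers)}
    launched, results = n_workers, []
    while pending:
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for fut in done:
            results.append(fut.result())
            if launched < budget:          # hand out a new trial immediately
                pending.add(pool.submit(run_trial, suggest_trial()))
                launched += 1
print(max(results))
```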
Manual searching is time-consuming, since you need to wait for the results from each experiment. This becomes impossible when the number of hyperparameters is more than 8-10, and you will probably end up only tuning a few of them that you think are relevant. You'd also need a lot of experience tuning hyperparameters; otherwise your tuning is as good as random.
Given these disadvantages of manual tuning, "Bayesian Optimization" seems like the most promising technique: it needs far fewer "choose -> train -> eval" loops, as it uses the information from previous runs to select the next set of hyperparameters (similar to what humans would do).
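For a rough idea of what that loop looks like with scikit-optimize (one of several libraries that implement it), here is a sketch where train_and_eval stands in for your own training code and the learning-rate range is made up:

```python
# Sequential Bayesian optimization sketch: the Gaussian-process surrogate
# uses every previous run to pick the next learning rate to evaluate.
from skopt import gp_minimize
from skopt.space import Real

def train_and_eval(lr):
    # Placeholder: substitute your own training + validation code.
    return (lr - 0.01) ** 2

def objective(params):
    (lr,) = params
    return train_and_eval(lr)   # return a loss (lower is better)

result = gp_minimize(objective, [Real(1e-5, 1e-1, prior="log-uniform")], n_calls=30)
print(result.x, result.fun)     # best hyperparameters and best loss found
```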
It depends on how well the problem is understood. If the problem is your standard MNIST dataset, then sure, it could very well be a waste of time to sit around and serialize your manual hyperparameter search. For any new dataset, which may or may not be cleaned, there's much to be learned from iterating on a very small subset of the data; at that small scale it's much easier to get a handle on the major failings, such as encoding the wrong things or weight explosion.
Sure, it does; it's not trivial, though, and tedious to implement yourself. You could use Python libraries such as "scikit-optimize", which has an implementation of parallel Bayesian optimization (based on Gaussian processes); have a look at this: https://scikit-optimize.github.io/notebooks/bayesian-optimiz...
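A rough sketch of the parallel flavour with scikit-optimize's ask/tell interface, with a toy objective standing in for real training jobs:

```python
# Parallel Bayesian optimization sketch: ask for a batch of candidates,
# evaluate them in parallel, and tell the optimizer the results.
from concurrent.futures import ProcessPoolExecutor
from skopt import Optimizer
from skopt.space import Real

def objective(params):
    # Toy stand-in for a real training job; returns a loss to minimize.
    (lr,) = params
    return (lr - 0.01) ** 2

if __name__ == "__main__":
    opt = Optimizer([Real(1e-5, 1e-1, prior="log-uniform", name="lr")])
    with ProcessPoolExecutor(max_workers=4) as pool:
        for _ in range(8):                    # 8 rounds of 4 parallel evaluations
            candidates = opt.ask(n_points=4)  # batch of points to try next
            losses = list(pool.map(objective, candidates))
            opt.tell(candidates, losses)      # feed results back to the surrogate
    print(min(opt.yi))
```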
I've done hyperparameter searches manually, they're widely used in academic labs ("hyperparameter descent by grad student"), and I've also done a bit of hyperparameter automatic search, but I can't see what you meant.
"hyperparameter descent by grad student" is a lot more efficient at first, much of the time the loss function has caveats which make it easy to fall into parts of the search space which don't actually accomplish the task (for example when empty frame = true reduces the loss) something a grad student would easily figure out. Until you get to the point where you are fairly certain the search is within the right space its hard to ensure that throwing a lot of compute at that search will yield anything useful.