Hi everyone. I'm the author. There's one more thing I wanted to add: a good reason you should try using some sort of hyperparameter search, even if you think it's a complete waste of time and compute, is reproducibility.
This probably applies more to open-source academic contributions, where you're trying to help your fellow practitioners recreate and use your models, as opposed to a corporate setting, where reproducibility would be the equivalent of getting fired.
Recently, I was trying to train a ResNet to beat the top Stanford DAWNBench entry (spoiler alert: I did, but by less than a second). Initially, I blindly tried manually tuning the learning rate, batch size, etc. without even reading the original model's guidelines.
After actually going through a blog post written by David C. Page (the guy with the top DAWNBench entry), I saw that he had tried varying the hyperparameters himself and that the ones set by default in the code were what he found to be optimal.
That saved me a lot of time and let me focus on other things like what hardware to use.
I think the lesson here is that if more researchers perform and publish the results of some basic hyperparameter optimization, it would really save the world a whole lot of epochs.
I enjoyed the article, and I know that writing these takes a nontrivial amount of time. That said, I think it would be wise to run these through a spell checker before publishing: it's a less-than-a-minute investment that pays off every time someone reads it.
> The heavier the ball, the quicker it falls. But if it’s too heavy, it can get stuck or overshoot the target.
This explanation of momentum is somewhere between misleading and wrong. Momentum is about inertia and acceleration, i.e., the ability to quickly change speed.
A misleading experiment: "the heavier the feather, the quicker it falls" is obviously true; steel feathers are useless. The same is true for balls (except in the exceptional situation of a perfect vacuum); it's just that drag and air currents don't influence balls all that much at low speeds.
I realize this is classic old-man-yells-at-cloud, but I don't understand why every online article these days, even the technical ones, needs to have a giant "amusing" gif every two paragraphs. Do people not pay attention otherwise?
I noped out of there after seeing those and the Terminator reference in the first paragraph. Maybe I'm not the target audience...
Seems like the vast majority of the DL articles that make it to the front of HN are just fluff. Nothing for DL practitioners, just 'hey look, I can import tensorflow'.
There are also several papers and blog posts diving into details and tradeoffs of different Bayesian optimization approaches and components here [0]. Example: Covariance Kernels for Avoiding Boundaries [1]
I always appreciate articles emphasizing the importance of hyperparameter optimization; thank you for writing this. The discussion of the learning rate is a nice additional point to mention, though I find it a bit misleading -- earlier in the discussion you mention a number of hyperparameters, but then the learning rate is studied in a vacuum. If other hyperparameters were varied along with the learning rate, I assume those graphics would look much more complicated.
Additionally, practical circumstances for hyperparameter tuning using Bayesian optimization often include complications: dealing with discrete hyperparameters, large parameter spaces being unreasonably costly or poorly modeled, accounting for uncertainty in your metric, balancing competing metrics, and black-box constraints. Obviously, one cannot mention everything in a blog post; I just wanted to bring up that outstanding researchers in Bayesian optimization are pushing forward on all of these topics.
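As a small illustration of the first of those complications, here is a minimal sketch of how a mixed continuous/discrete/categorical search space can be encoded, using scikit-optimize's space objects as one example library; the dimension names and ranges are made up:

```python
# Hypothetical mixed search space encoded with scikit-optimize's space objects.
# The names and ranges below are illustrative, not recommendations.
from skopt.space import Real, Integer, Categorical

search_space = [
    Real(1e-5, 1e-1, prior="log-uniform", name="learning_rate"),
    Integer(16, 512, name="batch_size"),
    Categorical(["adam", "sgd", "rmsprop"], name="optimizer"),
]
```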
Regardless, thank you for continuing to hammer home the value of hyperparameter optimization. If I may, a couple links, for anyone trying to learn more:
I'm using NNI[1] with decent success for hyperparameter optimization. It implements a number of different approaches, from a simple random search to a Tree-structured Parzen Estimator (TPE) and specialized algorithms for automatically designing networks.
It's very powerful and gives you a lot of freedom (it can minimize/maximize the output of fundamentally any Python program). The main drawback is that you are on your own to figure out which parameters go well together: for example, using an assessor to stop underperforming attempts early is great for random search, but devastating for TPE. You inevitably spend some time tuning your hyperparameter tuner. It's still a big win in terms of human effort, at the expense of doing a lot more computing.
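For anyone curious, a trial script for NNI can be about as small as the sketch below; the parameter names and the train_and_eval function are placeholders for whatever your own search space and training code define (the tuner/assessor choice lives in the experiment config, not here):

```python
# Minimal sketch of an NNI trial script. NNI hands the trial the
# hyperparameters chosen by the tuner via get_next_parameter(), and the
# trial reports its metric back so the tuner (and any assessor) can decide
# what to try, or stop, next.
import nni

def train_and_eval(lr, batch_size):
    # Placeholder: build the model, train it, return validation accuracy.
    return 0.0

if __name__ == "__main__":
    params = nni.get_next_parameter()   # e.g. {"lr": 0.01, "batch_size": 64}
    acc = train_and_eval(params["lr"], params["batch_size"])
    nni.report_final_result(acc)        # metric the tuner optimizes
```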
Bayesian parameter estimation typically trains an emulator to reproduce the objective function using a limited number of design points (order 10 per dimension). Once the emulator is trained, you could of course use a multi-dimensional minimization routine of your choice to find the best-fit point.
However, constructing and sampling the Bayesian posterior using MCMC methods has several advantages. Sometimes you have a local minimum which is essentially flat, so the optimal hyperparameter is unstable; you'll see this in the posterior distribution. Or you could have two parameters which are correlated, so it's their sum that's constrained, not their individual values. All this information provides important context when understanding your model's uncertainty.
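To make that workflow concrete, here is a self-contained toy sketch: fit a Gaussian-process emulator to a handful of design points, then sample a posterior built from the emulated loss with a plain Metropolis walker. The toy objective, box prior, and temperature are illustrative choices on my part, not a fixed recipe:

```python
# Emulator + MCMC sketch on a toy 2-D objective (illustrative only).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

# Design points: (hyperparameters, observed loss), roughly 10 per dimension.
X = rng.uniform(-3, 3, size=(20, 2))
y = np.sum(X**2, axis=1) + 0.1 * rng.normal(size=20)

emulator = GaussianProcessRegressor().fit(X, y)

def log_posterior(theta, temperature=0.5):
    # Flat prior on the box [-3, 3]^2; likelihood ~ exp(-emulated loss / T).
    if np.any(np.abs(theta) > 3):
        return -np.inf
    return -emulator.predict(theta.reshape(1, -1))[0] / temperature

# Plain Metropolis sampling. Flat minima or correlated parameters show up
# as broad or tilted clouds in these samples.
theta, samples = np.zeros(2), []
for _ in range(5000):
    proposal = theta + 0.3 * rng.normal(size=2)
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta.copy())
samples = np.array(samples)
```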
Thanks. Do I understand right that the Bayesian Gaussian-process things people do here use only the fully trained loss as input, i.e. just one number L(W), being minimised over W? As opposed to something more detailed about the model, viewed as generating probabilities perhaps, or having training history.
Big nearly-flat areas aren't really a new feature of hyperparameter problems... I guess the exact choice of algorithm would depend on how common they are, and maybe Nelder-Mead would be a poor choice. (And I'm not sure how easy it is to parallelise.)
This NVIDIA post goes into extending Bayesian optimization to multiple metrics [0]. It shows how you can use efficient optimization to find a good Pareto frontier [1].
DFO is derivative-free optimization. With multiple objectives, you try to find different solutions given different weightings of the objectives for the Pareto front, and pick one depending on the domain.
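As a toy illustration of that idea (the trial results below are made up), the non-dominated points that form the Pareto front can be extracted like this:

```python
# Extract the Pareto front from trial results when both metrics are
# "higher is better" (e.g. accuracy and throughput). Values are made up.
import numpy as np

points = np.array([
    [0.91, 120], [0.93, 80], [0.90, 200], [0.95, 40], [0.92, 150],
])

def pareto_front(pts):
    # A point is dominated if some other point is >= in every metric and
    # strictly better in at least one.
    keep = []
    for i, p in enumerate(pts):
        dominated = np.any(np.all(pts >= p, axis=1) & np.any(pts > p, axis=1))
        if not dominated:
            keep.append(i)
    return pts[keep]

print(pareto_front(points))
```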
It requires a massive amount of computing power; otherwise, theoretically, you should be able to explore different optimizations automatically. Even then, validation is still hard and time-consuming.
It sounds like an easy way to increase performance, but really, exploring the hyperparameter space is likely done more efficiently manually at first, and only automatically once you have figured out how to distribute the work.
I am working on a little python framework to efficiently distribute hyperparameter search on a Spark cluster. We haven't released the first version yet but will do so in the next two weeks. https://github.com/logicalclocks/maggy
A limitation of existing hyperparameter search algorithms is that they are typically stage- or generation-based. For example, if genetic algorithms are used for hyperparameter search, one has to wait for all models to finish in order to generate a new generation of candidate parameters from the best-performing individuals. However, some instances will have suboptimal parameters during a given iteration and will know quickly during training that they can stop early. Hence, the early-stopped machine can't be given a new set of parameters right away and instead sits idle.
Compared to stage-based algorithms like genetic optimization, maggy (the framework) will support asynchronous algorithms that are able to provide new candidate sets of parameters as soon as a worker finishes evaluating a combination, without waiting until all models in one stage finish. For this to be possible, we establish communication between the driver and executors in Spark. The driver collects performance metrics during training, which enables us to stop badly performing models early and reassign the executor a new, more promising set of parameters (a new trial) right away, instead of waiting for the stage to finish.
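To illustrate the asynchronous idea in isolation (this is not maggy's API, just a generic stand-alone Python sketch with random search standing in for the tuner), each worker is handed a fresh trial the moment it finishes, so nothing sits idle at a stage boundary:

```python
# Generic asynchronous trial scheduling: refill a worker as soon as it
# finishes, instead of waiting for a whole stage/generation to complete.
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def suggest_trial():
    # Placeholder "tuner": real systems would use TPE, a GA, etc.
    return {"lr": 10 ** random.uniform(-5, -1),
            "batch_size": random.choice([32, 64, 128])}

def run_trial(params):
    # Placeholder training job returning a score for the given parameters.
    return -abs(params["lr"] - 0.01)

n_workers, budget = 4, 20
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    pending = {pool.submit(run_trial, suggest_trial()) for _ in range(n_workers)}
    launched, results = n_workers, []
    while pending:
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for fut in done:
            results.append(fut.result())
            if launched < budget:          # hand out a new trial immediately
                pending.add(pool.submit(run_trial, suggest_trial()))
                launched += 1
print(max(results))
```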
Manual searching is time-consuming, since you need to wait for the results from each experiment. This becomes impossible when the number of hyperparameters is more than 8-10, and you will probably end up only tuning a few of them that you think are relevant. You'd also need a lot of experience tuning hyperparameters; otherwise your tuning is as good as random.
Given these disadvantages of manual tuning, "Bayesian Optimization" seems like the most promising technique: it needs far fewer "choose -> train -> eval" loops, as it uses the information from previous runs to select the next set of hyperparameters (similar to what humans would do).
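For a rough idea of what that loop looks like with scikit-optimize (one of several libraries that implement it), here is a sketch where train_and_eval stands in for your own training code and the learning-rate range is made up:

```python
# Sequential Bayesian optimization sketch: the Gaussian-process surrogate
# uses every previous run to pick the next learning rate to evaluate.
from skopt import gp_minimize
from skopt.space import Real

def train_and_eval(lr):
    # Placeholder: substitute your own training + validation code.
    return (lr - 0.01) ** 2

def objective(params):
    (lr,) = params
    return train_and_eval(lr)   # return a loss (lower is better)

result = gp_minimize(objective, [Real(1e-5, 1e-1, prior="log-uniform")], n_calls=30)
print(result.x, result.fun)     # best hyperparameters and best loss found
```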
It depends on how well the problem is understood. If the problem is your standard MNIST dataset, then sure, it could very well be a waste of time to sit around and serialize your manual hyperparameter search. For any new dataset, which may or may not be cleaned, there's much to be learned from iterating on a very small subset of the data; at that small scale it's much easier to get a handle on the major failings, such as encoding the wrong things or weight explosion.
Sure, it does; it's not trivial, though, and tedious to implement yourself. You could use Python libraries such as "scikit-optimize", which has an implementation of parallel Bayesian optimization (based on Gaussian processes); have a look at this: https://scikit-optimize.github.io/notebooks/bayesian-optimiz...
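A rough sketch of the parallel flavour with scikit-optimize's ask/tell interface, with a toy objective standing in for real training jobs:

```python
# Parallel Bayesian optimization sketch: ask for a batch of candidates,
# evaluate them in parallel, and tell the optimizer the results.
from concurrent.futures import ProcessPoolExecutor
from skopt import Optimizer
from skopt.space import Real

def objective(params):
    # Toy stand-in for a real training job; returns a loss to minimize.
    (lr,) = params
    return (lr - 0.01) ** 2

if __name__ == "__main__":
    opt = Optimizer([Real(1e-5, 1e-1, prior="log-uniform", name="lr")])
    with ProcessPoolExecutor(max_workers=4) as pool:
        for _ in range(8):                    # 8 rounds of 4 parallel evaluations
            candidates = opt.ask(n_points=4)  # batch of points to try next
            losses = list(pool.map(objective, candidates))
            opt.tell(candidates, losses)      # feed results back to the surrogate
    print(min(opt.yi))
```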
I've done hyperparameter searches manually, they're widely used in academic labs ("hyperparameter descent by grad student"), and I've also done a bit of hyperparameter automatic search, but I can't see what you meant.
"hyperparameter descent by grad student" is a lot more efficient at first, much of the time the loss function has caveats which make it easy to fall into parts of the search space which don't actually accomplish the task (for example when empty frame = true reduces the loss) something a grad student would easily figure out. Until you get to the point where you are fairly certain the search is within the right space its hard to ensure that throwing a lot of compute at that search will yield anything useful.