You use grid search for hyperparameter optimization and state that at some point you would like to add a Bayesian approach. One simple change that could boost performance would be to use random search in place of grid search. Grid search is known to perform worse than random search when not all hyperparameters are of similar importance [1]. Intuitively, grid search spends many evaluations re-testing the same setting of an important hyperparameter while only varying the unimportant ones.
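For concreteness, a minimal sketch of that swap using scikit-learn (illustrative only - I'm assuming a scikit-learn estimator and search space here, not your actual search code):

    from scipy.stats import randint
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    X, y = load_iris(return_X_y=True)

    # Instead of an exhaustive grid, sample n_iter configurations at random.
    # With the same budget, the important hyperparameters get many distinct
    # values instead of being pinned to a few grid points.
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions={
            "n_estimators": randint(50, 500),
            "max_depth": randint(2, 20),
            "min_samples_leaf": randint(1, 10),
        },
        n_iter=30,   # same budget a small grid would use
        cv=5,
        random_state=0,
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)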
It depends on the number of evaluations: the more evaluations, the stronger the model built by the TPE algorithm becomes. With very few evaluations, we would expect TPE to match random sampling. This effect can be seen, for example, in the plots of the "Bayesian Optimization and Hyperband" paper [1, 2], where the plotted "Bayesian Optimization" approach is TPE.
Also, there might be model bias: for example, if the objective function is stochastic (e.g., a reinforcement learning algorithm that only converges sometimes) or not very smooth, TPE might exploit areas that are not actually good based on one lucky evaluation. In those cases TPE can perform worse than random search! To alleviate the effect of model bias in model-based hyperparameter optimization (e.g., TPE) and to obtain convergence guarantees, people often sample every k-th hyperparameter setting from a prior distribution instead (i.e., random search); this is also the case for the plots in [1, 2].
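A toy sketch of that interleaving idea (the model-based proposal below is just a stand-in for TPE, and all names and numbers are made up):

    import random

    def objective(x):
        """Toy stochastic objective; stands in for e.g. an RL training run."""
        return (x - 0.3) ** 2 + random.gauss(0, 0.05)

    def sample_prior():
        """Random search: draw from the prior over the search space."""
        return random.uniform(0.0, 1.0)

    def model_based_proposal(history):
        """Stand-in for TPE: perturb the best configuration seen so far."""
        best_x, _ = min(history, key=lambda h: h[1])
        return min(1.0, max(0.0, best_x + random.gauss(0, 0.1)))

    history = []
    K = 4  # every K-th evaluation comes from the prior, guarding against model bias
    for i in range(40):
        if i < 3 or i % K == 0:
            x = sample_prior()
        else:
            x = model_based_proposal(history)
        history.append((x, objective(x)))

    print(min(history, key=lambda h: h[1]))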
If you are wondering which HPO algorithm you should use (or about HPO in general), I would highly recommend the first part of the AutoML tutorial at NeurIPS 2018 [3], given by my advisor.
Please consider changing instances of "Generates native Python code" to something like "Generates standard Python code without third-party library dependencies" or equivalent. The term "native" here is not correct and is confusing.
Congratulations on launching! I'm also working on an open-source AutoML solution and I have a similar problem with finding good heuristics for column type inference - I've always left the final choice to the user.
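For reference, a rough sketch of the kind of heuristic I mean (the thresholds and the function name are made up, this isn't code from either project):

    import pandas as pd

    def infer_column_type(series: pd.Series, cat_ratio: float = 0.05) -> str:
        """Very rough column type guess; a user override should still win."""
        non_null = series.dropna()
        if non_null.empty:
            return "unknown"
        # Try numeric first: if almost everything parses, call it numeric.
        as_num = pd.to_numeric(non_null, errors="coerce")
        if as_num.notna().mean() > 0.95:
            return "numeric"
        # Then datetime.
        as_dt = pd.to_datetime(non_null, errors="coerce")
        if as_dt.notna().mean() > 0.95:
            return "datetime"
        # Few distinct values relative to length -> categorical, else free text.
        if non_null.nunique() / len(non_null) < cat_ratio:
            return "categorical"
        return "text"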
Have you compared the solution accuracy with other frameworks? (or is it not the priority for your package right now)
Do you have early stopping implemented?
I will play with your package and come back with more questions.
> Have you compared the solution accuracy with other frameworks? (or is it not the priority for your package right now)
Still working on that.
> Do you have early stopping implemented?
Early stopping is intentionally not implemented in order to keep comparisons apples-to-apples between trials by doing a full run for each, but I'm open to reconsidering that, or at least to adding a user-set option for early stopping.
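If it does become a user-set option, I'd picture something roughly like this (purely hypothetical, not the package's actual API):

    def fit_trial(model, train_fn, eval_fn, max_epochs=100,
                  early_stopping=False, patience=10):
        """Run one trial; with early_stopping=False every trial gets a full run,
        which keeps trials directly comparable."""
        best_score, best_epoch = float("-inf"), 0
        for epoch in range(max_epochs):
            train_fn(model, epoch)
            score = eval_fn(model)
            if score > best_score:
                best_score, best_epoch = score, epoch
            elif early_stopping and epoch - best_epoch >= patience:
                break  # opt-in: stop this trial after `patience` epochs without improvement
        return best_score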
This is extremely cool/useful. Really appreciated how well the project is explained/motivated, e.g. the DESIGN.md file. Thanks for making it public!
Question about the long-term vision: do you see this as potentially being an educational artifact? Seems like just reading through the code could be a great way to get familiar with good ETL practices, good vectorization practices, etc.
Whenever anyone asks this, I always wonder whether I live in a bubble or they do.
Creating simple predictive models where your problem is already easily narrowed down to a "given x predict y" definition is pretty trivial. Having it automated is nice, but not exactly a hard thing to do.
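To be concrete about what I mean by trivial - once the problem is framed and the data is clean, it's a few lines of scikit-learn:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    # "Given X, predict y" once the framing and the cleaning are already done.
    X, y = load_breast_cancer(return_X_y=True)
    model = GradientBoostingClassifier(random_state=0)
    print(cross_val_score(model, X, y, cv=5).mean())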
Genuine question: how many people have jobs where those kinds of problems form any significant part of their workload?
I also often see a response to this sentiment along the lines of, "Yeah, but there's also data cleaning..." etc. My reaction to this is mixed. I mean, sure, there is also data cleaning involved, but is this really where people spend most of their time?
My team spends most of our time doing the following:
1. Formulating problems. Figuring out the various different ways that a real-world problem can be expressed mathematically and feasibly attacked computationally.
2. Engineering software to implement the solutions to these problems, sometimes using some of the (amazing) frameworks out there for ML or probabilistic programming, but often having to develop our own approaches from scratch.
3. Doing all the management, stakeholder relationship stuff, business cases, etc. that make your work relevant and possible.
4. Getting data. Always an issue.
I'm very genuine in my curiosity here: are we total snowflakes, and most data scientists spend their time cleaning data and building "given X predict y" models?
How many business analysts/low level coders have jobs because they just implement the same repeated CRUD screens/wireframes or maintain WordPress themes? Not the same as data science, but close.
I worked at a startup whose first service was "upload CSV, and we automatically generate interesting charts". That's not trivial, but it's not exactly rocket science, either. The main trouble we had is that in order for this to work well (and it makes for a terrific demo), you need to start with a great CSV file. The average CSV file you find on the web isn't.
CSV is about the barest amount of specification in a file format. It's common to run across files which are in some weirdo encoding you can't easily detect, or are a mix of multiple encodings, or a mix of line endings, or which should be treated as case-insensitive (or only for some columns), or which have weird number formatting (or units, and not the same units in every row), or typos and spelling errors, or it came from an OCR'd PDF and there's "page 2" right in the middle of it, or they tried to combine multiple files together so there's multiple headers scattered throughout the file (or none at all), or the top has different columns from the bottom, or it uses quoting differently (obviously not per the RFC), or it's assumed that "nil"/"NULL"/""/"-"/"0" are the same, or ...
In short, data (which hasn't been cleaned by hand) sucks, and CSV doubly so. If you want to put your AI/ML smarts to work, write a program to take a shitty CSV file (or even better, a shitty PDF file!) and generate good clean data, plus a description of its schema. That would be an amazing tool.
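Even the deterministic part of that is a pile of work. A minimal sketch of just the defensive-reading step (assuming chardet is available for encoding detection), before any AI/ML gets involved:

    import csv
    import io

    import chardet   # assumption: chardet is installed for encoding detection
    import pandas as pd

    NULL_LIKE = ["", "-", "nil", "NULL", "N/A", "n/a"]

    def read_messy_csv(path: str) -> pd.DataFrame:
        """Best-effort read of a messy CSV: guess encoding and delimiter,
        normalize null-like strings. Units, duplicate headers, OCR debris,
        etc. still need far more work than this."""
        raw = open(path, "rb").read()
        encoding = chardet.detect(raw)["encoding"] or "utf-8"
        text = raw.decode(encoding, errors="replace")
        dialect = csv.Sniffer().sniff(text[:4096])  # guess delimiter/quoting
        return pd.read_csv(
            io.StringIO(text),
            sep=dialect.delimiter,
            na_values=NULL_LIKE,
            keep_default_na=True,
            engine="python",      # more tolerant of irregular rows
            on_bad_lines="skip",  # drop rows that don't parse at all
        )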
So far, OpenRefine is the nicest tool for this that I've seen. Figure out how to make it fully automatic, and everybody with piles of raw data (governments) will beat a path to your door.
In my mind, that is a hard question to assess. Hopefully, more tools will emerge for automating data cleaning and ETL, such as handling third-party data schemas and integrations, data errors, and so forth. I don't think this means data science jobs will be eliminated. Rather, one data scientist should be able to handle integrating more data sources and investigating a larger number of models. New and novel data will continuously present itself, though, and if the total volume of data continues to grow at its current rate, these tools might just allow a data scientist to keep pace with the growth of data. Hopefully, though, progress allows everyone to focus on higher-level tasks rather than cleaning data and building pipelines.
[1] Hyperparameter Optimization (Section 1.3.1), in: Automatic Machine Learning: Methods, Systems, Challenges.
https://www.automl.org/wp-content/uploads/2018/11/hpo.pdf