Show HN: Provide a CSV and a target field, generate a model and code to run it (github.com/minimaxir)
172 points by minimaxir on March 26, 2019 | 26 comments


Nice project!

You use grid search for hyperparameter optimization and state that at some point you would like to add a Bayesian approach. One simple change that could boost performance would be to use random search in place of grid search (see the scikit-learn sketch after the references). Grid search is known to perform worse than random search when not all hyperparameters are of similar importance [1]. Intuitively, grid search spends many evaluations on the same setting of an important hyperparameter while only varying the unimportant ones.

[1] Section 1.3.1, chapter "Hyperparameter Optimization", in Automated Machine Learning: Methods, Systems, Challenges:

https://www.automl.org/wp-content/uploads/2018/11/hpo.pdf
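
To illustrate, here is a minimal scikit-learn sketch of swapping GridSearchCV for RandomizedSearchCV. The estimator and search space are made up for the example; X and y are assumed to be your training data:

    from scipy.stats import randint
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    # Hypothetical search space; X and y are assumed to be your training data.
    search = RandomizedSearchCV(
        RandomForestClassifier(),
        param_distributions={
            "n_estimators": randint(50, 500),  # each trial draws independently
            "max_depth": randint(2, 20),
        },
        n_iter=40,  # 40 random draws instead of an exhaustive grid
        cv=5,
    )
    search.fit(X, y)
    print(search.best_params_)

Because every trial re-draws every hyperparameter, the important ones get 40 distinct values here, instead of only as many as the grid allots them.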


What about HyperOpt?


HyperOpt is a library for hyperparameter optimization, my comment was about algorithms. From the homepage of hyperopt:

"Currently two algorithms are implemented in hyperopt: Random Search Tree of Parzen Estimators (TPE)"

(TPE is a Bayesian Optimization algorithm)
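
For reference, basic TPE usage with hyperopt looks roughly like this (train_and_score is a hypothetical placeholder for whatever trains your model and returns a validation loss):

    from hyperopt import fmin, tpe, hp

    # `train_and_score` is a hypothetical placeholder: train a model with
    # the given hyperparameters and return a validation loss to minimize.
    def objective(params):
        return train_and_score(params)

    space = {
        "learning_rate": hp.loguniform("learning_rate", -7, 0),
        "num_layers": hp.choice("num_layers", [1, 2, 3]),
    }

    best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
    print(best)

Swapping algo=tpe.suggest for hyperopt's random-search suggest function switches between the two algorithms.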


Ah yes indeed, I thought that hyperopt was actually the name for the TPE algorithm. TPE should outperform random search in all cases, right?


It depends on the number of evaluations: the more evaluations, the stronger the model built by the TPE algorithm. With very few evaluations, we would expect TPE to match random sampling. This effect can be seen, for example, in the plots of the "Bayesian Optimization and Hyperband" paper [1, 2], where the plotted "Bayesian Optimization" approach is TPE.

Also, there might be model bias: for example, if the objective function is stochastic (e.g., a reinforcement learning algorithm that only converges sometimes) or not very smooth, TPE might exploit areas that are not actually good based on one good evaluation. In those cases TPE might perform worse than random search! To alleviate the effect of model bias in model-based hyperparameter optimization (e.g., TPE) and to obtain convergence guarantees, people often sample every k-th hyperparameter setting from a prior distribution, i.e., plain random search (this is also the case for the plots in [1, 2]); a sketch of that interleaving follows the references below.

If you are wondering which HPO algorithm you should use (or HPO in general), I would highly recommend the first part of the AutoML tutorial at NeurIPS2018 [3] given by my advisor.

[1] https://www.automl.org/blog_bohb/

[2] http://proceedings.mlr.press/v80/falkner18a.html

[3] https://nips.cc/Conferences/2018/Schedule?showEvent=10979
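
Here is that interleaving as a minimal sketch (objective, sample_prior, and model_suggest are hypothetical callables standing in for your loss function, your prior, and the model's suggestion step; this is not any particular library's API):

    def interleaved_search(objective, sample_prior, model_suggest, n_evals, k=3):
        """Every k-th configuration comes from the prior (plain random
        search); the rest come from the model (e.g., TPE). The random draws
        guard against model bias and retain random search's guarantees."""
        history = []  # (config, score) pairs seen so far
        for i in range(n_evals):
            if i % k == 0:
                config = sample_prior()          # unbiased draw from the prior
            else:
                config = model_suggest(history)  # model-based suggestion
            history.append((config, objective(config)))
        return min(history, key=lambda t: t[1])  # best (lowest-loss) config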


Please consider changing instances of "Generates native Python code" to something like "Generates standard Python code with no third-party library dependencies" or equivalent. The term "native" here is incorrect and confusing.


Congratulations on launching! I'm also working on an open-source AutoML solution, and I have a similar problem with finding good heuristics for column type inference; I've always left the final choice to the user.
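
For what it's worth, the kind of toy heuristic I mean looks like this (a sketch of the general idea, not automl-gs's actual logic; the cardinality threshold is arbitrary):

    import pandas as pd

    def infer_column_type(series: pd.Series, cat_threshold: int = 10):
        """Toy heuristic: guess numeric / categorical / datetime / text."""
        if pd.api.types.is_numeric_dtype(series):
            # Low-cardinality numeric columns often encode categories
            # (e.g., Titanic's Pclass), so don't treat them as continuous.
            if series.nunique() <= cat_threshold:
                return "categorical"
            return "numeric"
        try:
            # If a sample of values parses as dates, call it a datetime.
            pd.to_datetime(series.dropna().head(100))
            return "datetime"
        except (ValueError, TypeError):
            pass
        return "categorical" if series.nunique() <= cat_threshold else "text"

The ambiguous cases (zip codes, IDs, year columns) are exactly where I ended up punting to the user.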

Have you compared the solution accuracy with other frameworks? (or is it not the priority for your package right now)

Do you have early stopping implemented?

I will play with your package and come back with more questions.


> Have you compared the solution accuracy with other frameworks? (or is it not the priority for your package right now)

Still working on that.

> Do you have early stopping implemented?

Early stopping is intentionally not implemented in order to keep things apples-to-apples between trials by doing a full run for each, but I'm open to reconsidering that, or at least to adding a user-set option for early stopping.
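
If a user-set option were added, the generated Keras code might gate it behind a flag like this (user_config, model, and the training variables are hypothetical stand-ins, not automl-gs's actual code):

    from tensorflow.keras.callbacks import EarlyStopping

    # `user_config`, `model`, and the training data are hypothetical
    # stand-ins for whatever the generated script defines.
    callbacks = []
    if user_config.get("early_stopping"):
        callbacks.append(EarlyStopping(monitor="val_loss", patience=5,
                                       restore_best_weights=True))

    model.fit(x_train, y_train, validation_split=0.1,
              epochs=100, callbacks=callbacks)

Defaulting the flag to off would preserve the apples-to-apples full runs between trials.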


This is extremely cool/useful. Really appreciated how well the project is explained/motivated, e.g. the DESIGN.md file. Thanks for making it public!

Question about the long-term vision: do you see this as potentially being an educational artifact? Seems like just reading through the code could be a great way to get familiar with good ETL practices, good vectorization practices, etc.


The titanic data is a nice demo, but I’m kinda disappointed you didn’t try to predict WWMWD? Longer demo gif please!


Please ease up on acronyms or add the expanded form in parentheses. I have no idea what WWMWD is even after googling for it for a while.


Cool project! Have you compared your results to Ludwig by Uber? It would be interesting to see where the results differ.

https://uber.github.io/ludwig/


There are some discussions on whether data scientists are going to be replaced by automatic tools in the near future.

Can this be considered an example of a tool that partially replaces the work done by a data scientist? At least it can save a lot of time.


Whenever anyone asks this, I always wonder whether I live in a bubble or they do.

Creating simple predictive models where your problem is already easily narrowed down to a "given x predict y" definition is pretty trivial. Having it automated is nice, but not exactly a hard thing to do.

Genuine question: how many people have jobs where those kinds of problems form any significant part of their workload?

I also often see a response to this sentiment along the lines of, "Yeah, but there's also data cleaning..." etc. My reaction to this is mixed. I mean, sure, there is also data cleaning involved, but is this really where people spend most of their time?

My team spends most of our time doing the following:

1. Formulating problems. Figuring out the various different ways that a real-world problem can be expressed mathematically and feasibly attacked computationally.

2. Engineering software to implement the solutions to these problems, sometimes using some of the (amazing) frameworks out there for ML or probabilistic programming, but often having to develop our own approaches from scratch.

3. Doing all the management, stakeholder relationship stuff, business cases, etc. that make your work relevant and possible.

4. Getting data. Always an issue.

I'm very genuine in my curiosity here: are we total snowflakes, and most data scientists spend their time cleaning data and building "given X predict y" models?


How many business analysts/low-level coders have jobs because they just implement the same repeated CRUD screens/wireframes or maintain WordPress themes? Not the same as data science, but close.


I think it's possible that many people's "cleaning data" has some overlap with your "Getting data".

I know for me I've had things like a bunch of scanned images of tables as "data". Turning that into something useful took a lot of time.

Whether this is "getting data" or "cleaning data" depends on perspectives and definitions.


Predictive modeling will be nearly automated (except in cases where manual feature engineering helps).

Data scientists will shift more of their attention to solution finding, data gathering, cleaning, ETL, and the business side.


I worked at a startup whose first service was "upload CSV, and we automatically generate interesting charts". That's not trivial, but it's not exactly rocket science, either. The main trouble we had is that in order for this to work well (and it makes for a terrific demo), you need to start with a great CSV file. The average CSV file you find on the web isn't.

CSV is about the barest amount of specification in a file format. It's common to run across files which are some weirdo encoding you can't easily detect, or are a mix of multiple encodings, or a mix of line endings, or which should be treated as case-insensitive (or only for some columns), or which have weird number formatting (or units, and not the same units in every row), or typos and spelling errors, or it came from an OCR'd PDF and there's "page 2" right in the middle of it, or they tried to combine multiple files together so there's multiple headers scattered throughout the file (or none at all), or the top has different columns from the bottom, or it uses quoting differently (obviously not per the RFC), or it's assumed that "nil"/"NULL"/""/"-"/"0" are the same, or ...

In short, data (which hasn't been cleaned by hand) sucks, and CSV doubly so. If you want to put your AI/ML smarts to work, write a program to take a shitty CSV file (or even better, a shitty PDF file!) and generate good clean data, plus a description of its schema. That would be an amazing tool.

So far, OpenRefine is the nicest tool for this that I've seen. Figure out how to make it fully automatic, and everybody with piles of raw data (governments) will beat a path to your door.
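
For the simpler failure modes, a best-effort loader can at least avoid crashing outright. A sketch, assuming the chardet library and pandas >= 1.3 (the na_values list and other choices are illustrative, not exhaustive):

    import chardet
    import pandas as pd

    def load_messy_csv(path):
        """Best-effort loader sketch; handles a few of the failure modes
        above. Assumes chardet and pandas >= 1.3 are available."""
        with open(path, "rb") as f:
            raw = f.read()
        enc = chardet.detect(raw)["encoding"] or "utf-8"  # guess the encoding
        return pd.read_csv(
            path,
            encoding=enc,
            encoding_errors="replace",       # mangle bad bytes, don't crash
            na_values=["nil", "NULL", "-"],  # unify spellings of "missing"
            on_bad_lines="skip",             # drop rows with wrong column counts
        )

Of course, this silently drops data; the hard part the parent describes (stray headers, mixed units, OCR artifacts) still needs semantic understanding, which is where the AI/ML smarts would come in.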


What was the name of that chart startup?

I have been thinking about making an OpenRefine-type tool for Python. Every time I do data cleaning in Python, it feels so repetitive.


In my mind, that is a hard question to assess. Hopefully, more tools will exist for automating data cleaning and ETL, such as handling third-party data schemas and integrations, data errors, and so forth. I don't think this means data science jobs will be eliminated. Rather, one data scientist should be able to handle integrating more data sources and investigating a larger number of models. New and novel data will continuously present itself, though, and if the total volume of data continues to grow at its current rate, these tools might just allow a data scientist to keep pace with the growth of data. Hopefully, though, progress allows everyone to focus on higher-level tasks versus cleaning data and building pipelines.


Modeling is just a small part of data science (the percentage of time I've spent modeling as a data scientist is in the single digits).

Automating modeling is a bit easier than automating the other parts, though.


This is fantastic! Thanks for putting this on GitHub. I am planning to build an R Shiny app for AutoML, and this gives me a good learning opportunity.


Very cool. Can you provide some details about how this tool architects models?


The best way to do that is IMO to look at the templates themselves: https://github.com/minimaxir/automl-gs/tree/master/automl_gs...

tl;dr it uses the standard encoder-combiner-MLP-output architecture, but with a lot of variability in the process.
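
Roughly, that architecture looks like this in Keras (an illustrative sketch with made-up field names and sizes, not the generated code itself):

    from tensorflow import keras
    from tensorflow.keras import layers

    # Illustrative only: one encoder per field, concatenated into a
    # combiner, then an MLP feeding the output head. Field names and
    # sizes are made up for the example.
    numeric_in = keras.Input(shape=(4,), name="numeric_fields")
    cat_in = keras.Input(shape=(1,), name="categorical_field")

    cat_enc = layers.Flatten()(layers.Embedding(input_dim=50, output_dim=8)(cat_in))
    combined = layers.concatenate([numeric_in, cat_enc])  # the combiner

    mlp = layers.Dense(64, activation="relu")(combined)
    output = layers.Dense(1, activation="sigmoid", name="target")(mlp)

    model = keras.Model([numeric_in, cat_in], output)
    model.compile(optimizer="adam", loss="binary_crossentropy")

The variability is in which encoders, layer sizes, and training settings get filled into the templates per trial.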


So, how do you survive a Titanic accident?


You probably want "old-fashioned" regression models if you want to understand which predictors affected survival and how they affected it.
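
For example, a logistic regression with statsmodels gives interpretable coefficients (column names assume the standard Kaggle Titanic CSV; the file path is a placeholder):

    import pandas as pd
    import statsmodels.formula.api as smf

    # "titanic.csv" is a placeholder path; columns follow the Kaggle dataset.
    df = pd.read_csv("titanic.csv")
    fit = smf.logit("Survived ~ C(Pclass) + C(Sex) + Age + Fare", data=df).fit()
    print(fit.summary())  # coefficient signs/sizes show each predictor's effect

The signs and magnitudes (e.g., on Sex and Pclass) answer the "which predictors and how" question directly, which a black-box AutoML model won't.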



