I worked at a startup whose first service was "upload a CSV, and we automatically generate interesting charts". That's not trivial, but it's not exactly rocket science, either. The main trouble we had was that in order for this to work well (and it makes for a terrific demo), you need to start with a great CSV file. The average CSV file you find on the web isn't.
CSV is about the barest amount of specification in a file format. It's common to run across files which are some weirdo encoding you can't easily detect, or are a mix of multiple encodings, or a mix of line endings, or which should be treated as case-insensitive (or only for some columns), or which have weird number formatting (or units, and not the same units in every row), or typos and spelling errors, or it came from an OCR'd PDF and there's "page 2" right in the middle of it, or they tried to combine multiple files together so there's multiple headers scattered throughout the file (or none at all), or the top has different columns from the bottom, or it uses quoting differently (obviously not per the RFC), or it's assumed that "nil"/"NULL"/""/"-"/"0" are the same, or ...
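To make a couple of those pathologies concrete, here's a minimal sketch (all names like `NULL_TOKENS` and `clean_rows` are made up for illustration) that handles just two of them: a second header row pasted into the middle of the file, and the assumption that "nil"/"NULL"/""/"-" all mean the same thing. A real cleaner would also have to deal with encodings, line endings, units, and everything else on the list.

```python
import csv
import io

# Illustrative sketch only: normalize two common CSV pathologies.
# NULL_TOKENS is an assumed set of "null-ish" strings, not a standard.
NULL_TOKENS = {"nil", "null", "", "-", "n/a", "na"}

def clean_rows(text):
    """Parse CSV text, drop repeated header rows, map null-ish cells to None."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [], []
    header = rows[0]
    cleaned = []
    for row in rows[1:]:
        if row == header:  # a second header pasted mid-file (concatenated files)
            continue
        cleaned.append([None if cell.strip().lower() in NULL_TOKENS else cell
                        for cell in row])
    return header, cleaned

messy = "name,score\nalice,10\nname,score\nbob,NULL\ncarol,-\n"
header, rows = clean_rows(messy)
# header -> ["name", "score"]
# rows   -> [["alice", "10"], ["bob", None], ["carol", None]]
```

Even this toy version makes a judgment call (is "-" really a null, or a minus sign that lost its digits?), which is exactly why fully automating it is hard.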
In short, data (which hasn't been cleaned by hand) sucks, and CSV doubly so. If you want to put your AI/ML smarts to work, write a program to take a shitty CSV file (or even better, a shitty PDF file!) and generate good clean data, plus a description of its schema. That would be an amazing tool.
So far, OpenRefine is the nicest tool for this that I've seen. Figure out how to make it fully automatic, and everybody with piles of raw data (governments) will beat a path to your door.