Software expands to occupy all hardware available. If you throw more hardware at the problem, you get bloated software.
It happens everywhere. Powerful clients led to bloated web apps. Huge data centers led to "microservice and devops" bloat. How many layers of software do I need to run the code that actually handles business logic? 'Everything-as-code' sounds great when your toolbox is knowing how to write code.
95% of actual, real-world problems could be solved by spreadsheets and email. But when you pump productivity gains into capital returns and keep people working the same hours as a century ago, busywork goes through the roof.
Actually, I think this problem is going to be solved easily with GPT, since it writes really concise and performant code.
Using LLMs to generate scripts is so much more efficient and will make this bloat problem fade away.
More is always better. That's why people like LLMs. It takes them *decades* to process a database on consumer hardware, whereas more targeted systems take mere seconds. Obviously decades > seconds, therefore: LLMs > smaller targeted architectures.
For example, the very popular oobabooga/text-generation-webui repo on GitHub has a Dockerfile that simply won't build.
Oh, maybe it was possible to build it once at some point in time, but then some Python package got updated, and it's game over.
It's astonishing how fragile the reproducibility of everything in the Python and ML world is. Even with the help of Docker, things break within weeks, and then stay broken.
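One common source of this kind of breakage is a requirements file that doesn't pin exact versions, so a rebuild weeks later silently pulls newer packages. Here's a rough sketch of a check for that. It's deliberately simplistic: the function name is made up, it only looks for `==`, and it doesn't handle URL requirements, hashes, or transitive dependencies.

```python
def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that do not pin an exact version with '=='.

    Simplified on purpose: comments are stripped at the first '#', and
    pip option lines (those starting with '-', e.g. '-r other.txt') are skipped.
    """
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments and whitespace
        if not line or line.startswith("-"):  # skip blank lines and pip options
            continue
        if "==" not in line:  # no exact pin on this requirement
            loose.append(line)
    return loose


if __name__ == "__main__":
    sample = "torch==2.1.0\ntransformers>=4.30\n# a comment\nnumpy\n"
    for req in unpinned_requirements(sample):
        print("not pinned:", req)
```

Pinning top-level packages still doesn't freeze their dependencies, so this catches only the most obvious case.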
> jupyter-repo2docker is a tool to build, run, and push Docker images from source code repositories.
> repo2docker fetches a repository (from GitHub, GitLab, Zenodo, Figshare, Dataverse installations, a Git repository or a local directory) and builds a container image in which the code can be executed. The image build process is based on the configuration files found in the repository.
> repo2docker can be used to explore a repository locally by building and executing the constructed image of the repository, or as a means of building images that are pushed to a Docker registry.
> repo2docker is the tool used by BinderHub to build images on demand
Kitchen-sink ML containers like kaggle/docker-python have maintenance advantages, but also a longer time to patch, because it takes work to bump all of the versions in the requirements specification files (and to run integration tests to make sure code still runs after upgrading everything at once or one thing at a time).
What's the best practice for including a sizeable dataset in a container that has recently been (re)built with repo2docker?
… I identified this bloat as a big problem a few years ago. People want to pack up a machine learning model into a system like pip, maven and/or Docker and this frequently means a foundation model (1 gig+) gets copied several times in the process of deployment. In a production environment where the model gets loaded occasionally this is bad enough but in a dev environment it can turn a 10 second process into a several minute process…
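To make the copying problem concrete, here is a minimal sketch of one way around it: keep a single copy of the model file in a content-addressed store and hardlink it wherever it's needed. This is an illustration under assumptions, not what any packaging tool actually does. `link_from_store` and the store layout are hypothetical, and hardlinks only work when the store and the destination are on the same filesystem.

```python
import hashlib
import os
import shutil
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so a large model is not loaded into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def link_from_store(src: Path, store: Path, dest: Path) -> Path:
    """Keep one copy of src in store (named by its hash) and hardlink dest to it."""
    store.mkdir(parents=True, exist_ok=True)
    blob = store / file_digest(src)
    if not blob.exists():
        shutil.copyfile(src, blob)  # the only real copy that is ever made
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists():
        dest.unlink()  # replace a stale file or link at the destination
    os.link(blob, dest)  # hardlink: no extra bytes written
    return blob
```

Calling it twice with the same model and two different destinations leaves one blob in the store and two names pointing at it. Hashing still reads the whole file each time, so a real tool would probably cache the digest too.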
I think the model where you build a layer on top of a specific layer, which is itself on top of another layer, is part of the problem. If you could replace a layer without having to rebuild the layers that depend on it, that would make a difference.
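A toy model of why that happens: if each layer's identity is derived from its parent's identity plus its own contents (loosely modeled on the chain ID idea in the OCI image spec), then changing a lower layer changes the identity of everything stacked above it. The function below is a simplified illustration, not Docker's actual implementation, and the layer contents are just placeholder byte strings.

```python
import hashlib


def chain_ids(layer_contents: list[bytes]) -> list[str]:
    """Identity of each layer, derived from its own content and its parent's identity."""
    ids: list[str] = []
    parent = ""
    for content in layer_contents:
        diff = hashlib.sha256(content).hexdigest()  # hash of this layer alone
        if parent:
            # fold the parent's identity into this layer's identity
            parent = hashlib.sha256(f"{parent} {diff}".encode()).hexdigest()
        else:
            parent = diff  # the bottom layer is identified by its own hash
        ids.append(parent)
    return ids


if __name__ == "__main__":
    original = chain_ids([b"base", b"deps", b"app"])
    patched = chain_ids([b"base-patched", b"deps", b"app"])
    changed = sum(a != b for a, b in zip(original, patched))
    print(f"{changed} of {len(original)} layers changed identity")
```

Patching only the base layer changes every identity in the stack, even though "deps" and "app" are byte-for-byte the same, which is why the layers above get rebuilt.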
I recently had the unfortunate experience of being tasked with dockerizing a Python project that depends on an ML "library" (namely, rembg). I hope to never, ever touch ML crowd crap again.
Bloat everywhere.