Machine Learning Containers Are Bloated and Vulnerable (deep.ai)
28 points by PaulHoule on May 9, 2023 | hide | past | favorite | 13 comments


We need an update of Parkinson's law:

Software expands to occupy all hardware available. If you throw more hardware at the problem, you get bloated software.

It happens everywhere. Powerful clients led to bloated web apps. Huge data centers led to "microservices and devops" bloat. How many layers of software do I need to run the code that actually handles business logic? 'Everything-as-code' sounds great when the only tool in your box is writing code.

95% of actual, real-world problems could be solved with spreadsheets and email. But when you pump productivity gains into capital returns and keep people working the same hours as a century ago, busywork goes through the roof.

Bloat everywhere.


Namaria's Law:

"Software expands to occupy all hardware available. If you throw more hardware at the problem, you get bloated software."


Actually, I think this problem is going to be solved easily with GPT, since it writes really concise and performant code. Using LLMs to generate scripts is so much more efficient and will make this bloat problem fade away.

/s


"The model must expand"


More is always better. That's why people like LLMs. It takes them *decades* to process a database on consumer hardware, whereas more targeted systems take merely seconds. Obviously decades > seconds, therefore: LLMs > smaller targeted architectures.


I'm shocked they work at all.

For example, the very popular oobabooga/text-generation-webui repo on GitHub has a Dockerfile that simply won't build.

Oh, maybe it was possible to build it once at some point in time, but then some Python package got updated, and it's game over.

It's astonishing how fragile the reproducibility of everything in the Python and ML world is. Even with the help of Docker, things break within weeks, and then stay broken.
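The usual defense against this kind of bit-rot (a sketch of the general technique, not anything that repo actually does) is to pin everything: the base image by digest rather than by floating tag, and every Python package to an exact version, so a rebuild months later fetches the same bits. The digest and version numbers below are placeholders:

```dockerfile
# Pin the base image by digest, not a floating tag like "latest"
FROM python:3.10-slim@sha256:<digest-goes-here>

COPY requirements.txt .
# requirements.txt must use exact pins, e.g. torch==2.0.1, never torch>=2.0
RUN pip install --no-cache-dir -r requirements.txt
```

This doesn't make an already-broken Dockerfile build again, but it greatly narrows the window in which a silent upstream package bump can break a working one.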


I built this earlier today and ran into three problems along the way.

1) didn't link the files up to the root (it's in the docs - I didn't read them)

2) needed to specify my CUDA version in .env

3) using Docker on WSL with a hard memory limit set, so the build got OOM-killed

Now it's working fine
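For anyone hitting the same WSL memory limit: the cap lives in %UserProfile%\.wslconfig on the Windows side (values below are illustrative; pick limits that fit your machine):

```ini
[wsl2]
memory=16GB   # raise the hard cap so large Docker builds aren't OOM-killed
swap=8GB
```

Restart WSL (`wsl --shutdown`) for the change to take effect.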


https://repo2docker.readthedocs.io/en/latest/ :

> jupyter-repo2docker is a tool to build, run, and push Docker images from source code repositories.

> repo2docker fetches a repository (from GitHub, GitLab, Zenodo, Figshare, Dataverse installations, a Git repository or a local directory) and builds a container image in which the code can be executed. The image build process is based on the configuration files found in the repository.

> repo2docker can be used to explore a repository locally by building and executing the constructed image of the repository, or as a means of building images that are pushed to a Docker registry.

> repo2docker is the tool used by BinderHub to build images on demand
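The basic invocation is a single command pointed at a repo; the repository URL here is one of the Binder example repos, used purely as an illustration:

```shell
pip install jupyter-repo2docker
# build an image from the repo's configuration files and run it locally
repo2docker https://github.com/binder-examples/requirements
```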

There are maintenance advantages, but a longer time to patch, with kitchen-sink ML containers like kaggle/docker-python, because it takes work to bump every version in the requirements files (and to run integration tests to make sure code still runs after upgrading everything, or one thing at a time).

What's best practice for including a sizeable dataset in a container (that's been recently re-) built with repo2docker?


… I identified this bloat as a big problem a few years ago. People want to pack up a machine learning model into a system like pip, maven and/or Docker and this frequently means a foundation model (1 gig+) gets copied several times in the process of deployment. In a production environment where the model gets loaded occasionally this is bad enough but in a dev environment it can turn a 10 second process into a several minute process…
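One way to blunt the "same multi-gigabyte model copied several times" cost is a content-addressed cache that hard-links instead of copying. This is a hypothetical helper I'm sketching to illustrate the idea, not anything pip, maven, or Docker provides; `cache_artifact` and `materialize` are made-up names:

```python
import hashlib
import os

def cache_artifact(src_path, cache_dir):
    """Store a large artifact (e.g. a model file) in a content-addressed
    cache keyed by its SHA-256, so each unique blob occupies disk once."""
    h = hashlib.sha256()
    with open(src_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    blob = os.path.join(cache_dir, h.hexdigest())
    if not os.path.exists(blob):
        os.link(src_path, blob)  # first sighting: link into the cache
    return blob

def materialize(blob, dest_path):
    """'Copy' the cached blob to a deployment path as a hard link --
    constant time instead of re-reading gigabytes."""
    if os.path.exists(dest_path):
        os.remove(dest_path)
    os.link(blob, dest_path)
```

Hard links require that cache and destination share a filesystem, which is exactly the situation in the dev-loop case described above, where the same foundation model gets re-copied on every iteration.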


I wish it were common practice to unbloat containers, but it seems needlessly hard to do and undervalued.

Most containers are built with layers, and deleting files in a later layer doesn't make the image any smaller.

Older versions of docker don't help with squashing (except with experimental mode turned on).

I think you might be able to do it with multi-stage builds and COPY.

but... it's not valued
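A minimal sketch of the multi-stage approach the parent mentions (stage names and paths are illustrative): do the heavy install in one stage, COPY only the artifacts forward, and the fat build layers never reach the final image.

```dockerfile
# build stage: compilers, headers, pip caches -- all disposable
FROM python:3.10 AS build
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt

# runtime stage: start from a slim base and copy only what was installed
FROM python:3.10-slim
COPY --from=build /install /usr/local
COPY app/ /app
CMD ["python", "/app/main.py"]
```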


I think the model where you build a layer on top of a specific layer, which is on top of another layer, is part of the problem. If you could replace a layer without having to rebuild the layers that depend on it, that would make a difference.


I very much enjoy dive[1] for this purpose, though it doesn't seem to be maintained anymore...

[1] https://github.com/wagoodman/dive
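For reference, typical usage looks like the following (the image name is just an example):

```shell
# explore an image's layers interactively and see what each one adds
dive python:3.10-slim

# or run non-interactively in CI and fail on low space efficiency
dive --ci python:3.10-slim
```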


I recently had the unfortunate experience of being tasked with dockerizing a Python project that depends on an ML "library" (namely, rembg). I hope to never, ever touch ML-crowd crap again.



