
https://repo2docker.readthedocs.io/en/latest/ :

> jupyter-repo2docker is a tool to build, run, and push Docker images from source code repositories.

> repo2docker fetches a repository (from GitHub, GitLab, Zenodo, Figshare, Dataverse installations, a Git repository or a local directory) and builds a container image in which the code can be executed. The image build process is based on the configuration files found in the repository.

> repo2docker can be used to explore a repository locally by building and executing the constructed image of the repository, or as a means of building images that are pushed to a Docker registry.

> repo2docker is the tool used by BinderHub to build images on demand
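For local exploration the CLI is a one-liner; a minimal sketch (the repo URL is the example from repo2docker's own docs, and Docker needs to be running):

    pip install jupyter-repo2docker
    jupyter-repo2docker https://github.com/norvig/pytudes

This builds an image from the configuration files found in the repo and launches a Jupyter server inside it.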

Kitchen-sink ML containers like kaggle/docker-python have maintenance advantages, but also a longer time to patch: it takes work to bump every version in the requirements specification files (and to run integration tests to confirm code still runs after upgrading everything at once, or one package at a time).
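As a rough sketch of what that bump-everything step looks like, assuming pip-tools and a requirements.in of loose constraints (the pytest step is a stand-in for whatever integration tests the project has):

    pip-compile --upgrade requirements.in   # re-resolve every pin at once
    pip install -r requirements.txt
    pytest                                  # see what the upgrades broke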

What's the best practice for including a sizeable dataset in a container that's been built (or recently rebuilt) with repo2docker?
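(For context, repo2docker's postBuild hook, which runs once at image build time, is one place such a fetch could live; a sketch with a placeholder URL:)

    #!/bin/bash
    # postBuild: repo2docker executes this during the image build,
    # so the data gets baked into the image and re-fetched on every rebuild.
    wget -q https://example.org/dataset.tar.gz -O /tmp/dataset.tar.gz  # placeholder URL
    mkdir -p "$HOME/data"
    tar -xzf /tmp/dataset.tar.gz -C "$HOME/data"
    rm /tmp/dataset.tar.gz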


