I’m new to Python and Docker, learning a lot these days. I’m facing the problem of learning how to dockerize my Python app, and I want to reduce its size. Due to some scientific dependencies the image is huge. I read about the possibility of building core scientific libraries like numpy, scipy, pandas, etc. from source with custom build arguments, e.g. by stripping symbols and debug information or by selecting the BLAS/LAPACK libraries, e.g. OpenBLAS (small) vs MKL (huge). Please see here for a reference.
Anyway, I am struggling to do this through pip, the Python package manager.
So far, what I have tried is installing some build dependencies, for instance (numpy and scipy deps):
build-essential
cmake
ninja-build
gfortran
pkg-config
python-dev
libopenblas-dev
liblapack-dev
cython3
patchelf
libatlas-base-dev
libffi-dev
And then I tried to use pip as follows (trying, for instance, to build just numpy from source):
pip install --prefix=${VIRTUAL_ENV} --no-cache-dir --use-pep517 --check-build-dependencies --no-binary numpy --global-option=build_ext --global-option="-g0" --global-option="-Wl,--strip-all" --requirement dependencies.txt
I am struggling to understand:
The difference between the --compile and the --no-binary options of pip install. Which should I use?
The use and implications of the option --no-build-isolation; should I exploit it in my case?
How to drop the option --global-option (being deprecated, as I understand) and still pass custom build arguments. Should I exploit the option --config-settings? Where can I find real-world examples of using it in different ways?
Can I do the above per package, e.g. custom options for building numpy and different options for building scipy?
How can I speed up the build time? Should I investigate the -j4 option, for instance?
Faced with this problem I would not use pip at all.
I would write scripts to take the source code of what you want to compile and use their build instructions to build them as you need.
You would build the dependencies and then supply them to the higher level packages as input.
What OS are you working on?
Also, does the size of the Docker image really matter vs. the time you will invest in creating a smaller version?
Exactly – pip is not a build tool – it is a package manager that is overloaded to provide an interface to build tools, for an easier one-stop-shopping experience. In your case, you want to call the build tool directly (which I think is still a heavily patched setuptools/distutils for numpy / scipy, but I may be wrong). You probably want to build wheels and install those in your docker image.
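Roughly, that pattern looks like a multi-stage Dockerfile along these lines (a sketch only — base image, package list and flags are placeholders, not a tested recipe):
# build stage: compile the wheels once, with the full toolchain installed
FROM python:3.11-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential gfortran pkg-config libopenblas-dev
COPY dependencies.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels --no-binary numpy,scipy -r dependencies.txt

# runtime stage: only the prebuilt wheels, no compilers in the final image
FROM python:3.11-slim
RUN apt-get update && apt-get install -y --no-install-recommends libopenblas0 \
    && rm -rf /var/lib/apt/lists/*
COPY --from=builder /wheels /wheels
COPY dependencies.txt .
RUN pip install --no-cache-dir --no-index --find-links=/wheels -r dependencies.txt \
    && rm -rf /wheels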
Also – you might want to give conda a look-see – if the packages in conda-forge aren’t quite what you need, you could rebuild just a couple.
Also – scipy is pretty darn big – and you are probably only using a small bit of it, but it’s hard to know exactly which bits – I’ve tried to split it up in the past and it hasn’t been worth it.

Christopher H. Barker, PhD:
(which I think is still a heavily patched setuptools/distutils for numpy / scipy, but I may be wrong)
FYI, nowadays, as I understand it, SciPy has switched and NumPy is switching to the new standards-based meson-python, and numpy.distutils is deprecated.

Luca Maurelli:
Due to some scientific dependencies the image is huge. I read about the possibility of building core scientific libraries like numpy, scipy, pandas, etc. from source with custom build arguments, e.g. by stripping symbols and debug information or by selecting the BLAS/LAPACK libraries, e.g. OpenBLAS (small) vs MKL (huge).
Let me first point out that the article you’re linking is from 2018. A lot has changed in the meantime. In general:
Wheels on PyPI are already stripped and do not contain debug symbols. If they do, that’d be a bug, so if you see this for any of the packages you are trying to build here, please open an issue on the project’s issue tracker.
Wheels do not include MKL, so at best you can get rid of a duplicate copy of libopenblas and libgfortran - perhaps shaving off 40 MB from your final Docker image.
If you really need small images, then build with musl rather than glibc - so use Alpine Linux as a base, for example.
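For instance, a minimal sketch of such a base (the Alpine package names here are assumptions, and note that most scientific packages will then be compiled from source, since musl wheels may not exist for your pins):
FROM python:3.11-alpine
RUN apk add --no-cache build-base gfortran openblas-dev
RUN pip install --no-binary numpy numpy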
The difference between the --compile and the --no-binary options of pip install. Which should I use?
The use and implications of the option --no-build-isolation; should I exploit it in my case?
How to drop the option --global-option (being deprecated, as I understand) and still pass custom build arguments. Should I exploit the option --config-settings? Where can I find real-world examples of using it in different ways?
Can I do the above per package, e.g. custom options for building numpy and different options for building scipy?
How can I speed up the build time? Should I investigate the -j4 option, for instance?
Use --no-binary.
Yes, you should use --no-build-isolation here. The dependencies in pyproject.toml contain numpy== pins that are specific to building binaries for redistribution on PyPI (handwaving, there’s a bit more to that - not too relevant here). If you are building your whole stack from source and then deploy that, you want to build against the packages in that stack. With build isolation, you’d instead be pulling different versions down from PyPI.
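Concretely, that looks something like this (a sketch; the exact build dependencies differ per package and version):
# install the build dependencies into your environment yourself ...
pip install cython pybind11 pythran meson-python ninja numpy
# ... then tell pip not to create an isolated build environment,
# so the source build sees exactly these versions
pip install --no-binary scipy --no-build-isolation scipy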
Yes, you should use --config-settings. Its interface is unfortunately pretty cumbersome, and there isn’t a canonical place with good examples AFAIK. Here is one for SciPy: http://scipy.github.io/devdocs/dev/contributor/meson_advanced.html#select-a-different-blas-or-lapack-library. I’ll note that as of today, using python -m build to build wheels, and then pip to install them, is nicer (will be solved in a next pip release).
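To illustrate (a sketch only; setup-args is the meson-python config-settings key, and the exact spellings depend on your meson-python and pip versions):
# choose the BLAS/LAPACK used when building SciPy from source
pip install scipy --no-binary scipy --no-build-isolation \
    --config-settings=setup-args=-Dblas=openblas \
    --config-settings=setup-args=-Dlapack=openblas
# the same via the build frontend, which handles config settings better today
python -m build --wheel --no-isolation -Csetup-args=-Dblas=openblas .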
You can specify different options per package. It’s anyway a good idea to build packages one by one in this case.
You can pass -j4 in --config-settings indeed. For scipy this isn’t needed, the Meson build will use all available cores by default. For anything still using setup.py based builds, you need to manually control it.
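Something like this, for example (these knobs are backend-specific assumptions and should be checked against the versions you pin):
# meson-python based builds (recent SciPy): control compile jobs via config settings
pip install scipy --no-binary scipy --no-build-isolation --config-settings=compile-args=-j4
# numpy.distutils / setup.py based builds: NumPy reads an environment variable for parallel builds
NPY_NUM_BUILD_JOBS=4 pip install numpy --no-binary numpy --no-build-isolation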

Barry Scott:
Also, does the size of the Docker image really matter vs. the time you will invest in creating a smaller version?
It may, unfortunately. For example, AWS Lambda has size constraints that can cause your application to fail to upload. Last time I tried was a few years ago, but then numpy + scipy + pandas was already very close to the limit.

Christopher H. Barker, PhD:
Exactly – pip is not a build tool – it is a package manager that is overloaded to provide an interface to build tools, for an easier one-stop-shopping experience. In your case, you want to call the build tool directly
This advice is incorrect. You should be using a build frontend, so either build or pip. Invoking setup.py (or meson, or …) usually still works but may cause some subtle problems, like missing .dist-info metadata, which then can cause for example importlib.resources to not work. There is no reason to avoid pip or build here - it’s just a way to say “build me a wheel”.
Let me also point out that when you’re building a conda package, that will do the exact same thing - invoke pip under the hood (look at the average conda-forge recipe to confirm that).

Ralf Gommers:
Let me also point out that when you’re building a conda package, that will do the exact same thing - invoke pip under the hood (look at the average conda-forge recipe to confirm that).
Also, this is only peripherally related but I spent > 5 minutes figuring out where this happens (and sharing for posterity); conda-build sets the appropriate environment variable such that pip doesn’t use build isolation, i.e. it uses the conda environment that the recipe provisions.
I am working with a Debian slim Docker image at the moment. I will consider switching to Alpine Linux later on if that is possible with little effort, or also consider building a distroless image and transferring only the needed packages, e.g. see here for a reference. Although this approach does not seem mature enough for production yet.
The time spent is probably greater than the results, but I’m learning these technologies, so it’s nice to dig a bit and learn more. The know-how is easily transferable to many applications. Also note that some cloud providers have size limits (e.g. AWS Lambda functions) and/or you pay for every pull/push to a private cloud image registry, so minimizing the size can impact your deployment/running costs.

Luca Maurelli:
Would you mind elaborating a bit more about splitting scipy up and giving me some references on how to do it?
I wish I could, but I don’t think there is any such reference. Way back when, there was some discussion of breaking scipy up into sub-packages, but for the most part:
The user experience for most is much better if a single-step install (and import) of scipy gets you a whole bunch of stuff – that’s kinda what scipy is for – after all, we have numpy as a core package already.
While quite a bit of scipy is optional, there is also a fair bit of inter-dependence – so it’s hard to know where to draw the line.
So you kind of need to do it by hand – hand-build (copying and pasting from scipy) a package that has only what you need, and keep adding stuff until it works.
In my case, I only ended up doing this when I literally only needed perhaps a couple of functions in scipy.special, for instance.

PythonCHB:
Exactly – pip is not a build tool – it is a package manager that is overloaded to provide an interface to build tools, for an easier one-stop-shopping experience. In your case, you want to call the build tool directly
This advice is incorrect. You should be using a build frontend, so either build or pip. Invoking setup.py (or meson, or …) usually still works but may cause some subtle problems, like missing .dist-info metadata, which then can cause for example importlib.resources to not work. There is no reason to avoid pip or build here - it’s just a way to say “build me a wheel”.
Calling the build tool directly is fine if the build tool says you can do it. Some do say that (e.g. flit), some say not to do it (e.g. setuptools, and apparently meson?)
I deliberately built pymsbuild to be invoked directly for developer tasks that are more complex than “just turn these sources into a wheel”.
But I think the advice everyone was trying to get at here is that the packages probably need to be built independently. That is, when there’s a chain of native dependencies, build each one directly and keep building on top of those, rather than trying to grab the end of the chain and expecting pip to sort it out.
Importantly, right now if you’re using --config-settings there’s really no good way to make sure your settings only apply to the package you care about (and not its dependencies, and potentially not even build dependencies if they have to be built). So you may have to understand the dependency chain first and use a separate build for each one.
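A rough sketch of what that per-package approach can look like (directory names, options and build order here are illustrative only, not a tested recipe):
# install the build requirements once, into the environment you control
pip install build meson-python cython pybind11 pythran ninja
# build numpy first and install it, since scipy needs it at build time when isolation is off
python -m build --wheel --no-isolation --outdir wheelhouse ./numpy
pip install --no-index --find-links=wheelhouse numpy
# now build scipy with its own, separate config settings
python -m build --wheel --no-isolation --outdir wheelhouse -Csetup-args=-Dblas=openblas ./scipy
pip install --no-index --find-links=wheelhouse scipy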

Steve Dower:
Calling the build tool directly is fine if the build tool says you can do it. Some do say that (e.g. flit), some say not to do it (e.g. setuptools, and apparently meson?)
It’s fine for developer tasks to call meson directly, as you mentioned (and setup.py too for that matter - but you need to know what you’re doing exactly at that point, and what you’re missing out on). But that didn’t really seem to be the question here. Meson doesn’t have an opinion - it does have a very good CLI that’s meant to be used, but to get .dist-info you need meson-python, which does not have a CLI and can only be used via pip/build.
I am responding here and not directly to you because your discussion is beyond my expertise.
Here are some personal considerations:
Using the --check-build-dependencies option causes an error, as it detects unresolvable conflicts for me; see for example:
ERROR: Some build dependencies for scipy from … conflict with the backend dependencies: numpy==1.24.2 is incompatible with numpy==1.19.5; … , pybind11==2.10.3 is incompatible with pybind11==2.10.1.
If, as I believe, I must choose all package versions in order to have a compatible stack, I will not force this check since I am unable to understand the implications.
System packages and pip packages are different. I found myself having to install Cython with pip even though I had installed cython3 from apt-get previously.
Some dependencies are not straightforwardly fixable: a missing mesonpy package was fixed by installing a package with a different name, meson-python; and installing both scipy and scikit-learn together fails because scikit-learn is not able to recognize the build chain order (?) (scipy was already processed but not found). I don’t know if that is because scikit-learn needs scipy to be installed already.
Anyway, given these problems and the time spent, I guess this is a rabbit hole for me, considering also my lack of expertise.
At the moment I am trying to build like this:
# Create virtual environment without bootstrapped pip
#!!! TODO: check if setuptools is bootstrapped, if yes then should be deleted to optimize image size
# https://docs.python.org/3/library/venv.html
RUN python -m venv --without-pip ${VIRTUAL_ENV}
# tools needed to build requirements from source:
# https://docs.scipy.org/doc//scipy-1.4.1/reference/building/linux.html
# https://numpy.org/doc/stable/user/building.html
# https://numpy.org/install/
# https://packages.debian.org/source/stable/cython
RUN set -eux \
&& buildScientificPackagesDeps=' \
build-essential \
cmake \
ninja-build \
gfortran \
pkg-config \
python-dev \
libopenblas-dev \
liblapack-dev \
#cython3 \
#patchelf \
autoconf \
automake \
libatlas-base-dev \
# TODO: check if python-ply is needed
python-ply \
libffi-dev \
' \
&& apt-get update \
&& apt-get install -y --no-install-recommends $buildScientificPackagesDeps
# Install dependencies list
# --prefix
# used to install inside virtual environment path
# --use-pep517 --check-build-dependencies --no-build-isolation
# used to solve https://github.com/pypa/pip/issues/8559
# "# DEPRECATION: psycopg2 is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed"
# --compile --global-option=build_ext --global-option=-g0 --global-option=-Wl
# used to pass flags to C compiler and compile to bytecode from source, see:
# https://towardsdatascience.com/how-to-shrink-numpy-scipy-pandas-and-matplotlib-for-your-data-product-4ec8d7e86ee4
# https://blog.mapbox.com/aws-lambda-python-magic-e0f6a407ffc6
# https://pip.pypa.io/en/stable/cli/pip_install/#options
RUN pip install --upgrade --no-cache-dir pip wheel setuptools Cython meson-python pythran pybind11 \
&& pip install --prefix=${VIRTUAL_ENV} --no-cache-dir --use-pep517 --no-build-isolation \
#--check-build-dependencies \
--requirement dependencies.txt \
# https://discuss.python.org/t/how-to-use-pip-install-to-build-some-scientific-packages-from-sources-with-custom-build-arguments/
# https://github.com/pypa/pip/issues/11325
--no-binary numpy,scipy,pandas --config-settings="build_ext=-j4" \
&& pip cache purge
I will report the results when the build ends. Feel free to share your opinions, thanks.
Building numpy, scipy, and pandas with --no-build-isolation (as stated in the above script) did not reduce the size of the image; I saved just 1 MB. I could try a few other tests if suggestions are provided (they are welcome).
Educating myself about Python bytecode, I found it useful to add --no-compile alongside --no-cache-dir during pip install commands. This trick saved me around 180 MB, which is huge. The downside is that .py files have to be compiled into .pyc at run time for execution. I saw the official Python Docker image also strips off .pyc files; I guess the size benefit outweighs the performance hit, if there is any. I have to check whether importing only the needed functions from modules is a best practice in this sense, in order to compile into bytecode only what is actually needed.
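For reference, the knobs involved look roughly like this (a sketch; paths are illustrative):
# skip .pyc generation at install time (this is what saved the ~180 MB)
RUN pip install --no-compile --no-cache-dir --requirement dependencies.txt
# avoid writing .pyc files back inside the container at run time
ENV PYTHONDONTWRITEBYTECODE=1
# alternatively, pre-compile at build time and accept the size cost for faster startup
# RUN python -m compileall -q ${VIRTUAL_ENV}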
EDIT: actually the image does not work, failing to import the numpy module. 
Arrggh! These issues are being actively talked about in other threads, but the short version is: don’t use the system Python, at least not without virtual environments.
Note that conda makes a point of installing python packages in a pip-compatible way just for this reason.
Also – if you are doing custom builds, I would probably use --no-deps anyway.
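For example (a sketch; the wheel directory is just a placeholder):
# install only your custom-built wheels, without letting pip resolve or fetch their dependencies
pip install --no-deps --no-index --find-links=/wheels numpy scipy pandas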
Great thread, thanks to all for the constructive discussion. I’ve upgraded my build Dockerfile for SciPy and NumPy, which was modelled on the same article (the one from 5 years ago), targeting Amazon Linux 2 (a “layer” for a packaged AWS Lambda microservice). Building from source enables it to fit within the AWS Lambda size constraints, which I appreciate is a bit of a misuse of the purpose, but you’ve got to do what you’ve got to do to deploy!
There was a bit of a cascade of build dependency requirements from recent version updates (cmake, gcc, etc.), so a significant portion of the dependencies had to be built from source in the Docker image.
I hope nobody minds if I share this here, where it might be visible to others facing a similar challenge, or for comparison against the solution above. There were a few other places describing this problem (e.g. here) which trailed off without a clear indication of whether they solved it or not.
I haven’t found the magic combination of options that works yet, but at least I think you’ve set me on the right path now, thanks.
Basic setup and Yum packages installed before build:
FROM mlupin/docker-lambda:python3.10-build AS build
USER root
WORKDIR /var/task
# https://towardsdatascience.com/how-to-shrink-numpy-scipy-pandas-and-matplotlib-for-your-data-product-4ec8d7e86ee4
ENV CFLAGS "-g0 -Wl,--strip-all -DNDEBUG -Os -I/usr/include:/usr/local/include -L/usr/lib64:/usr/local/lib64:/usr/lib:/usr/local/lib"
RUN yum install -y wget curl git nasm openblas-devel.x86_64 lapack-devel.x86_64 python-dev file-devel make Cython libgfortran10.x86_64 openssl-devel
# Download and install CMake
WORKDIR /tmp
ENV CMAKE_VERSION=3.26.4
RUN wget https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz
RUN tar -xvzf cmake-${CMAKE_VERSION}.tar.gz
RUN cd cmake-${CMAKE_VERSION} && ./bootstrap && make -j4 && make install
# Clean up temporary files
RUN rm -rf /tmp/cmake-${CMAKE_VERSION}
RUN rm /tmp/cmake-${CMAKE_VERSION}.tar.gz
WORKDIR /var/task
RUN pip install --upgrade pip
RUN pip --version
# Specify the version to use for numpy and scipy
ENV NUMPY_VERSION=1.24.3
ENV SCIPY_VERSION=1.10.1
# Download the numpy source distribution (scipy is cloned from git below)
RUN pip download --no-binary=:all: numpy==$NUMPY_VERSION
# Upgrade GCC to version 8 for SciPy Meson build system
RUN wget https://ftp.gnu.org/gnu/gcc/gcc-8.4.0/gcc-8.4.0.tar.gz && \
tar xf gcc-8.4.0.tar.gz && \
rm gcc-8.4.0.tar.gz && \
cd gcc-8.4.0 && \
./contrib/download_prerequisites && \
mkdir build && \
cd build && \
../configure --disable-multilib && \
make -j$(nproc) && \
make install && \
cd / && \
rm -rf gcc-8.4.0
# Set environment variables
ENV CC=/usr/local/bin/gcc
ENV CXX=/usr/local/bin/g++
ENV FC=/usr/local/bin/gfortran
# Verify GCC version
RUN gcc --version
RUN /usr/local/bin/gfortran --version
# Extract the numpy package and build the wheel
RUN pip install Cython
RUN ls && tar xzf numpy-$NUMPY_VERSION.tar.gz
RUN ls && cd numpy-$NUMPY_VERSION && python setup.py bdist_wheel build_ext -j 4
ENV BUILT_NUMPY_WHEEL=numpy-$NUMPY_VERSION/dist/numpy-$NUMPY_VERSION-*.whl
RUN ls $BUILT_NUMPY_WHEEL
NumPy and SciPy build (for simplicity I installed a wheel with the same version of NumPy as I was building from source, the wheel being purely for building SciPy)
# Don't install NumPy from the built wheel but use same version (it's a SciPy dependency)
RUN pip install numpy==$NUMPY_VERSION
RUN python -c "import numpy"
# Install build dependencies for the SciPy wheel
RUN pip install pybind11 pythran
# Extract the SciPy package and build the wheel
# RUN wget https://github.com/scipy/scipy/archive/refs/tags/v$SCIPY_VERSION.tar.gz -O scipy-$SCIPY_VERSION.tar.gz
RUN git clone --recursive https://github.com/scipy/scipy.git scipy-$SCIPY_VERSION && \
cd scipy-$SCIPY_VERSION && \
git checkout v$SCIPY_VERSION && \
git submodule update --init
RUN cd scipy-$SCIPY_VERSION && python setup.py bdist_wheel build_ext -j 4
ENV BUILT_SCIPY_WHEEL=scipy-$SCIPY_VERSION/dist/SciPy-*.whl
RUN ls $BUILT_SCIPY_WHEEL
# Install the wheels with pip
# (Note: previously this used --compile but now we already did the wheel compilation)
RUN pip install --no-compile --no-cache-dir \
-t /var/task/np_scipy_layer/python \
$BUILT_NUMPY_WHEEL \
$BUILT_SCIPY_WHEEL
RUN ls /var/task/np_scipy_layer/python
# Clean up the sdists and wheels
RUN rm numpy-$NUMPY_VERSION.tar.gz
RUN rm -r numpy-$NUMPY_VERSION scipy-$SCIPY_VERSION
# Uninstall non-built numpy after building the SciPy wheel
RUN pip uninstall numpy -y
RUN cp /var/task/libav/avprobe /var/task/np_scipy_layer/ \
&& cp /var/task/libav/avconv /var/task/np_scipy_layer/
RUN cp /usr/lib64/libblas.so.3.4.2 /var/task/np_scipy_layer/lib/libblas.so.3 \
&& cp /usr/lib64/libgfortran.so.4.0.0 /var/task/np_scipy_layer/lib/libgfortran.so.4 \
&& cp /usr/lib64/libgfortran.so.5.0.0 /var/task/np_scipy_layer/lib/libgfortran.so.5 \
&& cp /usr/lib64/liblapack.so.3.4.2 /var/task/np_scipy_layer/lib/liblapack.so.3 \
&& cp /usr/lib64/libquadmath.so.0.0.0 /var/task/np_scipy_layer/lib/libquadmath.so.0 \
&& cp /usr/lib64/libmagic.so.1.0.0 /var/task/np_scipy_layer/lib/libmagic.so.1 \
&& cp /usr/local/lib/libmp3lame*.so* /var/task/np_scipy_layer/lib \
&& cd /var/task/np_scipy_layer \
&& zip -j9 np_scipy_layer.zip /var/task/np_scipy_layer/avconv \
&& zip -j9 np_scipy_layer.zip /var/task/np_scipy_layer/avprobe \
&& zip -r9 np_scipy_layer.zip magic \
&& zip -r9 np_scipy_layer.zip python \
&& zip -r9 np_scipy_layer.zip lib