Docker Image Size - Does It Matter?
Does Docker image size matter? The answer I usually hear is "yes". The logical question that follows is "why?". I've heard the following two answers too often for my liking:
- A smaller image takes less disk space.
- A large image is difficult to upload. It takes a long time.
Although both these statements sound like they make sense, neither is generally true.
There are some other legitimate reasons why it's desirable to have smaller Docker images, like reducing attack surface. But in this article, I want to address the urban legends mentioned above.
Smaller Docker images take up less disk space
A Docker image is composed of reusable layers, i.e. different images might share some layers. Because of such architecture, image size and disk consumption are not necessarily directly correlated.
If all layers in an image are used only in that image, then yes, image size represents the amount of disk space occupied by that image. However, this is an extremely uncommon situation. In all other cases, the math is not that simple. So, let's take a look at the numbers.
We'll go over two examples. In both of them, we'll observe a 1 GB image with 2 layers, one taking 20 MB, and the other one 980 MB. The number of images and their sizes are not representative. They've been chosen to make the point obvious.
If our image is the only one using these two layers, then yes, that image occupies 1 GB of disk space. But what happens if there are 10 similar images - images from the same repo with different tags? Well, it depends...
In the first example, all these images differ only in the smaller layer (the 20 MB one) and share the larger (980 MB) one. The total disk consumption is: 980 + 10 × 20 = 980 + 200 = 1180 MB. This is 118 MB per image, or 11.8% of the image size reported by the docker images command.
In the second example, all these images differ only in the larger layer (the 980 MB one) and share the smaller layer (the 20 MB one). The total disk consumption is: 20 + 10 × 980 = 20 + 9800 = 9820 MB. This is 982 MB per image, or 98.2% of the image size.
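Both examples follow the same pattern, which can be written as a general formula. This is only a sketch: the symbols n (number of images), S (total size of the layers shared by all of them) and U (size of the layers unique to each image) are introduced here for illustration and are not Docker terminology.

```latex
% Total and per-image disk usage for n images that share layers of size S
% and each add unique layers of size U (all symbols are illustrative).
\[
D_{\text{total}} = S + n \cdot U,
\qquad
D_{\text{per image}} = \frac{D_{\text{total}}}{n} = \frac{S}{n} + U
\]

% The two examples from the text, with n = 10:
\[
\text{Example 1: } 980 + 10 \cdot 20 = 1180\ \text{MB},
\qquad
\text{Example 2: } 20 + 10 \cdot 980 = 9820\ \text{MB}
\]
```

As n grows, the S/n term shrinks and the per-image cost approaches U, the size of the layers that differ between images.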
In real-life situations, the 'shared' layer from the examples represents the image layers that seldom change (the base image plus the immutable layers created at the top of the Dockerfile). The layer that's different in each image in the examples above represents the frequently changing layers - the ones created at the bottom of the Dockerfile, which change in each build.
Here are our takeaways from these examples regarding disk usage:
- Total image size does not matter.
- Base image size does not matter. Thus, reducing base image size in an attempt to reduce total disk consumption is meaningless (except in some borderline cases when the disk is unrealistically small).
- What really matters when it comes to disk usage is the size of frequently changing layers.
As a side note: reducing base image size usually comes at a price - the smaller the base image, the less functionality it provides.
A large Docker image is difficult (takes a lot of time) to upload
When a Docker image is uploaded or downloaded (pushed or pulled), each image layer is transferred separately. The layers that already exist at the destination are not transferred at all.
So, the amount of data that needs to be transferred is not directly correlated with the overall image size or the size of any image layer. The transferred data size is equal to the sum of the sizes of the layers that do not exist at the destination, i.e. the size of the layers that have been changed.
In plain English: the first transfer takes the hit. The rest of them transfer only new (frequently changing) layers. And their size depends on the image structure.
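Applied to the 1 GB image from the disk usage examples, the rule works out as follows. This is a sketch that assumes the destination starts empty and keeps every layer it has already received. The symbols T (transferred size), L_image (layers of the image) and L_dest (layers already at the destination) are introduced here for illustration.

```latex
% Transferred size: sum over the layers missing at the destination.
\[
T = \sum_{\ell \,\in\, L_{\text{image}} \setminus L_{\text{dest}}} \operatorname{size}(\ell)
\]

% First transfer: neither layer exists at the destination yet.
\[
T_{\text{first}} = 20 + 980 = 1000\ \text{MB}
\]

% Later transfers: only the layer that changed is missing.
\[
T_{\text{later}} =
\begin{cases}
20\ \text{MB}  & \text{if only the 20 MB layer changed} \\
980\ \text{MB} & \text{if only the 980 MB layer changed}
\end{cases}
\]
```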
A somewhat exceptional situation is a public CI/CD service. In such setups, projects usually do not have dedicated nodes, so it's often the case that each build/test run is the first run on that (probably virtual) node. That is, the first Docker transfer in each run is the first transfer on the node. Because of that, the entire image has to be transferred each time. Fortunately, even gigabyte-sized downloads do not add significant overhead to the usually lengthy build/test process. Also, the overall speed of the Docker pull-build-push cycle can be further improved by using layer caching.
Easily manageable images
People like easily manageable images, which usually means images that are fast to upload/download and images that do not occupy too much disk space.
In order to create such an image, we shouldn't strive to reduce the base or overall image size, but to properly design the Dockerfile. As shown above, what matters is the size of the frequently changing layers. And that's what should be minimized.
In order to minimize these frequently changing layers, files should be grouped into layers based on their modification frequency. The layers that are changed most frequently should be created last. For example, it's not optimal to have a Dockerfile that looks like this:
FROM whatever:latest
COPY . ./
...
The images created from such a Dockerfile will have all the layers above the base image replaced on each rebuild, and will be 'heavy'. Instead, layers containing dependencies should be added before the layers with project code, since the latter change much more often.
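A minimal sketch of that ordering is shown below. It assumes a hypothetical Python project whose dependencies are listed in requirements.txt. The file name, the pip command and the /app path are illustrative assumptions and not part of the original example, and the whatever:latest base image is assumed to provide pip.

```dockerfile
# Base image: changes rarely and is shared by every build.
FROM whatever:latest

# Hypothetical working directory for the application.
WORKDIR /app

# Copy only the dependency manifest first. This layer and the install
# step below are rebuilt only when requirements.txt itself changes.
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

# Project code changes on almost every build, so it is copied last.
COPY . ./
```

With this layout, and as long as the build cache is available, a change that touches only project code replaces just the final COPY layer. The dependency layers stay identical across builds, so they are neither stored again nor transferred again.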
Conclusion: why Docker image size matters
Having a smaller Docker image is generally desirable; everybody agrees on that. But it's desirable for reasons like reducing attack surface, not because of disk usage or upload/download time.
If images are properly constructed, the one-time hit incurred by a big Docker image is amortized over the many uploads/downloads that follow, provided that the frequently changing layers are small.
If you'd like to continuously deliver your applications made with Docker, try Semaphore's platform with full layer caching for tagged Docker images.