How to optimize a Node.js Docker image (Part 2)
Introduction
In the first part of the Node-in-Docker optimization series, we covered:
- Reducing the number of running processes
- Handling signals properly
- Making use of the build cache
- Using ENTRYPOINT
- Using EXPOSE to document exposed ports
This was the resulting Dockerfile we finished with:
```dockerfile
FROM node
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server", "./dist/"]
EXPOSE 8080
```
In this second part of the series, we will cover:
- Reducing the image size by:
  - Removing obsolete files
  - Using a lighter base image
- Using labels (`LABEL`)
- Adding semantics to labels
- Linting your Dockerfile
The code for this article can be found in the `docker/basic` branch.
Reducing the Docker image file size
If we take a look at our Node image now, we'll find that it's huge: 939 MB, to be exact.
```bash
docker images demo-frontend:expose
REPOSITORY      TAG      IMAGE ID       SIZE
demo-frontend   expose   9ffa262cf2ce   939MB
```
To deploy this image to a remote server, we must transfer at least 939 MB of data. Now, imagine a scenario where you need to roll back to a previous deployment in production, e.g. because of errors in the source code: if your image is large, there may be noticeable downtime before the image finishes transferring to the server and the rollback is complete.
Therefore, reducing the file size of our Docker image is important.
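By the way, the size reported by `docker images` is the uncompressed size; what actually travels over the wire are compressed layers. Here's a rough sketch, using standard tooling, of how to estimate the transfer size:

```bash
# Registries transfer compressed layers, so the gzipped tarball is a
# closer approximation of what a deployment actually moves over the network.
docker save demo-frontend:expose | gzip | wc -c
```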
Removing obsolete files
If we examine the contents of our container, we'll find many source code files that were required by the build process, but are not needed at runtime:
```bash
docker exec -it demo-frontend du -ahd1
16K     ./dist
36K     ./src
4.0K    ./webpack.config.js
55M     ./node_modules
15M     ./.npm
4.0K    ./package.json
164K    ./package-lock.json
70M     .
```
In fact, out of the files above, only `dist/` and `node_modules/` are needed. The rest of the files only increase the container size and can be removed without a second thought.
A naive approach would be to add an extra `RUN` instruction to remove these files:
```dockerfile
FROM node
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
RUN ["/bin/bash", "-c", "find . ! -name dist ! -name node_modules -maxdepth 1 -mindepth 1 -exec rm -rf {} \\;"]
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server", "./dist/"]
EXPOSE 8080
```
Whilst this does get rid of some files, it does not reduce the size of our Node image. This is because Docker images are built layer by layer: once a layer is added, it can never be removed from the image. Adding an extra `RUN` instruction therefore actually increases the image's size.
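You can see this layering for yourself with `docker history`, which lists every layer in an image along with its size. A quick sketch, assuming the Dockerfile above was built with a hypothetical `demo-frontend:naive` tag:

```bash
# Each instruction produces a layer with its own size; the cleanup RUN
# shows up as an additional layer while the earlier layers keep their size.
docker history demo-frontend:naive
```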
Another approach would be to combine build and cleanup steps into a single instruction:
```dockerfile
FROM node
WORKDIR /root/
COPY [".", "./"]
RUN ["/bin/sh", "-c", "npm install && npm run build && find . ! -name dist ! -name node_modules -maxdepth 1 -mindepth 1 -exec rm -rf {} \\;"]
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server", "./dist/"]
EXPOSE 8080
```
And again: while this does reduce the image size, it undoes all the good work we've done leveraging the build cache.
What should we do, then? Use multi-stage builds to remove obsolete files, whilst still taking advantage of the build cache, of course.
What is a multi-stage Dockerfile?
Multi-stage builds are a Dockerfile feature, introduced in Docker v17.05, that allows you to specify multiple images (stages) within the same Dockerfile. More importantly, you can copy build artifacts from one stage to another.
Therefore, inside our Dockerfile, we can have a builder stage, where we install development dependencies and build the application from the source code, splitting that process into multiple instructions to leverage the build cache. Then, we copy only the files required at runtime from the builder stage into the final image:
```dockerfile
FROM node as builder
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
RUN ["/bin/bash", "-c", "find . ! -name dist ! -name node_modules -maxdepth 1 -mindepth 1 -exec rm -rf {} \\;"]

FROM node
WORKDIR /root/
COPY --from=builder /root/ ./
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server", "./dist/"]
EXPOSE 8080
```
Note that the `COPY` instruction has been enriched with the `--from` option to signify that it should copy files from the `builder` stage instead of from the build context.
If we build our image again, we'll see that we've already shaved ~9 MB off the image. Not a huge saving per se, but it's a start!
```bash
docker build -t demo-frontend:multi-stage .
docker images
REPOSITORY      TAG           IMAGE ID       SIZE
demo-frontend   multi-stage   cf57206dc983   930MB
<none>          <none>        8874c0fec4c9   939MB
```
The `<none>:<none>` image is an intermediate builder stage image, which can be safely discarded, although doing so will also remove the cached layers.
Using a lighter base image
Even though we got rid of unnecessary build artifacts, 9 MB is not much relative to the size of the image. We can significantly reduce the size of the image by using a lighter base image.
At the moment, we are using the official `node` base image, which is itself 904 MB:
```bash
docker images node
REPOSITORY   TAG      IMAGE ID       SIZE
node         latest   a9c1445cbd52   904MB
```
This means that no matter how much we minimize our `demo-frontend` image, it will never get smaller than 904 MB. So why is the base image so large?
If we look inside the Dockerfile for the `node` base image, we'll find that it's based on the `buildpack-deps` image, which contains a large number of common Debian packages, including build tools, system libraries, and system utilities. We might need these utilities when building our `demo-frontend` image, but we won't need them to run our `node` process.
Fortunately, there's a variant of the image called `node:alpine`. It is based on Alpine Linux, a small distribution whose base image is only 5.53 MB:
```bash
docker images alpine
REPOSITORY   TAG      IMAGE ID       SIZE
alpine       latest   5cb3aa00f899   5.53MB
```
The `alpine` image doesn't include any build tools or libraries (it doesn't even have Bash!), allowing for a much smaller size than the `node:latest` image:
```bash
docker images node
REPOSITORY   TAG      IMAGE ID       SIZE
node         slim     e52c23bbdd87   148MB
node         latest   a9c1445cbd52   904MB
node         alpine   953c516e1466   76.1MB
```
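You can verify the missing-Bash claim yourself. A quick check, assuming you have `node:alpine` pulled locally:

```bash
# Prints nothing and exits non-zero: Bash is not installed in the image
docker run --rm node:alpine which bash

# BusyBox still provides a POSIX shell at /bin/sh
docker run --rm node:alpine which sh
```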
Therefore, the first step is to update our Dockerfile to use `node:alpine` for our final image. At the same time, we keep the full `node` image for our `builder` stage:
```dockerfile
FROM node as builder
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
RUN ["/bin/bash", "-c", "find . ! -name dist ! -name node_modules -maxdepth 1 -mindepth 1 -exec rm -rf {} \\;"]

FROM node:alpine
WORKDIR /root/
COPY --from=builder /root/ ./
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server", "./dist/"]
EXPOSE 8080
```
When you build the image again, you should notice a drastic decrease in size:
```bash
docker images demo-frontend:alpine
REPOSITORY      TAG      IMAGE ID       SIZE
demo-frontend   alpine   97373fdcb697   102MB
```
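Before moving on, it's worth a quick smoke test to confirm the Alpine-based image still serves our app. A minimal sketch, assuming port 8080 is free and the container name `demo` is unused:

```bash
# Start the container in the background, publishing the documented port
docker run --rm -d -p 8080:8080 --name demo demo-frontend:alpine

# Request the headers of the index page to confirm http-server responds
curl -I http://localhost:8080

# Stop the container (--rm removes it automatically afterwards)
docker stop demo
```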
Removing Intermediate Images
Multi-stage builds are a great feature, as they allow you to keep images small and use the build cache at the same time. But this also means a lot of intermediate images are going to be generated.
These intermediate images are a type of dangling image, i.e. images that do not have a name. Generally, you should keep these dangling images around, as they are the basis of the build cache. But having them littering your terminal can be annoying; and if you are maintaining a CI/CD server, you may also want to clean up dangling images regularly.
You can output a list of dangling images by passing the `--filter` flag to the standard listing command:
```bash
docker images --filter dangling=true
REPOSITORY   TAG      IMAGE ID       SIZE
<none>       <none>   8874c0fec4c9   939MB
```
To remove dangling images, run:
```bash
docker rmi $(docker images --filter dangling=true --quiet)
```
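As an aside, Docker 1.13 and later also provide a dedicated subcommand that does the same thing in one step:

```bash
# Removes all dangling images after a confirmation prompt;
# pass --force to skip the prompt (useful on CI/CD servers)
docker image prune
```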
However, this indiscriminately removes all dangling images. What if you want to remove only the images generated by a certain build? Enter labels!
Using the LABEL instruction
The `LABEL` instruction allows you to specify metadata in your image as key-value pairs. You can use labels to:
- document contact details of the author and/or maintainer of the image
- check the build date of the image
- add licensing information
In our case, we can use labels to mark an image as intermediate and as belonging to the `demo-frontend` build:
```dockerfile
FROM node as builder
LABEL name=demo-frontend
LABEL intermediate=true
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
RUN ["/bin/bash", "-c", "find . ! -name dist ! -name node_modules -maxdepth 1 -mindepth 1 -exec rm -rf {} \\;"]

FROM node:alpine
LABEL name=demo-frontend
WORKDIR /root/
COPY --from=builder /root/ ./
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server", "./dist/"]
EXPOSE 8080
```
Now, when we run `docker build`, the resulting images will already be labeled, and we can use those labels to filter the output of the listing command:
```bash
docker build -t demo-frontend:labels .
docker images --filter label=name=demo-frontend
REPOSITORY      TAG      IMAGE ID       SIZE
demo-frontend   labels   6965537afe54   102MB
<none>          <none>   0cbce2a3844b   939MB
```
It also allows us to remove the intermediate images of our `demo-frontend` build(s) by running:
```bash
docker rmi $(docker images --filter label=name=demo-frontend --filter label=intermediate=true --quiet)
```
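If you ever need to double-check which labels actually made it into an image, you can query its metadata with `docker inspect`:

```bash
# Print the labels baked into the image as a JSON object
docker inspect --format '{{ json .Config.Labels }}' demo-frontend:labels
```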
Adding semantics to labels
Above, we picked two strings, `name` and `intermediate`, as our label keys. This is fine for now, but what if the author of another Docker image decides to use the same keys? This is why Docker recommends that the keys of all `LABEL` instructions be namespaced with the reverse DNS notation of a domain that you own. This will help avoid clashes in label key names. Therefore, we should update our labels accordingly:
```dockerfile
FROM node as builder
LABEL works.buddy.name=demo-frontend
LABEL works.buddy.intermediate=true
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
RUN ["/bin/bash", "-c", "find . ! -name dist ! -name node_modules -maxdepth 1 -mindepth 1 -exec rm -rf {} \\;"]

FROM node:alpine
LABEL works.buddy.name=demo-frontend
WORKDIR /root/
COPY --from=builder /root/ ./
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server", "./dist/"]
EXPOSE 8080
```
Whilst namespacing prevents label keys from clashing, it lacks common semantics: how would a user know what `works.buddy.intermediate` means? Or whether `works.buddy.intermediate` conveys the same meaning as `com.acme.intermediate`?
In the past, Docker users and organizations came up with multiple conventions for imposing semantics on label key names, including:
- Label Schema, which uses a shared `org.label-schema` namespace
- Generic labels suggested by Project Atomic
However, both have been superseded by annotations defined in the OCI Image Format Specification.
This specification defines a set of pre-defined annotation keys, each prefixed with the `org.opencontainers.image.` namespace.
For example, the annotation specification states that the `org.opencontainers.image.title` label should be used to specify the "human-readable title of the image", and the `org.opencontainers.image.vendor` label for the "name of the distributing entity, organization or individual".
So let's update our Dockerfile with these standardized label keys wherever possible:
```dockerfile
FROM node as builder
LABEL org.opencontainers.image.title=demo-frontend
LABEL works.buddy.intermediate=true
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
RUN ["/bin/bash", "-c", "find . ! -name dist ! -name node_modules -maxdepth 1 -mindepth 1 -exec rm -rf {} \\;"]

FROM node:alpine
LABEL org.opencontainers.image.title=demo-frontend
LABEL org.opencontainers.image.vendor="Buddy Team"
WORKDIR /root/
COPY --from=builder /root/ ./
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server", "./dist/"]
EXPOSE 8080
```
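Some OCI keys, such as `org.opencontainers.image.created` (the date and time at which the image was built), change on every build, so they are better supplied at build time via `docker build`'s `--label` flag than hard-coded in the Dockerfile. A sketch (the `demo-frontend:labeled` tag here is just an example):

```bash
# Stamp the image with an RFC 3339 build date using a standard OCI key
docker build \
  --label "org.opencontainers.image.created=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  -t demo-frontend:labeled .
```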
Linting your Dockerfile
The last thing we will do in this article is to lint our Dockerfile. There are multiple tools for linting Dockerfiles, including Buddy's official Dockerfile Linter: https://github.com/buddy-works/dockerfile-linter
For the purposes of this guide, however, we'll use hadolint, with a brief mention of dockerfilelint at the end.
Hadolint
Hadolint parses the Dockerfile into an abstract syntax tree (AST), which is a structured object representing the contents of the Dockerfile. In concept, it's similar to how your browser parses HTML source code into the Document Object Model (DOM).
Hadolint then tests the AST against a list of rules to detect places where the Dockerfile does not follow best practices. Let's run it against our Dockerfile to see where we can improve.
The easiest way to run hadolint is via the `hadolint/hadolint` Docker image:
```bash
docker pull hadolint/hadolint
docker run --rm -i hadolint/hadolint < Dockerfile
/dev/stdin:1 DL3006 Always tag the version of an image explicitly
```
Notice that hadolint reports the `DL3006` error, which says that the first line of the Dockerfile (`/dev/stdin:1`) should use an explicitly tagged image.
So let's update our `FROM` instruction to give the base image the `latest` tag:
```dockerfile
FROM node:latest as builder
LABEL org.opencontainers.image.title=demo-frontend
...
```
Run the linter again. This time, it gives another error:
```bash
docker run --rm -i hadolint/hadolint < Dockerfile
/dev/stdin:1 DL3007 Using latest is prone to errors if the image will ever update. Pin the version explicitly to a release tag
```
This error informs us that we shouldn't use the `latest` tag, as `node:latest` can refer to different images over time. Instead, we should pick a more specific tag.
We could be as specific as possible and use a tag like `10.15.3-stretch`. However, I've found that the `lts` tag strikes the right balance, as it follows the latest Long Term Support (LTS) version of Node.js:
```dockerfile
FROM node:lts as builder
LABEL org.opencontainers.image.title=demo-frontend
...
```
Now, when we run hadolint again, it no longer reports any errors.
Hadolint's rule codes fall into two groups:
- Rules that begin with `DL` indicate errors in the Dockerfile syntax
- Rules that begin with `SC` indicate errors in the shell script(s) within the Dockerfile. These are picked up by another tool called ShellCheck, which performs static analysis on your shell scripts.
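Not every rule fits every project. If you've consciously decided to break one, hadolint can suppress it with the `--ignore` flag; below we ignore `DL3018` (Alpine's rule about pinning `apk` package versions) purely as an example:

```bash
# Run hadolint from its Docker image with an explicit rule exclusion
docker run --rm -i hadolint/hadolint hadolint --ignore DL3018 - < Dockerfile
```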
Using a Second Linter
Linting your Dockerfile ensures you are following best practices, but you don't have to limit yourself to a single linter! For instance, you can also use the `dockerfilelint` npm package alongside hadolint.
Running `dockerfilelint` against our pre-linted Dockerfile yields a similar result, although `dockerfilelint` outputs in a CLI format by default, which might be better for everyday use:
```bash
dockerfilelint Dockerfile

File:   Dockerfile
Issues: 1

Line 1: FROM node as builder
Issue  Category   Title                    Description
1      Clarity    Base Image Missing Tag   Base images should specify a tag to use.
```
`dockerfilelint` can also output JSON, which may be advantageous for programmatic use:
```bash
dockerfilelint Dockerfile -o json | jq .files[0].issues
[
  {
    "line": "1",
    "content": "FROM node as builder",
    "category": "Clarity",
    "title": "Base Image Missing Tag",
    "description": "Base images should specify a tag to use."
  }
]
```
Once the issues are fixed, this is the output from `dockerfilelint`:
```bash
dockerfilelint Dockerfile

File:   Dockerfile
Issues: None found 👍
```
Using multiple linters has the advantage of catching errors that a single linter might miss. To finish up, let's build our image using the double-linted Dockerfile!
```bash
docker build -t demo-frontend:oci-annotations .
```
Next Step: Security
Although this article does not focus on image security, we've already improved it by moving our image from `node` to `node:alpine`. This is because every library and tool in the container has the potential to be exploited in an attack; by reducing their number, we reduce the potential attack surface. The same principle applies to reducing the number of running processes in our container.
However, there's a lot more we can do, and for this I invite you to the last article in my series.