Optimizing Dockerfile for Node.js (Part 2)

November 4, 2020

In the first part of this article, we covered:

  • Reducing the number of running processes
  • Handling signals properly
  • Making use of the build cache
  • Using ENTRYPOINT
  • Using EXPOSE to document exposed ports

This was the resulting Dockerfile we finished with from Part 1:

FROM node
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server" , "./dist/"]
EXPOSE 8080

In the second part of this article, we will cover:

  • Reducing the Docker Image file size by:
    • Removing Obsolete Files
    • Using a lighter base image
  • Using labels (LABEL)
  • Adding Semantics to Labels
  • Linting your Dockerfile

Reducing the Docker Image file size

If we take a look at our image now, you'll find that it's huge (939MB to be exact).

$ docker images demo-frontend:expose
REPOSITORY     TAG     IMAGE ID      SIZE
demo-frontend  expose  9ffa262cf2ce  939MB

For us to deploy this image to a remote server and run it, at least 939MB must be transferred over. Imagine a scenario where you need to roll back to a previous deployment in production; if your Docker image is large, there may be noticeable downtime before the image finishes transferring to the servers and the rollback completes. Therefore, reducing the file size of our Docker image is important.

Removing Obsolete Files

If we examine the contents of our container, we will find many files that were required for the build process but are not needed at runtime.

$ docker exec -it demo-frontend du -ahd1
16K    ./dist
36K    ./src
4.0K   ./webpack.config.js
55M    ./node_modules
15M    ./.npm
4.0K   ./package.json
164K   ./package-lock.json
70M    .

In fact, out of the files above, only dist/ and node_modules/ are needed. We should remove the rest.

A naive approach would be to add an extra RUN instruction to remove these files.

FROM node
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
RUN ["/bin/bash", "-c", "find . ! -name dist ! -name node_modules -maxdepth 1 -mindepth 1 -exec rm -rf {} \\;"]
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server" , "./dist/"]
EXPOSE 8080

Whilst this does get rid of the files, it will not reduce the file size of our image. This is because Docker images are built layer by layer; once a layer is added, it cannot be removed from the image. Adding an additional RUN instruction actually increases the image's file size.
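You can verify this with docker history, which lists every layer in an image along with its size: the layer created by npm install keeps its full size even after the cleanup layer is added on top. (The naive tag below is hypothetical.)

$ docker history --format "{{.Size}}\t{{.CreatedBy}}" demo-frontend:naive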

Another approach would be to combine the build and cleanup steps into a single instruction.

FROM node
WORKDIR /root/
COPY [".", "./"]
RUN ["/bin/sh", "-c", "npm install && npm run build && find . ! -name dist ! -name node_modules -maxdepth 1 -mindepth 1 -exec rm -rf {} \\;"]
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server" , "./dist/"]
EXPOSE 8080

Whilst this does reduce the image size, it undoes all the good work we've done leveraging the build cache: because the entire build context is copied in a single COPY instruction, any change to our source code invalidates that layer and forces npm install to run again on every build.

Instead, we can use multi-stage builds to remove obsolete files, whilst still taking advantage of the build cache.

Using Multi-stage Builds

Multi-stage builds are a Dockerfile feature introduced in Docker v17.05 that allows you to specify multiple images (stages) within the same Dockerfile. More importantly, you are able to COPY build artifacts from one stage into another.

Therefore, inside our Dockerfile, we can have a builder stage, where we install dependencies and build our application, splitting that process into multiple instructions to leverage the build cache. Then, we copy only what is needed to run the image from the builder stage to the final image.

FROM node as builder
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
RUN ["/bin/bash", "-c", "find . ! -name dist ! -name node_modules -maxdepth 1 -mindepth 1 -exec rm -rf {} \\;"]

FROM node
WORKDIR /root/
COPY --from=builder /root/ ./
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server" , "./dist/"]
EXPOSE 8080

Note that we specified a --from option to COPY to signify that it should copy from the builder stage, and not from the build context.

Using multi-stage builds allows us to leverage the build cache, whilst keeping our final image size small.
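A handy side benefit of naming our stages: we can build just the builder stage by passing the --target option to docker build, which is useful for debugging. (The builder-only tag below is just an example name.)

$ docker build --target builder -t demo-frontend:builder-only .
$ docker run --rm demo-frontend:builder-only ls dist/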

If we build our image again, you'll see that we've saved ~9MB from the image.

$ docker build -t demo-frontend:multi-stage .
$ docker images
REPOSITORY     TAG          IMAGE ID      SIZE
demo-frontend  multi-stage  cf57206dc983  930MB
<none>         <none>       8874c0fec4c9  939MB

The <none>:<none> image is the intermediate builder stage image, which can be safely discarded, although doing so will also remove the cached layers.

We will outline a way to easily clean up intermediate images later in this article.

Using a lighter base image

Even though we've gotten rid of unnecessary build artifacts, 9MB is not a lot relative to the size of the image. We can reduce the size of the image more significantly by using a lighter base image.

At the moment, we are using the node base image, which is, itself, 904MB.

$ docker images node
REPOSITORY  TAG     IMAGE ID      SIZE
node        latest  a9c1445cbd52  904MB

This means no matter how much we minimize our demo-frontend image, it will never get smaller than 904MB. So why is it so large?

If we look inside the Dockerfile for the node base image, we'll find that it's based on the buildpack-deps image, which contains a large number of common Debian packages, including build tools, system libraries, and system utilities. We might need these utilities when building our demo-frontend image, but we won't need them to run our node process.

Fortunately, there's a variant of the node image called node:alpine. The node:alpine image is based on the alpine (Alpine Linux) image, which is a much smaller base image (5.53MB).

$ docker images alpine
REPOSITORY  TAG     IMAGE ID      SIZE
alpine      latest  5cb3aa00f899  5.53MB

The alpine image doesn't include any build tools or libraries (it doesn't even have Bash!), which is why node:alpine is so much smaller than node:latest.

$ docker images node
REPOSITORY  TAG     IMAGE ID      SIZE
node        slim    e52c23bbdd87  148MB
node        latest  a9c1445cbd52  904MB
node        alpine  953c516e1466  76.1MB

Therefore, we should update our Dockerfile to use node:alpine instead of node for our final image (but keep using node for our builder stage).

FROM node as builder
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
RUN ["/bin/bash", "-c", "find . ! -name dist ! -name node_modules -maxdepth 1 -mindepth 1 -exec rm -rf {} \\;"]

FROM node:alpine
WORKDIR /root/
COPY --from=builder /root/ ./
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server" , "./dist/"]
EXPOSE 8080
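One caveat when switching the final stage to node:alpine: because Bash is absent, any RUN instructions you add to this stage should invoke /bin/sh instead of /bin/bash (or install Bash first using apk, Alpine's package manager). A purely illustrative example:

FROM node:alpine
RUN ["/bin/sh", "-c", "apk add --no-cache curl"]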

When we build our image again, you should notice that the size of the image has decreased drastically!

$ docker images demo-frontend:alpine
REPOSITORY     TAG     IMAGE ID      SIZE
demo-frontend  alpine  97373fdcb697  102MB

Removing Intermediate Images

Multi-stage builds are a great feature, as they allow you to keep images small whilst making use of the build cache. But this also means a lot of intermediate images are going to be generated.

These intermediate images are a type of dangling image - images that do not have a name. Generally, you should keep these dangling images, as they form the basis of the build cache. But having them littered across your docker images output can be annoying; and if you are maintaining a CI/CD server, you may want to clean up dangling images regularly.

You can output a list of dangling images by using the --filter flag of docker images.

$ docker images --filter dangling=true
REPOSITORY  TAG     IMAGE ID      SIZE
<none>      <none>  8874c0fec4c9  939MB

And you can remove them by running docker rmi $(docker images --filter dangling=true --quiet).
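Docker also provides a dedicated command for this - docker image prune - which removes all dangling images in one step (pass --force to skip the confirmation prompt).

$ docker image prune

However, both commands indiscriminately remove all dangling images. What if you just want to remove the dangling images generated from a certain build? Enter labels!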

Using labels (LABEL)

The LABEL instruction allows you to attach metadata (as key-value pairs) to your image. You can use labels to:

  • document contact details of the author and/or maintainer of the image (this replaces the deprecated MAINTAINER instruction)
  • record the build date of the image
  • add licensing information

In our case, we can use labels to mark an image as intermediate and belonging to the demo-frontend build.

FROM node as builder
LABEL name=demo-frontend
LABEL intermediate=true
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
RUN ["/bin/bash", "-c", "find . ! -name dist ! -name node_modules -maxdepth 1 -mindepth 1 -exec rm -rf {} \\;"]

FROM node:alpine
LABEL name=demo-frontend
WORKDIR /root/
COPY --from=builder /root/ ./
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server" , "./dist/"]
EXPOSE 8080

Now, when we build our image, it will be labelled, and we can filter the output of docker images using the labels.

$ docker build -t demo-frontend:labels .
$ docker images --filter label=name=demo-frontend
REPOSITORY     TAG     IMAGE ID      SIZE
demo-frontend  labels  6965537afe54  102MB
<none>         <none>  0cbce2a3844b  939MB

It also allows us to remove the intermediate image of our demo-frontend build(s) by running docker rmi $(docker images --filter label=name=demo-frontend --filter label=intermediate=true --quiet).
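To double-check which labels ended up on an image, you can use docker inspect; and because docker image prune accepts the same label filters, the cleanup can be written more succinctly. A sketch:

$ docker inspect --format '{{json .Config.Labels}}' demo-frontend:labels
$ docker image prune --force --filter label=name=demo-frontend --filter label=intermediate=true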

Adding Semantics to Labels

Above, we picked two arbitrary strings - name and intermediate - as our label keys. This is fine for now, but what if the author of another Docker image decides to use the same keys? This is why Docker recommends namespacing label keys with the reverse DNS notation of a domain you own; this avoids clashes in label key names. Therefore, we should update our labels accordingly.

FROM node as builder
LABEL works.buddy.name=demo-frontend
LABEL works.buddy.intermediate=true
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
RUN ["/bin/bash", "-c", "find . ! -name dist ! -name node_modules -maxdepth 1 -mindepth 1 -exec rm -rf {} \\;"]

FROM node:alpine
LABEL works.buddy.name=demo-frontend
WORKDIR /root/
COPY --from=builder /root/ ./
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server" , "./dist/"]
EXPOSE 8080

Whilst namespacing prevents label keys from clashing, it lacks common semantics - how would a user know what works.buddy.intermediate means? Or whether works.buddy.intermediate conveys the same meaning as com.acme.intermediate?

In the past, Docker users and organizations came up with multiple conventions for imposing semantics on label key names, most notably the Label Schema convention. However, these have been superseded by annotations defined in the Open Container Initiative (OCI) Image Format Specification. This specification defines multiple pre-defined annotation keys, each prefixed with the org.opencontainers.image. namespace.

For example, the annotations specification specifies that the org.opencontainers.image.title label be used to specify the "human-readable title of the image", and the org.opencontainers.image.vendor label be used for the "name of the distributing entity, organization or individual".

So let's update the label keys in our Dockerfile with these standardized label keys wherever possible.

FROM node as builder
LABEL org.opencontainers.image.title=demo-frontend
LABEL works.buddy.intermediate=true
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
RUN ["/bin/bash", "-c", "find . ! -name dist ! -name node_modules -maxdepth 1 -mindepth 1 -exec rm -rf {} \\;"]

FROM node:alpine
LABEL org.opencontainers.image.title=demo-frontend
LABEL org.opencontainers.image.vendor="Buddy Team"
WORKDIR /root/
COPY --from=builder /root/ ./
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server" , "./dist/"]
EXPOSE 8080
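Other standardized keys can be populated at build time. For example, org.opencontainers.image.created holds the date and time at which the image was built; a sketch using a build argument (the BUILD_DATE name is our own choice, not part of the specification) would add the following to the final stage:

ARG BUILD_DATE
LABEL org.opencontainers.image.created=$BUILD_DATE

and pass in the value when building the image:

$ docker build --build-arg BUILD_DATE="$(date -u +%Y-%m-%dT%H:%M:%SZ)" .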

Linting your Dockerfile

The last thing we will do in this article is lint our Dockerfile. There are multiple tools available for linting Dockerfiles; in this article, we will use hadolint, with a brief mention of dockerfilelint at the end.

Hadolint

Hadolint parses the Dockerfile into an abstract syntax tree (AST), which is a structured object representing the contents of the Dockerfile. It is similar in concept to how your browser parses HTML source code into the Document Object Model (DOM).

Hadolint then tests the AST against a list of rules to detect places where the Dockerfile does not follow best practices. Let's run it against our Dockerfile to see where we can improve.

The easiest way to run hadolint is by running the hadolint/hadolint image using Docker.

$ docker pull hadolint/hadolint
$ docker run --rm -i hadolint/hadolint < Dockerfile
/dev/stdin:1 DL3006 Always tag the version of an image explicitly

Hadolint displayed the DL3006 error, which says that the first line (/dev/stdin:1) of the Dockerfile should use a tagged image. So let's update our FROM instruction to give our node base image the latest tag.

FROM node:latest as builder
LABEL org.opencontainers.image.title=demo-frontend
...

We can run hadolint again; this time, it gives another error.

$ docker run --rm -i hadolint/hadolint < Dockerfile
/dev/stdin:1 DL3007 Using latest is prone to errors if the image will ever update. Pin the version explicitly to a release tag

The DL3007 error informs us that we shouldn't use the latest tag, as node:latest can reference different images over time. Instead, we should pick a more specific tag. We could be as specific as possible and use a tag like 10.15.3-stretch. However, I've found using the lts tag is often sufficient, as it follows the latest Long Term Support (LTS) version of Node.js.

FROM node:lts as builder
LABEL org.opencontainers.image.title=demo-frontend
...

Now, when we run hadolint again, it doesn't generate any errors anymore!
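If you need even stronger guarantees of reproducibility, Docker also lets you pin a base image by digest, which identifies one exact image rather than a movable tag. (The digest below is a placeholder, not a real value.)

FROM node:lts@sha256:<digest> as builder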

In general, when using hadolint, there are two types of rules:

  • Rules beginning with DL indicate issues with the Dockerfile instructions themselves
  • Rules beginning with SC indicate issues in the shell commands you specified within the Dockerfile. These are picked up by an embedded tool called ShellCheck, which performs static analysis on your shell scripts (see the example below).
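For example, a RUN instruction that expands a variable without quoting it would be flagged with SC2086 ("Double quote to prevent globbing and word splitting"). The following snippet is purely illustrative:

FROM node:lts
RUN mkdir -p $HOME/app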

Using a Second Linter

Linting your Dockerfile ensures you are following best practices; but you don't have to limit yourself to a single linter! For instance, you can also use the dockerfilelint npm package alongside hadolint.

Using dockerfilelint with our pre-linted Dockerfile yields a similar result, although dockerfilelint outputs in a human-readable CLI format by default, which might be better for everyday use.

$ dockerfilelint Dockerfile 

File:   Dockerfile
Issues: 1

Line 1: FROM node as builder
Issue  Category  Title               Description
    1  Clarity   Base Image Missing  Base images should specify a tag to use.
                 Tag

dockerfilelint can also output JSON, which may be advantageous for programmatic use.

$ dockerfilelint Dockerfile -o json | jq .files[0].issues
[
  {
    "line": "1",
    "content": "FROM node as builder",
    "category": "Clarity",
    "title": "Base Image Missing Tag",
    "description": "Base images should specify a tag to use."
  }
]

When the issues are fixed, this is the output from dockerfilelint.

$ dockerfilelint Dockerfile

File:   Dockerfile
Issues: None found 👍

Using multiple linters has the advantage of discovering errors that a single linter might miss. For instance, you could wire both linters into a single pre-build check, as in the sketch below (it assumes Docker and Node.js are installed, and uses npx to run dockerfilelint without installing it globally):
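#!/bin/sh
set -e
# Lint with hadolint, running it as a Docker container
docker run --rm -i hadolint/hadolint < Dockerfile
# Lint with dockerfilelint, fetched from npm
npx dockerfilelint Dockerfile

To finish up, let's build our image using the (double-)linted Dockerfile!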

$ docker build -t demo-frontend:oci-annotations .

Next Steps

In this article, we have only covered the basics. If you'd like to learn more, I'd recommend you watch a talk I gave at the London Node User Group (LNUG) back in October 2018, titled Dockerizing JavaScript Applications.

An important aspect we haven't covered is security.

Unbeknownst to you, we've already made strides in securing our Docker image! Moving from the node image to node:alpine already improved the security of the image.

This is because everything inside a container has the potential to be exploited in an attack. By reducing the number of libraries and tools, we reduce the potential attack surface. The same principle applies when we reduced the number of running processes in our container.

However, there is a lot more we can, and should, do to secure our image. So stay tuned for our next article - Securing our Docker image - which builds on top of this one.


Daniel Li

Staff Software Engineer @ Zinc Work

Daniel Li is a DevOps Engineer and Fullstack Node.js Developer, working with AWS, Ansible, Terraform, Docker, Kubernetes, and Node.js. He is the author of the book Building Enterprise JavaScript Applications, published by Packt.
